VIGNESH K NOTES
DS4015 BIG DATA ANALYTICS
UNIT I INTRODUCTION TO BIG DATA 9

Introduction to Big Data Platform – Challenges of Conventional Systems - Intelligent data analysis
–Nature of Data - Analytic Processes and Tools - Analysis Vs Reporting - Modern Data Analytic
Tools- Statistical Concepts: Sampling Distributions - Re-Sampling - Statistical Inference - Prediction
Error.

Introduction to Big Data Platform

What is Big Data?

Big data describes the large volumes of data, both structured and unstructured, that a system or business generates and accumulates day by day. However, it is not the quantity of data that is essential; what matters is what a firm or organization can do with that data.

Big data can be analysed for insights and predictions, which can lead to better decisions and more reliable business strategies.

What Are the 3Vs of Big Data?


This concept gained traction in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data around three pillars, the 3Vs:

 Volume: Organizations and firms gather and pull together data from many different sources, including business transactions, social media, login data, sensor information, and machine-to-machine data. Storing this data used to be an issue, but with the advent of new technologies for handling extensive data, such as Apache Hadoop and Apache Spark, the burden of storing enormous data volumes has decreased.

 Velocity: Data now streams in at exceptional speed and has to be dealt with in a timely manner. Sensors, smart metering, user data, and RFID tags are driving the need to handle torrents of data in near real time.

 Variety: Data released from various systems comes in diverse types and formats. It ranges from structured to unstructured: numeric data in traditional databases, text documents, emails, audio and video, stock ticker data, login data, blockchain-encrypted data, and even financial transactions.

Importance of Big Data

The importance of big data lies not in how much data there is, but in how it is used. Data can be taken from various sources and analysed to find answers that enable:

 Reduction in cost.

 Time reductions.
 New product development with optimized offers.
 Well-informed decision making.

When you merge big data with high-powered data analytics, it is possible to achieve business-related tasks like:

 Real-time determination of the root causes of failures, problems, or faults.
 Generation of tokens and coupons based on a customer's buying behavior.
 Risk management in minutes by calculating risk portfolios.
 Detection of fraudulent behavior before it causes damage.

Things That Come Under Big Data (Examples of Big Data)

As you know, big data is, in essence, the collective management of the different forms of data generated by various devices (Android, iOS, etc.), applications (music apps, web apps, game apps, etc.), and actions (searching through a search engine, navigating through similar types of web pages, etc.). Here is a list of some commonly found fields of data that come under the umbrella of big data:

 Black Box Data: Data collected from the black boxes of private and government helicopters, airplanes, and jets. It includes flight crew voice recordings and separate recordings from microphones and earphones, etc.


 Stock Exchange Data: Stock exchange data holds information about the 'buy' and 'sell' decisions made on the shares of different companies by customers.


 Social Media Data: Information about social media activity, including posts submitted by millions of people worldwide.




Transport Data: Transport data includes vehicle models, capacity, distance from source to destination, and the availability of different vehicles.


Search Engine Data: Search engines retrieve a wide variety of unprocessed information that is stored in their databases.

There are various other types of data that get generated in bulk from applications and organizations.

Types of Big Data (Types of Data Handled by Big Data)



Data generated in bulk and at high velocity can be categorized as:

1. Structured Data: Relational data with a fixed schema, e.g. tables in a relational database.


2. Semi-structured Data: Data with some organizing structure but no rigid schema, for example XML and JSON (see the brief sketch below).
3. Unstructured Data: Data of varied formats: document files, multimedia files, images, backup files, etc.
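Purely as an illustrative sketch (the record and its values are invented, not taken from these notes), the same piece of customer information might appear in each of the three forms as follows:

```python
import json

# Structured: a fixed schema, like one row of a relational table
structured_row = ("C101", "Asha", 29)          # (customer_id, name, age)

# Semi-structured: self-describing key-value data such as JSON (or XML)
semi_structured = json.loads('{"customer_id": "C101", "name": "Asha", "orders": [12, 47]}')

# Unstructured: free text, images, audio, video, backup files, etc.
unstructured = "Asha wrote: 'Delivery was quick, but the packaging could be better.'"

print(structured_row)
print(semi_structured["orders"])
print(unstructured)
```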

Big Data Technologies

Big data technology is significant because it enables more precise analysis, which leads to more accurate decision-making, greater operational efficiency, reduced costs, and lower business risk. To implement such analytics and hold such a wide variety of data, one needs an infrastructure that can manage and process huge data volumes in real time. Broadly, big data technologies fall into two subcategories:

 Operational Big Data: systems such as MongoDB, Apache Cassandra, or CouchDB, which provide real-time, operational capabilities for large data workloads.

 Analytical Big Data: systems such as MapReduce, BigQuery, Apache Spark, or Massively Parallel Processing (MPP) databases, which provide the analytical capability to run complex analyses on large datasets.

Challenges of Big Data

 Rapid Data Growth: Data grows at such a high velocity that it becomes difficult to extract insights from it; there is no 100% efficient way to filter out only the relevant data.
 Storage: Generating such a massive amount of data requires storage space, and organizations struggle to handle such extensive data without suitable tools and technologies.
 Unreliable Data: It cannot be guaranteed that the big data collected and analysed is totally (100%) accurate. Redundant data, contradicting data, or incomplete data are challenges that remain within it.
 Data Security: Firms and organizations storing such massive (user) data can be a target of cybercriminals, and there is a risk of the data being stolen. Hence, encrypting such colossal data is also a challenge for firms and organizations.

Challenges of Conventional Systems in Big Data



 Big data has revolutionized the way businesses operate, but it has also presented a number of challenges for conventional systems. Here are some of the challenges faced by conventional systems in handling big data:
 Big data is a term used to describe the large amounts of data that can be stored and analyzed by computers. Big data is widely used in business, science, and government. It has been around for several years now, but only recently have people started realizing how important it is for businesses to use this technology in order to improve their operations and provide better services to customers. A lot of companies have already started using big data analytics tools because they realize how much potential there is in utilizing these systems effectively.
 However, while there are many benefits associated with using such systems, including faster processing times as well as increased accuracy, there are also some challenges involved with implementing them correctly.

Challenges of Conventional Systems in Big Data

 Scalability
 Speed

 Storage
 Data Integration
 Security

Scalability

 A common problem with conventional systems is that they can't scale. As the amount of data increases, so does the time it takes to process and store it. This can cause bottlenecks and system crashes, which are not ideal for businesses looking to make quick decisions based on their data.
 Conventional systems also lack flexibility in how they handle new types of information; for example, you cannot add another column (columns are like fields) or row (rows are like records) without having to rewrite all your code from scratch.

Speed

Speed is a critical component of any data processing system. It is important because it allows you to:

 Process and analyze your data faster, which means you can make better-informed decisions about how to proceed with your business.


 Make more accurate predictions about future events based on past performance.

Storage

The amount of data being created and stored is growing exponentially, with estimates that it
will reach 44 zettabytes by 2020. That's a lot of storage space!

The problem with conventional systems is that they don't scale well as you add more data.
This leads to huge amounts of wasted storage space and lost information due to corruption or
security breaches.

Data Integration

 The challenges of conventional systems in big data are numerous. Data integration is
one of the biggest challenges, as it requires a lot of time and effort to combine different
sources into a single database. This is especially true when you're trying to integrate
data from multiple sources with different schemas and formats.
 Another challenge is errors and inaccuracies in analysis due to a lack of understanding of what exactly happened during an event or transaction. For example, if there was an error while transferring money from one bank account to another, there would be no way for us to know what actually happened unless someone tells us about it later on (which may not happen).

Security

 Security is a major challenge for enterprises that depend on conventional systems to process and store their data. Traditional databases are designed to be accessed by trusted users within an organization, but this makes it difficult to ensure that only authorized people have access to sensitive information.
 Security measures such as firewalls, passwords, and encryption help protect against unauthorized access and attacks by hackers who want to steal data or disrupt operations. But these security measures have limitations: they're expensive; they require constant monitoring and maintenance; they can slow down performance if implemented too extensively; and they often don't prevent breaches altogether, because there's always some way around them (such as through phishing emails).


Conventional systems are not equipped for big data. They were designed for a different era, when the volume of information was much smaller and more manageable. Now that we're dealing with huge amounts of data, conventional systems are struggling to keep up. Conventional systems are also expensive and time-consuming to maintain; they require constant maintenance and upgrades in order to meet new demands from users who want faster access speeds and more features than ever before.

Intelligent Data Analysis Definition


Intelligent Data Analysis (IDA) is an interdisciplinary study that is concerned with the extraction of useful
knowledge from data, drawing techniques from a variety of fields, such as artificial intelligence, high-
performance computing, pattern recognition, and statistics. Data intelligence platforms and data intelligence
solutions are available from data intelligence companies such as Data Visualization Intelligence, Strategic Data
Intelligence, Global Data Intelligence.


What is Intelligent Data Analysis?
Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and
reasoning methods to extract useful knowledge from data. This data analytics intelligence process generally
consists of the data preparation stage, the data mining stage, and the result validation and explanation stage.

Data preparation involves the integration of required data into a dataset that will be used for data mining; data

mining involves examining large databases in order to generate new information; result validation involves the
verification of patterns produced by data mining algorithms; and result explanation involves the intuitive

communication of results.

The Nature of Data
That's a pretty broad title, but, really, what we're talking about here are some fundamentally different ways to treat data as we work with it. This topic can seem academic, but it is relevant for web analysts specifically and researchers broadly. Indeed, this topic turns out to be pretty darn important when it comes time to apply statistical operations and perform model building and testing.



So, we have to start with the basics: the nature of data. There are four types of data:

 Nominal
 Ordinal
 Interval
 Ratio

Each offers a unique set of characteristics, which impacts the type of analysis that can be performed.

The distinction between the four types of scales centers on three different characteristics:

1. The order of responses – whether it matters or not

2. The distance between observations – whether it matters or is interpretable

3. The presence or inclusion of a true zero

Nominal Scales

Nominal scales measure categories and have the following characteristics:

 Order: The order of the responses or observations does not matter.

 Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the same as a 2

and 3.

 True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.

Consider traffic source (or last touch channel) as an example in which visitors reach our site through a mutually

exclusive channel, or last point of contact. These channels would include:

1. Paid Search

2. Organic Search

3. Email

4. Display

(This list looks artificially short, but the logic and interpretation would remain the same for nine channels or for

99 channels.)

If we want to know that each channel is simply somehow different, then we could count the number of visits

from each channel. Those counts can be considered nominal in nature.

Suppose the counts looked like this:



Channel            Count of Visits

Paid Search        2,143

Organic Search     3,124

Email              1,254

Display            2,077

With nominal data, the order of the four channels would not change or alter the interpretation. Suppose we,

instead, viewed the data like this:

Channel            Count of Visits

Display            2,077

Paid Search        2,143

Email              1,254

Organic Search     3,124

The order of the categories does not matter.



And, the distance between the categories is not relevant: Display is not some multiple of Paid Search, and Email is not meaningfully "half of" Organic Search. While there is an arithmetic relationship between these counts, that is only relevant if we treat the scales as ratio scales (see the Ratio Scales section below).

Finally, zero holds no meaning. We could not interpret a zero because it does not occur in a nominal scale.

Appropriate statistics for nominal scales: mode, count, frequencies



Displays: histograms or bar charts
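As a small illustrative sketch (Python with pandas; the counts are the ones from the channel example above), the nominal-scale statistics could be computed like this:

```python
import pandas as pd

# One nominal label per visit; only counts, frequencies, and the mode are meaningful here.
visits = pd.Series(["Paid Search"] * 2143 + ["Organic Search"] * 3124 +
                   ["Email"] * 1254 + ["Display"] * 2077, name="channel")

print(visits.value_counts())                 # counts per category
print(visits.value_counts(normalize=True))   # relative frequencies
print(visits.mode()[0])                      # the mode: "Organic Search"
```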

Ordinal Scales

At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our characteristics for

ordinal scales are:

 Order: The order of the responses or observations matters.

 Distance: Ordinal scales do not hold distance. The distance between first and second is unknown as is

the distance between first and third along with all observations.

 True Zero: There is no true or real zero. An item, observation, or category cannot finish in position zero.

Let’s work through our traffic source example and rank the channels based on the number of visits to our site,

with “1” being the highest number of visits:

Channel            Rank (by Visits)

Organic Search     1

Paid Search        2

Display            3

Email              4

Again, for this example, we are limiting ourselves to four channels, but the logic would remain the same for

ranking nine channels or 99 channels.



By ranking the channel from most to least number of visitors in terms of last point of contact, we’ve established

an order.

However, the distance between the rankings appears unknown. Organic Search could have one more visit

compared to Paid Search or one hundred more visitors. The distance between the two items appears unknown.

Finally, zero holds no meaning. We could not interpret a zero because it does not occur in an ordinal scale. An

item such as Organic Search could not maintain a zero ranking.

Appropriate statistics for ordinal scales: count, frequencies, mode

Displays: histograms or bar charts

Interval Scales

Interval scales provide insight into the variability of the observations or data. Classic interval scales are Likert

scales (e.g., 1 - strongly agree and 9 - strongly disagree) and Semantic Differential scales (e.g., 1 - dark and 9 -

light). In an interval scale, users could respond to “I enjoy opening links to the website from a company email”

with a response ranging on a scale of values.

The characteristics of interval scales are:

 Order: The order of the responses or observations does matter.

 Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the same as 4 to 5.

Also, six is twice as much as three and two is half of four. Hence, we can perform arithmetic operations

on the data.


True Zero: There is no zero with interval scales. However, data can be rescaled in a manner that contains zero. An interval scale measure from 1 to 9 remains the same as 11 to 19 because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4 scale because we subtracted 5 from all values. Although the new scale contains zero, zero remains uninterpretable because it only appears in the scale as a result of the transformation.



Unless a web analyst is working with survey data, it is doubtful he or she will encounter data from an interval scale. More likely, a web analyst will deal with ratio scales (next section).

Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard deviation (and

variance), skewness, and kurtosis.



Displays: histograms or bar charts, line charts, and scatter plots.

An Illustrative Side Note About Temperature



An argument exists about temperature. Is it an interval scale or an ordinal scale? Many researchers argue for temperature as an interval scale. It offers order (e.g., 212°F is hotter than 32°F), distance (e.g., the interval from 40°F to 44°F is the same as from 100°F to 104°F), and lacks a true zero (e.g., 0°F is not the same as 0°C). However, other researchers argue for temperature as an ordinal scale because of the issue related to distance: 200°F is not twice as hot as 100°F; the human brain registers both temperatures as equally hot (if standing outside) or mild (if touching a stove). Finally, we would not say that 80°F is twice as warm as 40°F or that 30°F is a third as cold as 90°F.

Ratio Scales

Ratio scales are interval scales with a true zero. They have the following characteristics:

 Order: The order of the responses or observations matters.

 Distance: Ratio scales do have an interpretable distance.

 True Zero: There is a true zero.

Income is a classic example of a ratio scale:


Order is established. We would all prefer $100 to $1!
 Zero dollars means we have no income (or, in accounting terms, our revenue exactly equals our expenses!)
 Distance is interpretable, in that $20 is twice $10 and $50 is half of $100.

In web analytics, the number of visits and the number of goal completions serve as examples of ratio scales. A

thousand visits is a third of 3,000 visits, while 400 goal completions are twice as many as 200 goal completions.

Zero visitors or zero goal completions should be interpreted as just that: no visits or completed goals (uh-oh…

did someone remove the page tag?!).



For the web analyst, the statistics for ratio scales are the same as for interval scales.

Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard deviation (and

variance), skewness, and kurtosis.



Displays: histograms or bar charts, line charts, and scatter plots.

An Important Note: Don’t let the term “ratio” trip you up. Laypeople (aka, “non-statisticians”) are taught that

ratios represent a relationship between two numbers. For instance, conversion rate is the “ratio” of orders to

visits. But, as illustrated above, that is an overly narrow definition when it comes to statistics.

Summary Cheat Sheet

The table below summarizes the characteristics of all four types of scales.

                           Nominal   Ordinal   Interval   Ratio

Order Matters              No        Yes       Yes        Yes

Distance Is Interpretable  No        No        Yes        Yes

Zero Exists                No        No        No         Yes

Transformation

Did you notice that we used channel for three of our four examples? And, for all three, the underlying metric

was “visits.” What that means is that any given variable isn’t inherently a single type of data (type of scale). It
depends on how the data is being used.

What that means is that some types of scales can be transformed to other types of scales. We can convert or

transform our data from ratio to interval to ordinal to nominal. However, we cannot convert or transform our

data from nominal to ordinal to interval to ratio.



Put another way, take a look at the cheat sheet above. If you have data using one scale, you can change a “Yes”

to a “No” (and, thus, change the type of scale), but you cannot change a “No” to a “Yes.”

Pause here to take an aspirin as needed, should your head be starting to hurt.

As an example, let’s say our website receives 10,000 visits in a month. That figure – 10,000 visits – is a ratio

scale. I could convert it to the number of visits in a week for that month (let’s pick our month as February, 2015,

as the first of the month fell on a Sunday and there were exactly 4 weeks in the month!):

 Week 1 had 2,000 visits

 Week 2 had 3,000 visits

 Week 3 had 1,000 visits

 Week 4 had 4,000 visits

We could treat these numbers as interval; specifically, an equal width interval. However, there is little reason –

conceptually or managerially – to treat these numbers as interval. So, let’s move on.

We could rank the weeks based on the number of visits, which would transform the data to an ordinal scale.

From most to least number of weekly visits:

 Week 4

 Week 2

 Week 1

 Week 3

Finally, we could group week 2 and week 4 into "heavy traffic" weeks and group week 1 and week 3 into "light traffic" weeks, and we would have created a nominal scale. The order heavy-light or light-heavy would not matter, provided we remember the coding scheme.

We started with a ratio scale that we ultimately transformed into a nominal scale. As we did so, we lost a lot of

information. But, by transforming this data, we can use different analytical tools to answer different types of

questions.
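As a brief sketch (Python with pandas; the weekly counts are the ones from the example above, and the 2,500-visit cut-off for "heavy" weeks is an assumption made only for illustration), the same ratio-scale data can be transformed step by step into ordinal and then nominal form:

```python
import pandas as pd

# Weekly visits for the month (ratio scale: true zero, interpretable distance)
visits = pd.Series({"Week 1": 2000, "Week 2": 3000, "Week 3": 1000, "Week 4": 4000})

# Ratio -> ordinal: rank the weeks from most to least visits
ranks = visits.rank(ascending=False).astype(int)          # Week 4 -> 1, Week 2 -> 2, ...

# Ordinal -> nominal: group the weeks into "light" vs "heavy" traffic categories
labels = pd.cut(visits, bins=[0, 2500, visits.max()], labels=["light", "heavy"])

print(ranks)
print(labels)
```

Note how each step throws information away: the ranks no longer carry the distances between weeks, and the labels no longer carry the order.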

Analytic Processes and Tools



As we grow with the pace of technology, the demand to track data is increasing rapidly. Today, almost 2.5 quintillion bytes of data are generated globally, and this data is of little use until it is organized into a proper structure. It has become crucial for businesses to maintain consistency by collecting meaningful data from the market, and all that takes is the right data analytics tool and a professional data analyst to organize the huge amounts of raw data, on the basis of which a company can then take the right approach.

There are hundreds of data analytics tools in the market today, but selecting the right one depends on your business needs, goals, and data variety. Now, let's check out the top 10 analytics tools for big data.

Big data is the storage and analysis of large data sets. These are complex data sets which can be either structured or unstructured. They are so large that it is not possible to work on them with traditional analytical tools. These
days, organizations are realising the value they get out of big data analytics and hence they are deploying big data
tools and processes to bring more efficiency in their work environment. They are willing to hire good big data
analytics professionals at a good salary. In order to be a big data analyst, you should get acquainted with big data
first and get certification by enrolling yourself in analytics courses online.

1. APACHE Hadoop

It's a Java-based open-source platform that is used to store and process big data. It is built on a cluster system that allows data to be processed efficiently and in parallel. It can process both structured and unstructured data distributed from one server across multiple computers. Hadoop also offers cross-platform support for its users. Today, it is one of the most widely used big data analytics tools and is popular with many tech giants such as Amazon, Microsoft, IBM, etc.
Features of Apache Hadoop:



 Offers quick access via HDFS (Hadoop Distributed File System).
 Free to use and offers an efficient storage solution for businesses.
 Highly flexible and can be easily integrated with MySQL and JSON.
 Highly scalable, as it can distribute large amounts of data in small segments.
 Works on small commodity hardware like JBOD (just a bunch of disks).

2. Cassandra

APACHE Cassandra is an open-source NoSQL distributed database that is used to fetch large amounts of data.

It’s one of the most popular tools for data analytics and has been praised by many tech companies due to its
high scalability and availability without compromising speed and performance. It is capable of delivering

thousands of operations every second and can handle petabytes of data with almost zero downtime. It was created by Facebook back in 2008 and later released publicly.

Features of APACHE Cassandra:


 Data Storage Flexibility: It supports all forms of data i.e. structured, unstructured, semi-
structured, and allows users to change as per their needs.

 Data Distribution System: Easy to distribute data with the help of replicating data on multiple
data centers.
 Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also
offers fast storage and data processing.
 Fault tolerance: The moment any node fails, it is replaced without any delay.

3. Qubole

It's an open-source big data tool that helps in fetching data in a value chain using ad-hoc analysis and machine learning. Qubole is a data lake platform that offers end-to-end services, reducing the time and effort required in moving data pipelines. It is capable of configuring multi-cloud services such as AWS, Azure, and Google Cloud. Besides, it also helps in lowering the cost of cloud computing by up to 50%.

Features of Qubole:
 Supports the ETL process: It allows companies to migrate data from multiple sources into one place.
 Real-time Insight: It monitors users' systems and allows them to view real-time insights.

 Predictive Analysis: Qubole offers predictive analysis so that companies can take actions
accordingly for targeting more acquisitions.

 Advanced Security System: To protect users' data in the cloud, Qubole uses an advanced security system and also helps protect against future breaches. Besides, it also allows encrypting cloud data against any potential threat.

4. Xplenty

It is a data analytics tool for building data pipelines with minimal code. It offers a wide range of solutions for sales, marketing, and support. With the help of its interactive graphical interface, it provides solutions for ETL, ELT, etc. The best part of using Xplenty is its low investment in hardware and software, and it offers support via email, chat, phone, and virtual meetings. Xplenty is a platform for processing data for analytics over the cloud, and it brings all the data together.

Features of Xplenty:
 Rest API: A user can possibly do anything by implementing Rest API

 Flexibility: Data can be sent to, and pulled from, databases, warehouses, and Salesforce.
 Data Security: It offers SSL/TLS encryption, and the platform is capable of verifying algorithms

and certificates regularly.


 Deployment: It offers integration apps for both cloud & in-house and supports deployment to

integrate apps over the cloud.



5. Spark

APACHE Spark is another framework used to process data and perform numerous tasks on a large scale. It processes data across multiple computers with the help of distributed computing tools. It is widely used among data analysts as it offers easy-to-use APIs, simple data-pulling methods, and the ability to handle multiple petabytes of data. Recently, Spark set a record by processing 100 terabytes of data in just 23 minutes, breaking Hadoop's previous world record (71 minutes). This is the reason why big tech giants are moving towards Spark now, and it is highly suitable for ML and AI today.
Features of APACHE Spark:

 Ease of use: It allows users to work in their preferred language (Java, Python, etc.).
 Real-time Processing: Spark can handle real-time streaming via Spark Streaming.
 Flexible: It can run on Mesos, Kubernetes, or in the cloud.

6. Mongo DB

MongoDB, which came into the limelight around 2010, is a free, open-source, document-oriented (NoSQL) database used to store high volumes of data. It uses collections and documents for storage, and each document consists of key-value pairs, which are the basic unit of MongoDB. It is popular among developers due to its availability for multiple programming languages such as Python, JavaScript, and Ruby.
Features of Mongo DB:
 Written in C++: It’s a schema-less DB and can hold varieties of documents inside.
 Simplifies the Stack: With the help of Mongo, a user can easily store files without any disturbance to the stack.
 Master-Slave Replication: It can write/read data from the master and can be called back for
backup.

7. Apache Storm
K
Storm is a robust, user-friendly tool used for data analytics, especially in small companies. The best part about Storm is that it has no programming-language barrier and can support any of them. It was designed to handle pools of large data with fault-tolerant and horizontally scalable methods. When we talk about real-time data processing, Storm leads the chart because of its distributed real-time big data processing system, due to which many tech giants use APACHE Storm in their systems today. Some of the most notable names are Twitter, Zendesk, NaviSite, etc.


Features of Storm:

 Data Processing: Storm processes the data even if a node gets disconnected.
 Highly Scalable: It maintains performance even as the load increases.

 Fast: The speed of APACHE Storm is impeccable and can process up to 1 million messages of 100
bytes on a single node.

8. SAS

Today it is one of the best tools for statistical modeling used by data analysts. Using SAS, a data scientist can mine, manage, extract, or update data in different variants from different sources. Statistical Analysis System (SAS) allows a user to access data in any format (SAS tables or Excel worksheets). Besides that, it also offers a cloud platform for business analytics called SAS Viya, and to provide a stronger grip on AI and ML, they have introduced new tools and products.

Features of SAS:
 Flexible Programming Language: It offers easy-to-learn syntax and vast libraries, which make it suitable for non-programmers.
 Vast Data Format Support: It provides support for many programming languages, including SQL, and carries the ability to read data from any format.
 Encryption: It provides end-to-end security with a feature called SAS/SECURE.

9. Datapine

Datapine is an analytics tool used for BI and was founded back in 2012 in Berlin, Germany. In a short period of time, it has gained much popularity in a number of countries, and it is mainly used for data extraction (for small and medium companies fetching data for close monitoring). With the help of its enhanced UI design, anyone can view and check the data as per their requirements. It is offered in four different price brackets, starting from $249 per month, and dashboards are offered by function, industry, and platform.

Features of Datapine:

 Automation: To cut down on manual work, datapine offers a wide array of AI assistant and BI tools.
 Predictive Tool: datapine provides forecasting/predictive analytics; using historical and current data, it derives future outcomes.
 Add on: It also offers intuitive widgets, visual analytics & discovery, ad hoc reporting, etc.

10. Rapid Miner



It's a fully automated visual workflow design tool used for data analytics. It's a no-code platform, so users aren't required to write code to organize data. Today, it is heavily used in many industries such as ed-tech, training, research, etc. Though it's an open-source platform, it has a limitation of 10,000 data rows and a single logical processor. With the help of RapidMiner, one can easily deploy ML models to the web or mobile (only when the user interface is ready to collect real-time figures).

Features of Rapid Miner:


 Accessibility: It allows users to access 40+ types of files (SAS, ARFF, etc.) via URL.
 Storage: Users can access cloud storage facilities such as AWS and Dropbox.

 Data validation: Rapid miner enables the visual display of multiple results in history for better
evaluation.
Conclusion
Big data has been in the limelight for the past few years and will continue to dominate the market in almost every sector and for every market size. The demand for big data is booming at an enormous rate, and ample tools

are available in the market today; all you need is the right approach and to choose the best data analytics tool as per the project's requirements.

Understanding the Difference Between Reporting and Analytics

 The terms reporting and analytics are often used interchangeably. This is not surprising since both take
in data as “input” — which is then processed and presented in the form of charts, graphs, or

dashboards.

 Reports and analytics help businesses improve operational efficiency and productivity, but in different
ways. While reports explain what is happening, analytics helps identify why it is happening. Reporting

summarizes and organizes data in easily digestible ways while analytics enables questioning and
exploring that data further. It provides invaluable insights into trends and helps create strategies to


help improve operations, customer satisfaction, growth, and other business metrics.

Reporting and analysis are both important for an organization to make informed decisions by
presenting data in a format that is easy to understand. In reporting, data is brought together from
different sources and presented in an easy-to-consume format. Typically, modern reporting apps
today offer next-generation dashboards with high-level data visualization capabilities. There are several

types of reports being generated by companies including financial reports, accounting reports,
operational reports, market reports, and more. This helps understand how each function is performing

at a glance. But for deeper insights, analytics is required.



 Analytics enables business users to cull out insights from data, spot trends, and help make better
decisions. Next-generation analytics takes advantage of emerging technologies like AI, NLP, and

machine learning to offer predictive insights based on historical and real-time data.

 To run analytics, reporting is not necessary.

 For instance, let us take a look at a manufacturing company that uses Oracle ERP to manage various

functions including accounting, financial management, project management, procurement, and supply
chain. For business users, it is critical to have a finger on the pulse of all key data. Additionally, specific
teams need to periodically generate reports and present data to senior management and other
stakeholders. In addition to reporting, it is also essential to analyze data from various sources and
gather insights. The problem today is people are using reporting and analytics interchangeably. When
the time comes to replace an end-of-life operational reporting tool, they are using solutions that are
designed for analytics. This would be a waste of time and resources.

 It is critical that operational reporting is done using a tool built for that purpose. Ideally, it’ll be a self-
service tool so business users don’t have to rely on IT to generate reports. It must have the ability to
drill down into several layers of data when needed. Additionally, if you’re using Oracle ERP you need an
operational reporting tool like Orbit that seamlessly integrates data from various business systems –
both on-premise and cloud. In this blog, we look at the nuances of both operational reporting and
analytics and why it is critical to have the right tools for the right tasks.
Steps Involved in Building a Report and Preparing Data for Analytics

To build a report, the steps involved broadly include:

 Identifying the business need

 Collecting and gathering relevant data

 Translating the technical data

 Understanding the data context

 Creating reporting dashboards

 Enabling real-time reporting

 Offering the ability to drill down into reports


For data analytics, the steps involved include:

 Creating a data hypothesis

 Gathering and transforming data

 Building analytical models to ingest data, process it, and offer insights

 Using tools for data visualization, trend analysis, deep dives, etc.

 Using data and insights for making decisions



Five Key Differences Between Reporting and Analysis



 One of the key differences between reporting and analytics is that, while a report involves organizing
data into summaries, analysis involves inspecting, cleaning, transforming, and modeling these reports
to gain insights for a specific purpose.

 Knowing the difference between the two is essential to fully benefit from the potential of both without
missing out on key features of either one. Some of the key differences include:


1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to gain an understanding of the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easier to understand. Analysis is the ability to interpret data at a deeper level and to provide recommendations on actions.
2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting,

and summarizing. It requires clean, raw data and reports that may be generated periodically, such as daily,
weekly, monthly, quarterly, and yearly. Analytics includes asking questions, examining, comparing, interpreting,

and confirming. Enriching data with big data can help predict future trends as well.

3. The Final Output: In the case of reporting, outputs such as canned reports, dashboards, and alerts push
information to users. Through analysis, analysts try to extract answers using business queries and present them

in the form of ad hoc responses, insights, recommended actions, or a forecast. Understanding this key
difference can help businesses leverage analytics better.

4. People: Reporting requires repetitive tasks that can be automated. It is often used by functional business
heads who monitor specific business metrics. Analytics requires customization and therefore depends on data
analysts and scientists. Also, it is used by business leaders to make data-driven decisions.

5. Value Proposition: This is like comparing apples to oranges. Both reporting and analytics serve a different
purpose. By understanding the purpose and using them correctly, businesses can derive immense value from
both.


Orbit for both Reporting and Analytics

Orbit Reporting and Analytics is a single tool that can be used for both generating different reports and
running analytics to meet business objectives. It can work in multi-cloud environments, extracting data
from the cloud and on-prem systems and presenting them in many ways as required by the user. It
enables self-service, allowing business users to generate their own reports without depending on the IT

team, in real-time. It complies with security and privacy requirements by allowing access only to
authorized users. It also allows users to generate reports in real-time in Excel.

 It also facilitates analytics, enabling businesses to draw insights and convert them into actions to

predict future trends, identify areas of improvement across functions, and meet the organizational goal
of growth.

Modern Data Analytic Tools | Data Analytics



1. Apache Hadoop:
1. Apache Hadoop is a big data analytics tool and a Java-based free software framework.
2. It helps in the effective storage of a huge amount of data in a storage place known as a cluster.
3. It runs in parallel on a cluster and also has the ability to process huge data across all nodes in it.

4. There is a storage system in Hadoop popularly known as the Hadoop Distributed File System (HDFS), which helps to split large volumes of data and distribute them across the many nodes present in a cluster.
2. KNIME:
1. KNIME analytics platform is one of the leading open solutions for data-driven innovation.
2. This tool helps in discovering the potential hidden in huge volumes of data; it can also mine for fresh insights or predict new futures.

3. OpenRefine:
1. The OpenRefine tool is one of the most efficient tools for working with messy and large volumes of data.

2. It includes cleansing data and transforming that data from one format to another.
3. It helps to explore large data sets easily.
4. Orange:
1. Orange is famous for open-source data visualization and helps with data analysis for beginners as well as experts.
2. This tool provides interactive workflows with a large toolbox for creating them, which helps in analyzing and visualizing data.
5. RapidMiner:
1. The RapidMiner tool operates using visual programming and is capable of manipulating, analyzing, and modelling data.
2. RapidMiner makes data science teams more productive by providing an open-source platform for all their jobs, like machine learning, data preparation, and model deployment.

6. R-programming:
1. R is a free open source software programming language and a software environment for statistical

computing and graphics.



2. It is used by data miners for developing statistical software and data analysis.
3. It has become a highly popular tool for big data in recent years.

7. Datawrapper:
1. It is an online data visualization tool for making interactive charts.

2. It uses data files in CSV, PDF, or Excel format.


3. Datawrapper generates visualizations in the form of bar charts, line charts, maps, etc. They can be embedded into any other website as well.

8. Tableau:
1. Tableau is another popular big data tool. It is simple and very intuitive to use.
2. It communicates the insights of the data through data visualization.
3. Through Tableau, an analyst can check a hypothesis and explore the data before starting to work
on it extensively.


Sampling Distribution Definition


Sampling distribution in statistics refers to studying many random samples collected from a given population
based on a specific attribute. The results obtained provide a clear picture of variations in the probability of the
outcomes derived. As a result, the analysts remain aware of the results beforehand, and hence, they can make
preparations to take action accordingly.


As the data is based on one population at a time, the information gathered is easy to manage and is more

reliable as far as obtaining accurate results is concerned. Therefore, the sampling distribution is an effective tool

in helping researchers, academicians, financial analysts, market strategists, and others make well-informed and
wise decisions.

 Sampling distribution refers to studying the randomly chosen samples to understand the variations in
the outcome expected to be derived.


Many researchers, academicians, market strategists, etc., go ahead with it instead of choosing the
entire population.
 Sampling distribution of the mean, sampling distribution of proportion, and T-distribution are three
major types of finite-sample distribution.
 The central limit theorem states how the distribution still remains normal and almost accurate with
increasing sample size.

How Does Sampling Distribution Work?

Sampling distribution in statistics represents the probability of varied outcomes when a study is conducted. It is
also known as finite-sample distribution. In the process, users collect samples randomly but from one chosen
population. A population is a group of people having the same attribute used for random sample collection in
terms of statistics.
However, the data collected is not based on the population but on samples collected from a specific population
to be studied. Thus, a sample becomes a subset of the chosen population. With sampling distribution, the

samples are studied to determine the probability of various outcomes occurring with respect to certain events.
For example, deriving data to understand the adverts that can help attract teenagers would require selecting a

population of those aged between 13 and 19 only.

Using finite-sample distribution, users can calculate the mean, range, standard deviation, mean absolute value

of the deviation, variance, and unbiased estimate of the variance of the sample. No matter for what purpose
users wish to use the collected data, it helps strategists, statisticians, academicians, and financial analysts make

necessary preparations and take relevant actions with respect to the expected outcome.

As soon as users decide to utilize the data for further calculation, the next step is to develop a frequency
distribution with respect to individual sample statistics as calculated through the mean, variance, and other
methods. Next, they plot the frequency distribution for each of them on a graph to represent the variation in
the outcome. This representation is indicated on the distribution graph.

Influencing Factors

Moreover, the accuracy of the distribution depends on various factors, and the major ones that influence the
results include:

 Number of observations in the population. It is denoted by “N.”



 Number of observations in the sample. It is denoted by “n.”


 Methods adopted for choosing samples randomly. It leads to variation in the outcome.

Types

The finite-sample distribution can be expressed in various forms. Here is a list of some of its types:


#1 – Sampling Distribution of Mean

It is the probabilistic spread of all the means of samples of a fixed size that users choose randomly from a particular population. When they plot the individual means on a graph, the result approximates a normal distribution, and the center of the graph is the mean of the finite-sample distribution, which is also the mean of that population.
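A minimal simulation sketch (Python with NumPy; the population shape, sample size, and number of samples are invented purely for illustration) shows this behaviour:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=10, size=100_000)   # an assumed, skewed population

# Draw many random samples of a fixed size and record each sample mean
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2_000)])

# The sampling distribution of the mean is centred on the population mean,
# and its spread is the standard error of the mean.
print(population.mean(), sample_means.mean(), sample_means.std())
```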

#2 – Sampling Distribution of Proportion



This type of finite-sample distribution identifies the proportions of the population. The users select samples and
calculate the sample proportion. They, then, plot the resulting figures on the graph. The mean of the sample

proportions gathered from each sample group signifies the mean proportion of the population as a whole. For
example, a Vlogger collects data from a sample group to find out the proportion of it interested in watching its

upcoming videos.

#3 – T-Distribution

People use this type of distribution when they are not well aware of the chosen population or when the sample size is very small. This symmetrical form of distribution approximates the standard normal variate. As the sample size increases, the T-distribution becomes very close to the normal distribution. Users use it to find the mean of the population, statistical differences, etc.

Significance

This type of distribution plays a vital role in ensuring the outcome derived accurately represents the entire
population. However, reading or observing each individual in a population is difficult. Therefore, selecting
samples from the population randomly is an attempt to make sure the study conducted could help understand
the reactions, responses, grievances, or aspirations of a chosen population in the most effective way.

The method simplifies the path to statistical inference. Moreover, it allows analytical considerations to focus on

a static distribution rather than the mixed probabilistic spread of each chosen sample unit. This distribution
eliminates the variability present in the statistic.

It provides us with an answer about the probable outcomes which are most likely to happen. In addition, it plays
a key role in inferential statistics and makes almost accurate inferences through chosen samples representing

the population.
Examples

Let us consider the following examples to understand the concept better:

Example #1
Sarah wants to analyze the number of teens aged 13-18 riding a bicycle in two regions.

Instead of considering each individual in the population of 13-18 years of age in the two regions, she selected
200 samples randomly from each area.

Here,

 The average count of the bicycle usage here is the sample mean.

 Each chosen sample has its own generated mean, and the distribution of these sample means is the sampling distribution.

 The deviation obtained is termed the standard error.

She plots the data gathered from the sample on a graph to get a clear view of the finite-sample distribution.


Example #2

Researcher Samuel conducts a study to determine the average weight of 12-year-olds from five different
regions. Thus, he decides to collect 20 samples from each region. Firstly, the researcher collects 20 samples
from region A and finds out the mean of those samples. Then, he repeats the same for regions B, C, D, and E to
get a separate representation for each sample population.

The researcher computes the mean of the finite-sample distribution after finding the respective average weight
of 12-year-olds. In addition, he also calculates the standard deviation of sampling distribution and variance.

Central Limit Theorem



The discussion on sampling distribution is incomplete without the mention of the central limit theorem, which

states that the shape of the distribution will depend on the size of the sample.

According to this theorem, increasing the sample size reduces the standard error, keeping the distribution close to normal. When users plot the sample means on a graph, the shape will be close to a bell curve. In short, the larger the samples and the more of them one studies, the better and more normal the resulting representation.
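A small sketch of the theorem's effect (Python with NumPy; the uniform population and the chosen sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# A clearly non-normal (uniform) population, assumed purely for illustration
population = rng.uniform(0, 100, size=100_000)

# For increasing sample sizes, draw many samples and inspect the distribution of their means:
# it stays centred on the population mean while its spread (the standard error) shrinks roughly as 1/sqrt(n).
for n in (5, 30, 200):
    sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(2_000)])
    print(f"n={n:4d}  mean of sample means={sample_means.mean():6.2f}  standard error={sample_means.std():5.2f}")
```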
Frequently Asked Questions (FAQs)

What is sampling distribution?

Also known as finite-sample distribution, it is the statistical study where samples are randomly chosen from a
population with specific attributes to determine the probability of varied outcomes. The result obtained helps
academicians, financial analysts, market strategists, and researchers conclude a study, take relevant actions and
make wiser decisions.

How to find the mean of the sampling distribution?

To calculate it, the users follow the below-mentioned steps:

• Choose samples randomly from a population


• Carry out the calculation of mean, variance, standard deviation, or other as per the requirement
• Obtain frequency distribution for each sample gathered
• Plot the data collected on the graph

Why is sampling distribution important?

It is important to obtain a graphical representation to understand to what extent the outcome related to an
event could vary. In addition, it helps users to understand the population with which they are dealing. For
example, a businessman can figure out the probability of how fruitful selling their products or services would be.

At the same time, financial analysts can compare the investment vehicles and determine which one has more
potential to bear more profits, etc.

Introduction to Resampling methods N


Resampling is the method that consists of drawing repeated samples from the original data samples. The
method of Resampling is a nonparametric method of statistical inference. In other words, the method of
resampling does not involve the utilization of the generic distribution tables (for example, normal distribution

tables) in order to compute approximate probability (p) values.



Resampling involves the selection of randomized cases with replacement from the original data sample, in such a manner that each sample drawn has a number of cases similar to the original data sample. Due to replacement, the samples drawn by the resampling method may contain repeated cases.
N
IG

While reading about Machine Learning and Data Science we often come across a term called Imbalanced Class
Distribution , generally happens when observations in one of the classes are much higher or lower than any
other classes.
V

As Machine Learning algorithms tend to increase accuracy by reducing the error, they do not consider the class
distribution. This problem is prevalent in examples such as Fraud Detection, Anomaly Detection, Facial
recognition etc.
Two common methods of Resampling are –
1. Cross Validation
2. Bootstrapping

Cross Validation –
Cross-Validation is used to estimate the test error associated with a model to evaluate its performance.
Validation set approach:
This is the most basic approach. It simply involves randomly dividing the dataset into two parts: first a training
set and second a validation set or hold-out set. The model is fit on the training set and the fitted model is used
to make predictions on the validation set.
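
A minimal sketch of the validation set approach with scikit-learn (the dataset and model here are illustrative, not part of the original example):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# randomly divide the data into a training set and a validation (hold-out) set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)              # fit on the training set
val_mse = mean_squared_error(y_val, model.predict(X_val))     # estimate test error on the hold-out set
print("Validation set MSE:", val_mse)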

Leave-one-out-cross-validation:
LOOCV is a better option than the validation set approach. Instead of splitting the entire dataset into two halves,
only one observation is used for validation and the rest is used to fit the model. This is repeated for every
observation, and the test error is averaged over all of these fits.
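
A short LOOCV sketch using scikit-learn (illustrative data and model):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5, random_state=0)

# each of the 50 fits holds out exactly one observation for validation
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring='neg_mean_squared_error')
print("LOOCV estimate of test MSE:", -scores.mean())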
k-fold cross-validation –
This approach involves randomly dividing the set of observations into k folds of nearly equal size. The first fold is
treated as a validation set and the model is fit on the remaining folds. The procedure is then repeated k times,

where a different group each time is treated as the validation set.
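
A sketch of k-fold cross-validation with k = 5, again using scikit-learn and made-up data:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# each fold is used once as the validation set while the model is fit on the remaining folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring='neg_mean_squared_error')
print("5-fold CV estimate of test MSE:", -scores.mean())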

Bootstrapping –

Bootstrap is a powerful statistical tool used to quantify the uncertainty of a given model. However, the real
power of the bootstrap is that it can be applied to a wide range of models where the variability is hard to obtain
or is not output automatically.

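As a minimal sketch of the idea, the following code (with made-up data) bootstraps a 95% confidence interval for a sample mean:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=200)        # the original sample

# draw B bootstrap samples (with replacement) and record the statistic of interest
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(5000)]

# 95% percentile confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print("Bootstrap 95% CI for the mean:", (round(lower, 2), round(upper, 2)))
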
Challenges:
Algorithms in Machine Learning tend to produce unsatisfactory classifiers when handled with unbalanced
datasets.

For example, Movie Review datasets:

Total Observations : 100
Positive Dataset : 90
Negative Dataset : 10
Event rate (minority class) : 10/100 = 10%

The main problem here is how to get a balanced dataset.


Challenges with standard ML algorithms:
Standard ML techniques such as Decision Tree and Logistic Regression have a bias towards the majority class,
and they tend to ignore the minority class. They tend only to predict the majority class, hence, having major

misclassification of the minority class in comparison with the majority class.
The evaluation of a classification algorithm is measured by the confusion matrix.

A way to evaluate the results is by the confusion matrix, which shows the correct and incorrect predictions for
each class. In the first row, the first column indicates how many classes “True” got predicted correctly, and the
second column, how many classes “True” were predicted as “False”. In the second row, we note that all class
“False” entries were predicted as class “True”.
Therefore, the higher the diagonal values of the confusion matrix, the better the correct prediction.
Handling Approach:
 Random Over-sampling:
It aims to balance class distribution by randomly increasing minority class examples by replicating
them.
For example –

Total Observations : 100
Positive Dataset : 90
Negative Dataset : 10
Event Rate (minority class) : 10/100 = 10%

 We replicate the Negative Dataset 15 times

Positive Dataset: 90
Negative Dataset after Replicating: 150
Total Observations: 90 + 150 = 240
Event Rate after Over-Sampling : 150/240 = 63%

 SMOTE (Synthetic Minority Oversampling Technique) synthesises new minority instances between
existing minority instances. It randomly picks up the minority class and calculates the K-nearest

neighbour for that particular point. Finally, the synthetic points are added between the neighbours
and the chosen spot.
 Random Under-Sampling:
It aims to balance class distribution by randomly eliminating majority class examples.
For Example –
Total Observations : 100

Positive Dataset : 90
Negative Dataset : 10
Event rate (minority class) : 10/100 = 10%

We take a 10% sample of the Positive Dataset and combine it with the Negative Dataset.

Positive Dataset after Random Under-Sampling : 10% of 90 = 9
Total observations after combining it with the Negative Dataset: 10 + 9 = 19
Event Rate after Under-Sampling : 10/19 = 53%

 When instances of two different classes are very close to each other, we remove the instances of
the majority class to increase the space between the two classes. This helps in the classification
process.

 Cluster-based Over Sampling:

The K-means clustering algorithm is independently applied to both the class instances so as to
identify clusters in the dataset. All clusters are then oversampled such that clusters of the same class
have the same size. See the sketch after this example.

For Example –
Total Observations : 100
Positive Dataset : 90
Negative Dataset : 10
Event Rate (minority class) : 10/100 = 10%

 Majority Class Cluster:

Cluster 1: 20 Observations

Cluster 2: 30 Observations
Cluster 3: 12 Observations
Cluster 4: 18 Observations
Cluster 5: 10 Observations
Minority Class Cluster:
Cluster 1: 8 Observations
Cluster 2: 12 Observations

After oversampling all clusters of the same class have the same number of observations.
Majority Class Cluster:
Cluster 1: 20 Observations
Cluster 2: 20 Observations
Cluster 3: 20 Observations
Cluster 4: 20 Observations
Cluster 5: 20 Observations
Minority Class Cluster:
Cluster 1: 15 Observations
Cluster 2: 15 Observations N
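
A rough sketch of the idea (not a library implementation): cluster each class with K-means and then resample every cluster with replacement up to the size of that class's largest cluster. The function name and parameters below are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.utils import resample

def cluster_based_oversample(X, y, n_clusters=3, random_state=42):
    X_out, y_out = [], []
    for label in np.unique(y):
        X_cls = X[y == label]
        # find clusters within this class
        clusters = KMeans(n_clusters=n_clusters, n_init=10,
                          random_state=random_state).fit_predict(X_cls)
        target = max(np.sum(clusters == c) for c in range(n_clusters))
        for c in range(n_clusters):
            X_c = X_cls[clusters == c]
            # sample with replacement until the cluster reaches the target size
            X_out.append(resample(X_c, replace=True, n_samples=target, random_state=random_state))
            y_out.append(np.full(target, label))
    return np.vstack(X_out), np.concatenate(y_out)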
Below is the implementation of some resampling techniques:
You can download the dataset from the given link below : Dataset download
 Python3

# importing libraries

import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.over_sampling import RandomOverSampler, SMOTE

 Python3

dataset = pd.read_csv(r'C:\Users\Abhishek\Desktop\creditcard.csv')

print("The Number of Samples in the dataset: ", len(dataset))
print('Class 0 :', round(dataset['Class'].value_counts()[0]
      / len(dataset) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(dataset['Class'].value_counts()[1]
      / len(dataset) * 100, 2), '% of the dataset')

 Python3

# split features and target, then randomly under-sample the majority class
X_data = dataset.iloc[:, :-1]
Y_data = dataset.iloc[:, -1:]

rus = RandomUnderSampler(random_state = 42)
X_res, y_res = rus.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After Under Sampling Of Major Class Total Samples are :", len(Y_res))
print('Class 0 :', round(Y_res[0].value_counts()[0]
      / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(Y_res[0].value_counts()[1]
      / len(Y_res) * 100, 2), '% of the dataset')

 Python3

# under-sample the majority class by removing Tomek links
tl = TomekLinks()
X_res, y_res = tl.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After TomekLinks Under Sampling Of Major Class Total Samples are :", len(Y_res))
print('Class 0 :', round(Y_res[0].value_counts()[0]
      / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(Y_res[0].value_counts()[1]
      / len(Y_res) * 100, 2), '% of the dataset')

 Python3

# randomly over-sample the minority class by replicating its examples
ros = RandomOverSampler(random_state = 42)
X_res, y_res = ros.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After Over Sampling Of Minor Class Total Samples are :", len(Y_res))
print('Class 0 :', round(Y_res[0].value_counts()[0]
      / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(Y_res[0].value_counts()[1]
      / len(Y_res) * 100, 2), '% of the dataset')

 Python3

# over-sample the minority class with SMOTE (synthetic examples)
sm = SMOTE(random_state = 42)
X_res, y_res = sm.fit_resample(X_data, Y_data)

X_res = pd.DataFrame(X_res)
Y_res = pd.DataFrame(y_res)

print("After SMOTE Over Sampling Of Minor Class Total Samples are :", len(Y_res))
print('Class 0 :', round(Y_res[0].value_counts()[0]
      / len(Y_res) * 100, 2), '% of the dataset')
print('Class 1(Fraud) :', round(Y_res[0].value_counts()[1]
      / len(Y_res) * 100, 2), '% of the dataset')

Statistical inference
Statistical inference is the process of analysing the result and making conclusions from data subject to random
variation. It is also called inferential statistics. Hypothesis testing and confidence intervals are the applications of
the statistical inference. Statistical inference is a method of making decisions about the parameters of a
population, based on random sampling. It helps to assess the relationship between the dependent and
independent variables. The purpose of statistical inference is to estimate the uncertainty or sample-to-sample
variation. It allows us to provide a probable range of values for the true values of something in the population.
The components used for making statistical inference are:

 Sample Size

 Variability in the sample

 Size of the observed differences

Types of Statistical Inference

There are different types of statistical inferences that are extensively used for making conclusions. They are:

 One sample hypothesis testing

 Confidence Interval

 Pearson Correlation

 Bi-variate regression

 Multi-variate regression

 Chi-square statistics and contingency table

 ANOVA or T-test

Statistical Inference Procedure

The procedure involved in inferential statistics is:

 Begin with a theory
 Create a research hypothesis
 Operationalize the variables
 Recognize the population to which the study results should apply
 Formulate a null hypothesis for this population
 Accumulate a sample from the population and continue the study
 Conduct statistical tests to see if the collected sample properties are adequately different from what
would be expected under the null hypothesis to be able to reject the null hypothesis

Statistical Inference Solution


Statistical inference solutions produce efficient use of statistical data relating to groups of individuals or trials. It
deals with all aspects, including the collection, investigation and analysis of data, and organizing the collected
data. By statistical inference solutions, people can acquire knowledge after starting their work in diverse fields.
Some statistical inference solution facts are:

 It is a common way to assume that the observed sample is of independent observations from a
population type like Poisson or normal

 Statistical inference solution is used to evaluate the parameter(s) of the expected model like normal
mean or binomial proportion

Importance of Statistical Inference

Inferential Statistics is important to examine the data properly. To make an accurate conclusion, proper data
analysis is important to interpret the research results. It is majorly used in the future prediction for various
observations in different fields. It helps us to make inferences about the data. Statistical inference has a wide
range of applications in different fields, such as:

 Business Analysis
 Artificial Intelligence
 Financial Analysis
 Fraud Detection
 Machine Learning
 Share Market
 Pharmaceutical Sector

Statistical Inference Examples

An example of statistical inference is given below.


Question: From a shuffled pack of cards, a card is drawn. This trial is repeated 400 times, and the suits obtained are
given below:

Suit                 Spade   Clubs   Hearts   Diamonds
No. of times drawn     90     100      120        90

When a card is drawn at random, what is the probability of getting

1. Diamond cards
2. Black cards
3. Except for spade

Solution:

By statistical inference solution,

Total number of events = 400

i.e., 90 + 100 + 120 + 90 = 400

(1) The probability of getting diamond cards:

Number of trials in which diamond card is drawn = 90

Therefore, P(diamond card) = 90/400 = 0.225

(2) The probability of getting black cards:

Number of trials in which black card showed up = 90+100 =190

Therefore, P(black card) = 190/400 = 0.475

S
(3) Except for spade

Number of trials other than spade showed up = 90+100+120 =310

TE
Therefore, P(except spade) = 310/400 = 0.775
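
A quick check of these probabilities in Python (purely illustrative):

counts = {'Spade': 90, 'Clubs': 100, 'Hearts': 120, 'Diamonds': 90}
total = sum(counts.values())                             # 400 trials in total

p_diamond = counts['Diamonds'] / total                   # 0.225
p_black = (counts['Spade'] + counts['Clubs']) / total    # 0.475
p_not_spade = (total - counts['Spade']) / total          # 0.775
print(p_diamond, p_black, p_not_spade)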

Prediction Error

Prediction error refers to the difference between the predicted values made by some model and the actual
values.
Prediction error is often used in two settings:

1. Linear regression: Used to predict the value of some continuous response variable.
We typically measure the prediction error of a linear regression model with a metric known as RMSE, which
stands for root mean squared error.

It is calculated as:

RMSE = √( Σ(ŷi – yi)² / n )

where:

 Σ is a symbol that means “sum”
 ŷi is the predicted value for the ith observation
 yi is the observed value for the ith observation
 n is the sample size
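
A small sketch of this calculation (the values are made up):

import numpy as np

y_actual = np.array([12, 15, 20, 16, 20])
y_pred = np.array([14, 15, 18, 19, 25])

# RMSE = sqrt( sum((y_hat - y)^2) / n )
rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
print("RMSE:", round(rmse, 2))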

2. Logistic Regression: Used to predict the value of some binary response variable.

One common way to measure the prediction error of a logistic regression model is with a metric known as the
total misclassification rate.

It is calculated as:

Total misclassification rate = (# incorrect predictions / # total predictions)
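
And a matching sketch for the misclassification rate (again with made-up values):

y_actual = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred   = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0]

# proportion of predictions that do not match the observed outcome
misclassification_rate = sum(a != p for a, p in zip(y_actual, y_pred)) / len(y_actual)
print("Total misclassification rate:", misclassification_rate)   # 0.4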

The lower the value for the misclassification rate, the better the model is able to predict the outcomes of the

response variable.

The following examples show how to calculate prediction error for both a linear regression model and a logistic
regression model in practice.

Example 1: Calculating Prediction Error in Linear Regression

Suppose we use a regression model to predict the number of points that 10 players will score in a basketball
game.

The following table shows the predicted points from the model vs. the actual points the players scored:
Predicted points : 14  15  18  19  25  18  12  12  15  22
Actual points    : 12  15  20  16  20  19  16  20  16  16

We would calculate the root mean squared error (RMSE) as:

 RMSE = √( Σ(ŷi – yi)² / n )
 RMSE = √(((14-12)² + (15-15)² + (18-20)² + (19-16)² + (25-20)² + (18-19)² + (12-16)² + (12-20)² + (15-16)² + (22-16)²) / 10)
 RMSE = 4

The root mean squared error is 4. This tells us that the average deviation between the predicted points scored
and the actual points scored is 4.


Example 2: Calculating Prediction Error in Logistic Regression

Suppose we use a logistic regression model to predict whether or not 10 college basketball players will get

drafted into the NBA.

The following table shows the predicted outcome for each player vs. the actual outcome (1 = Drafted, 0 = Not
Drafted):

[Table: predicted vs. actual draft outcome for each of the 10 players; 4 of the 10 predictions are incorrect.]
We would calculate the total misclassification rate as:


 Total misclassification rate = (# incorrect predictions / # total predictions)
 Total misclassification rate = 4/10
 Total misclassification rate = 40%

The total misclassification rate is 40%.

This value is quite high, which indicates that the model doesn’t do a very good job of predicting whether or not
a player will get drafted.

UNIT II SEARCH METHODS AND VISUALIZATION

Search by simulated Annealing – Stochastic, Adaptive search by Evaluation – Evaluation Strategies


–Genetic Algorithm – Genetic Programming – Visualization – Classification of Visual Data Analysis
Techniques – Data Types – Visualization Techniques – Interaction techniques – Specific Visual data
analysis Techniques

Search by simulated Annealing

 Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a
given function. Specifically, it is a metaheuristic to approximate global optimization in a large search
space for an optimization problem. It is often used when the search space is discrete (for example

the traveling salesman problem, the boolean satisfiability problem, protein structure prediction,
and job-shop scheduling). For problems where finding an approximate global optimum is more
important than finding a precise local optimum in a fixed amount of time, simulated annealing may be


preferable to exact algorithms such as gradient descent or branch and bound.

The name of the algorithm comes from annealing in metallurgy, a technique involving heating and
K
controlled cooling of a material to alter its physical properties. Both are attributes of the material that
depend on their thermodynamic free energy. Heating and cooling the material affects both the
temperature and the thermodynamic free energy or Gibbs energy. Simulated annealing can be used for
very hard computational optimization problems where exact algorithms fail; even though it usually
achieves an approximate solution to the global minimum, it could be enough for many practical
problems.
 The problems solved by SA are currently formulated by an objective function of many variables, subject
to several constraints. In practice, the constraint can be penalized as part of the objective function.
 Similar techniques have been independently introduced on several occasions, including Pincus
(1970),[1] Khachaturyan et al (1979,[2] 1981[3]), Kirkpatrick, Gelatt and Vecchi (1983), and Cerny
(1985).[4] In 1983, this approach was used by Kirkpatrick, Gelatt Jr., Vecchi, [5] for a solution of the
traveling salesman problem. They also proposed its current name, simulated annealing.
 This notion of slow cooling implemented in the simulated annealing algorithm is interpreted as a slow
decrease in the probability of accepting worse solutions as the solution space is explored. Accepting
worse solutions allows for a more extensive search for the global optimal solution. In general,
simulated annealing algorithms work as follows. The temperature progressively decreases from an
initial positive value to zero. At each time step, the algorithm randomly selects a solution close to the
current one, measures its quality, and moves to it according to the temperature-dependent

probabilities of selecting better or worse solutions, which during the search respectively remain at 1
(or positive) and decrease toward zero.

 The simulation can be performed either by a solution of kinetic equations for density functions [6][7] or
by using the stochastic sampling method.[5][8] The method is an adaptation of the Metropolis–Hastings
algorithm, a Monte Carlo method to generate sample states of a thermodynamic system, published
by N. Metropolis et al. in 1953.

Local Search with Simulated Annealing from Scratch

Generic Python code with 3 examples

A short refresher: local search is a heuristic that tries to improve a given solution by looking at neighbors. If the
objective value of a neighbor is better than the current objective value, the neighbor solution is accepted and the

search continues. Simulated annealing allows worse solutions to be accepted; this makes it possible to escape
local minima.
Simulated Annealing Generic Code
The code works as follows: we are going to create four code files. The most important one is sasolver.py; this file
contains the generic code for simulated annealing. The problems directory contains three examples of
optimization problems that we can run to test the SA solver.


This is the folder structure:


sasolver.py
problems/
    knapsack.py
    rastrigin.py
    tsp.py

For solving a problem with simulated annealing, we start to create a class that is quite generic:

import copy
import logging
import math
import numpy as np
import random
import time

from problems.knapsack import Knapsack
from problems.rastrigin import Rastrigin
from problems.tsp import TravelingSalesman


class SimulatedAnnealing():
    def __init__(self, problem):
        self.problem = problem

    def run_sa(self, max_iterations: int=100000, update_iterations: int=10000, time_limit: int=60, cooling_schedule: str='lin'):
        start = time.time()
        best_solution = self.problem.baseline_solution()
        best_obj = self.problem.score_solution(best_solution)
        logging.info(f"First solution. Objective: {round(best_obj, 2)} Solution: {best_solution}")
        initial_temp = best_obj
        prev_solution = copy.deepcopy(best_solution)
        prev_obj = best_obj

        iteration = 0
        last_update = 0

        while time.time() - start < time_limit:
            iteration += 1
            last_update += 1
            accept = False

            curr_solution = self.problem.select_neighbor(copy.deepcopy(prev_solution))
            curr_obj = self.problem.score_solution(curr_solution)

            temperature = self._calculate_temperature(initial_temp, iteration, max_iterations, cooling_schedule)
            acceptance_value = self._acceptance_criterion(curr_obj, prev_obj, temperature)

            if (curr_obj <= prev_obj) or (temperature > 0 and random.random() < acceptance_value):
                accept = True

            if curr_obj < best_obj:
                best_solution = copy.deepcopy(curr_solution)
                best_obj = curr_obj
                prev_solution = copy.deepcopy(curr_solution)
                prev_obj = curr_obj
                last_update = 0
                logging.info(f"Better solution found. Objective: {round(best_obj, 2)} Solution: {curr_solution}")
            else:
                if accept:
                    prev_obj = curr_obj
                    prev_solution = copy.deepcopy(curr_solution)
                    last_update = 0

            if last_update >= update_iterations:
                break

        logging.info(f"Final solution: {best_solution} Objective: {round(best_obj, 2)}")
        return best_solution

    @staticmethod
    def _acceptance_criterion(obj_new, obj_curr, temperature, mod=1):
        """
        Determine the acceptance criterion (threshold for accepting a solution that is worse than the current one)
        """
        diff = obj_new - obj_curr
        try:
            acc = math.exp(-diff / temperature)
        except OverflowError:
            acc = -1
        return acc

    @staticmethod
    def _calculate_temperature(initial_temp: int, iteration: int, max_iterations: int, how: str = None) -> float:
        """
        Decrease the temperature to zero based on total number of iterations.
        """
        if iteration >= max_iterations:
            return -1
        if how == "exp":
            cooling_rate = 0.95
            return initial_temp * (cooling_rate**iteration)
        elif how == "quadratic":
            cooling_rate = 0.01
            return initial_temp / (1 + cooling_rate * iteration**2)
        elif how == "log":
            cooling_rate = 1.44
            return initial_temp / (1 + cooling_rate * np.log(1 + iteration))
        elif how == "lin mult":
            cooling_rate = 0.1
            return initial_temp / (1 + cooling_rate * iteration)
        else:
            return initial_temp * (1 - iteration / max_iterations)


if __name__ == '__main__':
    problem = 'rastrigin'  # choose one of knapsack, tsp, rastrigin
    logging.basicConfig(filename=f'{problem}.log', encoding='utf-8', level=logging.INFO)
    if problem == 'tsp':
        problem = TravelingSalesman(n_locations=10, height=100, width=100)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()
        problem._plot_solution(final_solution, title='final')
    elif problem == 'knapsack':
        problem = Knapsack(knapsack_capacity=100, n_items=10)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()
    elif problem == 'rastrigin':
        problem = Rastrigin(n_dims=2)
        sa = SimulatedAnnealing(problem)
        final_solution = sa.run_sa()

This file is sasolver.py. It takes a problem as input, and then you can solve the problem with simulated
annealing, run_sa(). There are different ways to handle cooling, implemented in _calculate_temperature. The
acceptance value is calculated based on the Metropolis acceptance criterion.

By modifying the problem = ... assignment (below if __name__ == '__main__':), it’s possible to select another
problem (tsp, knapsack or rastrigin).

We need to have three methods in the example problems to make this code work:

1. baseline_solution()
This method creates the first solution (starting point) for a problem.

2. score_solution(solution)
The score_solution method calculates the objective value.

3. select_neighbor(solution)
We need to apply local moves to the solutions and select a neighbor; this will be implemented in
this method.

We are going to implement these three methods for three problems: traveling salesman, knapsack and the
Rastrigin function.
Example 1. Traveling Salesman
The first problem we are going to look at is the traveling salesman problem. In this problem, there are locations
that need to be visited. The goal is to minimize the distance traveled. Below you can see an example:
Example: 10 locations we want to visit and minimize the distance. Image by author.
import matplotlib.pyplot as plt
import numpy as np
import random
from typing import List


class TravelingSalesman():
    def __init__(self, n_locations: int = 10, locations: List[tuple] = None, height: int = 100, width: int = 100, starting_point: int = 0):
        self.name = 'traveling salesman'
        self.starting_point = starting_point
        self.height = height
        self.width = width
        if locations is None:
            locations = self._create_sample_data(n_locations)
        self.locations = locations
        self.n_locations = len(locations)
        self.distances = self._create_distances()

    def baseline_solution(self) -> list:
        # route that follows the locations list
        # start and end in start location
        baseline = [self.starting_point] + [i for i in range(self.n_locations) if i != self.starting_point] + [self.starting_point]
        self._plot_solution(baseline, title='baseline')
        self._plot_solution(baseline, title='dots', only_dots=True)
        return baseline

    def score_solution(self, solution: list) -> float:
        # add all distances
        return sum([self.distances[node, solution[i+1]] for i, node in enumerate(solution[:-1])])

    def select_neighbor(self, solution: list) -> list:
        # swap two locations (don't swap start and end)
        indici = random.sample(range(1, self.n_locations), 2)
        idx1, idx2 = indici[0], indici[1]
        value1, value2 = solution[idx1], solution[idx2]
        solution[idx1] = value2
        solution[idx2] = value1
        return solution

    def _create_sample_data(self, n_locations: int) -> List[tuple]:
        return [(random.random() * self.height, random.random() * self.width) for _ in range(n_locations)]

    def _plot_solution(self, solution: list, title: str = 'tsp', only_dots: bool = False):
        plt.clf()
        plt.rcParams["figure.figsize"] = [5, 5]
        plt.rcParams["figure.autolayout"] = True
        for n, location_id1 in enumerate(solution[:-1]):
            location_id2 = solution[n+1]
            x_values = [self.locations[location_id1][0], self.locations[location_id2][0]]
            y_values = [self.locations[location_id1][1], self.locations[location_id2][1]]
            if not only_dots:
                plt.plot(x_values, y_values, 'bo', linestyle="-")
            else:
                plt.plot(x_values, y_values, 'bo')
            plt.text(x_values[0]-2, y_values[0]+2, str(location_id1))
        plt.savefig(f'{title}')

    def _create_distances(self) -> np.array:
        distances = np.zeros(shape=(self.n_locations, self.n_locations))
        for ni, i in enumerate(self.locations):
            for nj, j in enumerate(self.locations):
                distances[ni, nj] = self._distance(i[0], i[1], j[0], j[1])
        return distances

    @staticmethod
    def _distance(x1: float, y1: float, x2: float, y2: float) -> float:
        return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

N
In this problem, the baseline solution is created by visiting the locations in sequence (0 to 9). For the example, it
gives us this route:
Baseline solution. Image by author.

This doesn’t look optimal, and it isn’t. A local move is defined by swapping two locations. The score of the
solution is the distance we need to travel. After running simulated annealing, this is the final solution:

Final solution. Image by author.


That looks better!


For small problems, this works okay (still not recommended). For larger ones, there are better solutions and
algorithms available, for example the Lin-Kernighan heuristic. What also helps is a better starting solution, e.g. a
greedy algorithm.
Example 2. Knapsack

The knapsack problem is a classic one, but for those who don’t know it, here follows an explanation.

Imagine you are in a cave full of beautiful treasures. Due to some unforeseen circumstances the cave is
collapsing. You have time to fill your knapsack with treasures and then you need to run away to safety. Of course,
you want to take the items with you that together bring most value. What items should you take?

The knapsack problem. The knapsack has a capacity of 50. What items should you select to maximize the value?
Image by author.
The data you need to have for solving this problem is the capacity of the knapsack, the capacity needed for the
items and the value of the items.
Below is the code that defines this problem:


import copy
import random
import numpy as np
from typing import List


class Knapsack():
    def __init__(self, knapsack_capacity: int, n_items: int = 20, item_values: list = None, item_capacities: list = None):
        self.name = 'knapsack'
        self.knapsack_capacity = knapsack_capacity
        if item_values is None and item_capacities is None:
            item_values, item_capacities = self._create_sample_data(n_items)
        self.item_values = item_values
        self.item_capacities = item_capacities
        self.n_items = len(item_values)

    def baseline_solution(self) -> list:
        # select random items until the knapsack is full
        capacity = 0
        solution = []
        while True:
            selected = random.choice([i for i in range(self.n_items) if i not in solution])
            if capacity + self.item_capacities[selected] > self.knapsack_capacity:
                break
            else:
                solution.append(selected)
                capacity += self.item_capacities[selected]
        return solution

    def score_solution(self, solution: list) -> int:
        # count the total value of this solution
        return -1 * sum([self.item_values[i] for i in solution])

    def select_neighbor(self, solution: list) -> list:
        # local move: remove / add / swap items
        solution_capacity = sum([self.item_capacities[i] for i in solution])
        possible_to_add = [i for i in range(self.n_items) if self.item_capacities[i] <= self.knapsack_capacity -
                           solution_capacity and i not in solution]
        if len(solution) == 0:
            move = 'add'
        elif len(possible_to_add) > 0:
            move = np.random.choice(['remove', 'add', 'swap'], p=[0.1, 0.6, 0.3])
        else:
            move = np.random.choice(['remove', 'swap'], p=[0.4, 0.6])

        while True:
            if move == 'remove':
                solution.pop(random.randrange(len(solution)))
                return solution
            elif move == 'add':
                new_solution = copy.deepcopy(solution)
                new_item = random.choice(possible_to_add)
                new_solution.append(new_item)
                return new_solution
            elif move == 'swap':
                n = 0
                while n < 50:
                    new_solution = copy.deepcopy(solution)
                    in_item = random.choice([i for i in range(self.n_items) if i not in solution])
                    out_item = random.choice(range(len(solution)))
                    new_solution.pop(out_item)
                    new_solution.append(in_item)
                    n += 1
                    if self._is_feasible(new_solution):
                        return new_solution
                move = 'remove'

    def _create_sample_data(self, n_items: int) -> List[list]:
        item_values = random.sample(range(2, 1000), n_items)
        item_capacities = random.sample(range(1, self.knapsack_capacity), n_items)
        return item_values, item_capacities

    def _is_feasible(self, solution: list) -> bool:
        return sum([self.item_capacities[i] for i in solution]) <= self.knapsack_capacity

The baseline solution selects an item at random until the knapsack is full. The solution score is the sum of values
of the items in the knapsack, multiplied by -1. This is necessary because the SA solver minimizes a given objective.
In this situation, there are three local moves possible: adding an item, removing an item or swapping two items.
This makes it possible to reach every solution possible in solution space. If we swap an item, we need to check if
the new solution is feasible.
In the next image you can see a sample run log file. There are 10 items we need to choose from. On top the item
values, below the capacity the items take, and on the third line the value densities (item value divided by item
capacity). Then the solution process starts. The solution contains the index number(s) of the selected items. In
the final solution, items 4, 5 and 8 are selected (counting starts at 0):

Example 3. Rastrigin Function

A function that is often used to test optimization algorithms is the Rastrigin function. In 3D it looks like this:

Rastrigin function 3D plot. Image by author.


It has many local optima. The goal is to find the global minimum, which is at coordinate (0, 0). It is easier to see in
a contour plot:

Contour plot of the Rastrigin function. Image by author.
The landscape consists of many local optima with the highest ones in the four corners and the lowest ones in the
center.
We can try to find the global minimum with simulated annealing. This problem is continuous instead of discrete,
and we want to find the values for x and y that minimize the Rastrigin function.
The Rastrigin function is defined on an n-dimensional domain as:

f(x) = 10n + Σᵢ ( xᵢ² − 10·cos(2πxᵢ) ),  with each xᵢ ∈ [−5.12, 5.12]

Let’s try to find the optimum for the function with three dimensions (x, y, and z). The domain is defined
by x and y, so the problem is exactly as the plots above.


from collections import Counter

import numpy as np
import random
from typing import List


class Rastrigin():
    def __init__(self, n_dims: int = 2):
        self.name = 'rastrigin'
        self.n_dims = n_dims

    def baseline_solution(self) -> list:
        solution = [random.uniform(-5.12, 5.12) for _ in range(self.n_dims)]
        return solution

    def score_solution(self, solution: list) -> float:
        score = self.n_dims * 10 + sum([(x**2 - 10*np.cos(2*np.pi*x)) for x in solution])
        return score

    def select_neighbor(self, solution: list, step_size: float = 0.1) -> list:
        perturbation = step_size * np.random.randn(self.n_dims)
        neighbor = solution + perturbation
        while not self._is_feasible(neighbor):
            perturbation = step_size * np.random.randn(self.n_dims)
            neighbor = solution + perturbation
        return neighbor

    def _is_feasible(self, solution: list) -> bool:
        # check that every coordinate stays within the Rastrigin domain
        return all(-5.12 <= x <= 5.12 for x in solution)

For the baseline solution, we select a random float for x and y between -5.12 and 5.12. The score of the solution
is equal to z (the outcome of the Rastrigin function). A neighbor is selected by taking a step into a random
direction with a step size set to 0.1. The feasibility check is done to make sure we stay in the domain.

A log of a run:

The final solution comes really close to the optimum.

O
But watch out, if you run the algorithm with more dimensions, it’s not guaranteed that you find the optimum:

N
K
H
S
E

As you can see, the final solution is a local optimum instead of the global one. It finds good coordinates for the
first two variables, but the third one is equal to 0.985, which is far away from 0. It’s important to verify the results
you get. This specific example will work well by finetuning the SA parameters, but for more dimensions you
should probably use another optimization technique that performs better.


Genetic Algorithms
Genetic Algorithms(GAs) are adaptive heuristic search algorithms that belong to the larger part of evolutionary
algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. These are intelligent
exploitation of random search provided with historical data to direct the search into the region of better
performance in solution space. They are commonly used to generate high-quality solutions for optimization
problems and search problems.
Genetic algorithms simulate the process of natural selection which means those species who can adapt to

changes in their environment are able to survive and reproduce and go to next generation. In simple words,
they simulate “survival of the fittest” among individual of consecutive generation for solving a problem. Each

generation consist of a population of individuals and each individual represents a point in search space and
possible solution. Each individual is represented as a string of character/integer/float/bits. This string is
analogous to the Chromosome.
Foundation of Genetic Algorithms

Genetic algorithms are based on an analogy with genetic structure and behaviour of chromosomes of the
population. Following is the foundation of GAs based on this analogy –

1. Individuals in the population compete for resources and mates.
2. Those individuals who are successful (fittest) then mate to create more offspring than others.
3. Genes from the “fittest” parents propagate throughout the generation; that is, sometimes parents
create offspring which is better than either parent.
4. Thus each successive generation is more suited for their environment.
Search space
The population of individuals are maintained within search space. Each individual represents a solution in search
S

space for given problem. Each individual is coded as a finite length vector (analogous to chromosome) of
components. These variable components are analogous to Genes. Thus a chromosome (individual) is composed
of several genes (variable components).


Fitness Score
A Fitness Score is given to each individual which shows the ability of an individual to “compete”. Individuals
having an optimal (or near optimal) fitness score are sought.
The GAs maintain a population of n individuals (chromosomes/solutions) along with their fitness scores. The
individuals having better fitness scores are given more chance to reproduce than others. The individuals with
better fitness scores are selected who mate and produce better offspring by combining chromosomes of
parents. The population size is static so room has to be created for new arrivals. So, some individuals die

and get replaced by new arrivals eventually creating new generation when all the mating opportunity of the old
population is exhausted. It is hoped that over successive generations better solutions will arrive while least fit
die.
Each new generation has on average more “better genes” than the individual (solution) of previous generations.
Thus each new generation has better “partial solutions” than previous generations. Once the offspring
produced have no significant difference from the offspring produced by previous populations, the population has
converged. The algorithm is then said to have converged to a set of solutions for the problem.

Operators of Genetic Algorithms
Once the initial generation is created, the algorithm evolves the generation using the following operators –
1) Selection Operator: The idea is to give preference to the individuals with good fitness scores and allow them
to pass their genes to successive generations.
2) Crossover Operator: This represents mating between individuals. Two individuals are selected using selection
operator and crossover sites are chosen randomly. Then the genes at these crossover sites are exchanged thus

O
creating a completely new individual (offspring). For example –

N
K
3) Mutation Operator: The key idea is to insert random genes in offspring to maintain the diversity in the
population to avoid premature convergence. For example –


The whole algorithm can be summarized as –

1) Randomly initialize population p

2) Determine fitness of population

3) Until convergence repeat:

      a) Select parents from population

      b) Crossover and generate new population

      c) Perform mutation on new population

      d) Calculate fitness for new population
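
As a compact illustration of this loop (independent of the string-matching example below), here is a rough Python sketch that evolves a bit string towards all ones; every name and parameter is chosen for illustration only.

import random

LENGTH, POP_SIZE, MUTATION_RATE = 20, 50, 0.05

def fitness(ind):                      # number of 1-bits (to be maximized)
    return sum(ind)

def crossover(p1, p2):                 # single-point crossover
    point = random.randint(1, LENGTH - 1)
    return p1[:point] + p2[point:]

def mutate(ind):                       # flip genes with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in ind]

# 1) randomly initialize population
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for generation in range(100):                        # 3) repeat until convergence
    population.sort(key=fitness, reverse=True)       # 2) determine fitness
    if fitness(population[0]) == LENGTH:
        break
    parents = population[:POP_SIZE // 2]             # a) select the fitter half as parents
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(POP_SIZE)]          # b) crossover, c) mutation, d) new population

best = max(population, key=fitness)
print("Best individual:", best, "Fitness:", fitness(best))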

Example problem and solution using Genetic Algorithms


Given a target string, the goal is to produce the target string starting from a random string of the same length. In
the following implementation, the following analogies are made –

 Characters A-Z, a-z, 0-9, and other special symbols are considered as genes
 A string generated by these characters is considered as chromosome/solution/Individual
Fitness score is the number of characters which differ from characters in target string at a particular index. So

individual having lower fitness value is given more preference.

 C++

// C++ program to create target string, starting from
// random string using Genetic Algorithm

#include <bits/stdc++.h>
using namespace std;

// Number of individuals in each generation
#define POPULATION_SIZE 100

// Valid Genes
const string GENES = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"\
"QRSTUVWXYZ 1234567890, .-;:_!\"#%&/()=?@${[]}";

// Target string to be generated
const string TARGET = "I love GeeksforGeeks";

// Function to generate random numbers in given range
int random_num(int start, int end)
{
    int range = (end-start)+1;
    int random_int = start+(rand()%range);
    return random_int;
}

// Create random genes for mutation
char mutated_genes()
{
    int len = GENES.size();
    int r = random_num(0, len-1);
    return GENES[r];
}

// create chromosome or string of genes
string create_gnome()
{
    int len = TARGET.size();
    string gnome = "";
    for(int i = 0;i<len;i++)
        gnome += mutated_genes();
    return gnome;
}

// Class representing individual in population
class Individual
{
public:
    string chromosome;
    int fitness;
    Individual(string chromosome);
    Individual mate(Individual parent2);
    int cal_fitness();
};

Individual::Individual(string chromosome)
{
    this->chromosome = chromosome;
    fitness = cal_fitness();
};

// Perform mating and produce new offspring
Individual Individual::mate(Individual par2)
{
    // chromosome for offspring
    string child_chromosome = "";

    int len = chromosome.size();
    for(int i = 0;i<len;i++)
    {
        // random probability
        float p = random_num(0, 100)/100;

        // if prob is less than 0.45, insert gene
        // from parent 1
        if(p < 0.45)
            child_chromosome += chromosome[i];

        // if prob is between 0.45 and 0.90, insert
        // gene from parent 2
        else if(p < 0.90)
            child_chromosome += par2.chromosome[i];

        // otherwise insert random gene(mutate),
        // for maintaining diversity
        else
            child_chromosome += mutated_genes();
    }

    // create new Individual(offspring) using
    // generated chromosome for offspring
    return Individual(child_chromosome);
};

// Calculate fitness score, it is the number of
// characters in string which differ from target
// string.
int Individual::cal_fitness()
{
    int len = TARGET.size();
    int fitness = 0;
    for(int i = 0;i<len;i++)
    {
        if(chromosome[i] != TARGET[i])
            fitness++;
    }
    return fitness;
};

// Overloading < operator
bool operator<(const Individual &ind1, const Individual &ind2)
{
    return ind1.fitness < ind2.fitness;
}

// Driver code
int main()
{
    srand((unsigned)(time(0)));

    // current generation
    int generation = 0;

    vector<Individual> population;
    bool found = false;

    // create initial population
    for(int i = 0;i<POPULATION_SIZE;i++)
    {
        string gnome = create_gnome();
        population.push_back(Individual(gnome));
    }

    while(! found)
    {
        // sort the population in increasing order of fitness score
        sort(population.begin(), population.end());

        // if the individual having lowest fitness score ie.
        // 0 then we know that we have reached to the target
        // and break the loop
        if(population[0].fitness <= 0)
        {
            found = true;
            break;
        }

        // Otherwise generate new offsprings for new generation
        vector<Individual> new_generation;

        // Perform Elitism, that mean 10% of fittest population
        // goes to the next generation
        int s = (10*POPULATION_SIZE)/100;
        for(int i = 0;i<s;i++)
            new_generation.push_back(population[i]);

        // From 50% of fittest population, Individuals
        // will mate to produce offspring
        s = (90*POPULATION_SIZE)/100;
        for(int i = 0;i<s;i++)
        {
            int len = population.size();
            int r = random_num(0, 50);
            Individual parent1 = population[r];
            r = random_num(0, 50);
            Individual parent2 = population[r];
            Individual offspring = parent1.mate(parent2);
            new_generation.push_back(offspring);
        }
        population = new_generation;
        cout<< "Generation: " << generation << "\t";
        cout<< "String: "<< population[0].chromosome <<"\t";
        cout<< "Fitness: "<< population[0].fitness << "\n";

        generation++;
    }
    cout<< "Generation: " << generation << "\t";
    cout<< "String: "<< population[0].chromosome <<"\t";
    cout<< "Fitness: "<< population[0].fitness << "\n";
}

Output:
Generation: 1    String: tO{"-?=jH[k8=B4]Oe@}    Fitness: 18
Generation: 2    String: tO{"-?=jH[k8=B4]Oe@}    Fitness: 18
Generation: 3    String: .#lRWf9k_Ifslw #O$k_    Fitness: 17
Generation: 4    String: .-1Rq?9mHqk3Wo]3rek_    Fitness: 16
Generation: 5    String: .-1Rq?9mHqk3Wo]3rek_    Fitness: 16
Generation: 6    String: A#ldW) #lIkslw cVek)    Fitness: 14
Generation: 7    String: A#ldW) #lIkslw cVek)    Fitness: 14
Generation: 8    String: (, o x _x%Rs=, 6Peek3   Fitness: 13
.
Generation: 29   String: I lope Geeks#o, Geeks   Fitness: 3
Generation: 30   String: I loMe GeeksfoBGeeks    Fitness: 2
Generation: 31   String: I love Geeksfo0Geeks    Fitness: 1
Generation: 32   String: I love Geeksfo0Geeks    Fitness: 1
Generation: 33   String: I love Geeksfo0Geeks    Fitness: 1
Generation: 34   String: I love GeeksforGeeks   Fitness: 0

Note: Every time the algorithm starts with random strings, so the output may differ.
As we can see from the output, our algorithm sometimes gets stuck at a local optimum solution; this can be further
improved by updating the fitness score calculation algorithm or by tweaking the mutation and crossover operators.

Why use Genetic Algorithms
 They are Robust
 Provide optimisation over large search spaces.
 Unlike traditional AI, they do not break on slight change in input or presence of noise

Application of Genetic Algorithms
Genetic algorithms have many applications, some of them are –

 Recurrent Neural Network
 Mutation testing
 Code breaking
 Filtering and signal processing
 Learning fuzzy rule base etc

What is Genetic Programming?


Genetic programming is a technique to create algorithms that can program themselves by simulating biological
breeding and Darwinian evolution. Instead of programming a model that can solve a particular problem, genetic
programming only provides a general objective and lets the model figure out the details itself. The basic
approach is to let the machine automatically test various simple evolutionary algorithms and then “breed” the
most successful programs in new generations.
While applying the same natural selection, crossover, mutations and other reproduction approaches as
evolutionary and genetic algorithms, gene programming takes the process a step further by automatically
creating new models and letting the system select its own goals.

The entire process is still an area of active research. One of the biggest obstacles to widespread adoption of this
genetic machine learning approach is quantifying the fitness function, i.e to what degree each new program is
contributing to reaching the desired goal.

How to Set Up a Genetic Program:


To set up a basic genetic program, a human first needs to define a high-level statement of the problem through
several preparatory steps:

 Specify terminals – For example, independent variables of the problem, zero-argument functions, or
random constants for each branch of the program that will go through evolution.
 Define initial “primitive” functions for each branch of the genetic program.
 Choose a fitness measure – this measures the fitness of individuals in the population to determine if they
should reproduce.
 Any special performance parameters for controlling the run.
 Select termination criterion and methods for reaching the run’s goals.

From there, the program runs automatically, without requiring any training data.

A random initial population (generation 0) of simple programs will be generated based upon the basic functions and
terminals defined by the human.
Each program will be executed and its fitness measured from the results. The most successful or “fit” programs
K
will be breeding stock to birth a new generation, with some new population members directly copied
(reproduction), some through crossover (randomly breeding parts of the programs) and random mutations.
Unlike evolutionary programming, an additional architecture-altering operation is chosen, similar to a species, to
A genetic algorithm(or GA) is a search technique used in computing to find true or approximate solutions to
optimization and search problems based on the theory of natural selection and evolutionary biology.
Vocabulary:

1. Individual: Any possible solution


2. Population: Group of all individuals

3. Fitness: Target function that we are optimizing (each individual has a fitness)

It starts from a population of randomly generated individuals and happens in generations. In each generation, the
fitness of every individual in the population is evaluated, multiple individuals are selected (based on their fitness),
and modified to form a new population. The new population is used in the next iteration of the algorithm. The

algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness
level has been reached for the population.

For example, consider a population of giraffes. Giraffes with slightly longer necks could feed on leaves of higher
branches when all lower ones had been eaten off. They had a better chance of survival. Favorable characteristics
propagated through generations of giraffes. Now, evolved species have long necks.

On the other hand, Genetic programming is a specialization of genetic algorithms where each individual is a
computer program. The main difference between genetic programming and genetic algorithms is the

representation of the solution.

The output of the genetic algorithm is a quantity, while the output of the genetic programming is another computer

program.

Introduction:
N
Genetic Programming(or GP) introduced by Mr. John Koza is a type of Evolutionary Algorithm (EA), a subset of
K
machine learning. EAs are used to discover solutions to problems humans do not know how to solve, directly.
H

Genetic programming is a systematic method for getting computers to automatically solve a problem and
iteratively transform a population of computer programs into a new generation of programs by applying analogs
S

of naturally occurring genetic operations. The genetic operations include crossover, mutation, reproduction, gene
duplication, and gene deletion.
E

Genetic Programming is the evolution of computer programs.


N
IG

Working:

It starts with an initial set of programs composed of functions and terminals that may be handpicked or randomly
V

generated. The functions may be standard arithmetic operations, programming operations, mathematical
functions, or logical functions. The programs compete with each other over given input and expected output
data. Each computer program in the population is measured in terms of how well it performs in a particular
problem environment. This measure is called a fitness measure. Top-performing programs are picked, mutation
and breeding are performed on them to generate the next generation. Next-generation competes with each
other, the process goes on until the perfect program is evolved.

Main Loop of Genetic Programming(source)

Here’s a concise explanation:

Make an Initial population, Evaluation(assign a fitness function to each program in the population), Selection of
‘fitter’ individuals, Variation(mutation, crossover, etc), Iteration, and Termination.

Program Representation:
N
K
Programs in genetic programming are expressed as syntax trees rather than as lines of code. Trees can be easily
evaluated recursively. The tree includes nodes(functions) and links(terminals). The nodes indicate the instructions
to execute and the links indicate the arguments for each instruction. For illustration consider the implementation
of the equation: x ∗ ((x%y) − sin(x)) + exp(x) shown in figure below. In this example, terminal set = {x, y} and
function set = {+, −, ∗, %, cos, sin, exp}


GP syntax tree(source)
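
To make the tree representation concrete, here is a rough Python sketch (not from the original article) that encodes the expression above as nested tuples and evaluates it recursively:

import math

# each node is (function, child, child, ...); leaves are terminal names
expr = ('+',
        ('*', 'x', ('-', ('%', 'x', 'y'), ('sin', 'x'))),
        ('exp', 'x'))

FUNCTIONS = {
    '+': lambda a, b: a + b,
    '-': lambda a, b: a - b,
    '*': lambda a, b: a * b,
    '%': lambda a, b: a % b,
    'sin': math.sin,
    'exp': math.exp,
}

def evaluate(node, terminals):
    if isinstance(node, tuple):                                   # function node
        args = (evaluate(child, terminals) for child in node[1:])
        return FUNCTIONS[node[0]](*args)
    return terminals[node]                                        # terminal (variable)

print(evaluate(expr, {'x': 2.0, 'y': 3.0}))   # x*((x%y)-sin(x)) + exp(x) at x=2, y=3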

FIVE MAJOR PREPARATORY STEPS FOR GP :

 Determining the set of terminals

 Determining the set of functions

 Determining the fitness measure

 Determining the parameters for the run

 Determining the method for designating a result and the criterion for terminating a run.

Now let’s look into how genetic operators like crossover and mutation can be applied on a subtree.

The crossover operator is used to exchange subtrees between two individuals.

Crossover Operator for Genetic Programming

GP applies point mutation, in which a random node in the tree is chosen and replaced with a different randomly
generated subtree.

Mutation Operator for Genetic Programming
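A hedged sketch of how subtree crossover and subtree-replacing mutation can be implemented on the nested-tuple trees used above is shown below; helper names such as random_tree, positions and replace are invented here for illustration and are not part of any GP library.

import random

TERMINALS = ['x', 'y']
FUNCTIONS = {'+': 2, '-': 2, '*': 2, 'sin': 1, 'exp': 1}   # name -> arity

def random_tree(depth=3):
    """Grow a random syntax tree down to at most the given depth."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(FUNCTIONS))
    return (op,) + tuple(random_tree(depth - 1) for _ in range(FUNCTIONS[op]))

def positions(tree, path=()):
    """Yield every node position as a path of child indices."""
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:]):
            yield from positions(child, path + (i,))

def subtree(tree, path):
    for i in path:
        tree = tree[i + 1]
    return tree

def replace(tree, path, new):
    if not path:
        return new
    i = path[0]
    children = list(tree[1:])
    children[i] = replace(children[i], path[1:], new)
    return (tree[0],) + tuple(children)

def crossover(parent1, parent2):
    """Swap a randomly chosen subtree of parent1 with a random subtree of parent2."""
    p1 = random.choice(list(positions(parent1)))
    p2 = random.choice(list(positions(parent2)))
    return replace(parent1, p1, subtree(parent2, p2))

def mutate(tree):
    """Replace a randomly chosen node with a freshly generated random subtree."""
    p = random.choice(list(positions(tree)))
    return replace(tree, p, random_tree(depth=2))

a, b = random_tree(), random_tree()
print(crossover(a, b))
print(mutate(a))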

The various types of Genetic Programming include:

 Tree-based Genetic Programming
 Stack-based Genetic Programming
 Linear Genetic Programming (LGP)
 Grammatical Evolution
 Extended Compact Genetic Programming (ECGP)
 Cartesian Genetic Programming (CGP)
 Probabilistic Incremental Program Evolution (PIPE)
 Strongly Typed Genetic Programming (STGP)
 Genetic Improvement of Software for Multiple Objectives (GISMO)

Five common languages used for genetic programming are:

1. LISP

2. Matlab

3. Python

4. Java

5. C

Applications:

Intrusion Detection Systems, cancer research, curve fitting, data modeling, symbolic regression, feature selection,

classification, game playing, quantum computing, etc.


What is data visualization? N

Data visualization is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and

patterns in data. Additionally, it provides an excellent way for employees or business owners to present data to
non-technical audiences without confusion.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of
information and make data-driven decisions.

What are the advantages and disadvantages of data visualization?

Something as simple as presenting data in graphic format may seem to have no downsides. But sometimes data
can be misrepresented or misinterpreted when placed in the wrong style of data visualization. When choosing

to create a data visualization, it’s best to keep both the advantages and disadvantages in mind.

Advantages

Our eyes are drawn to colors and patterns. We can quickly identify red from blue, and squares from circles. Our
culture is visual, including everything from art and advertisements to TV and movies. Data visualization is

another form of visual art that grabs our interest and keeps our eyes on the message. When we see a chart,
we quickly see trends and outliers. If we can see something, we internalize it quickly. It’s storytelling with a
purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend, you know how much
more effective a visualization can be.

Some other advantages of data visualization include:

 Easily sharing information.
 Interactively exploring opportunities.
 Visualizing patterns and relationships.

Disadvantages

While there are many advantages, some of the disadvantages may seem less obvious. For example, when
viewing a visualization with many different datapoints, it’s easy to make an inaccurate assumption. Or
sometimes the visualization is just designed wrong so that it’s biased or confusing.

Some other disadvantages include:

 Biased or inaccurate information.

 Correlation doesn’t always mean causation.

 Core messages can get lost in translation.

Why data visualization is important

The importance of data visualization is simple: it helps people see, interact with, and better understand data.
Whether simple or complex, the right visualization can bring everyone on the same page, regardless of their
level of expertise.

It’s hard to think of a professional industry that doesn’t benefit from making data more understandable. Every
STEM field benefits from understanding data—and so do fields in government, finance, marketing, history,

consumer goods, service industries, education, sports, and so on.

While we will always wax poetic about data visualization, there are
practical, real-life applications that are undeniable. And, since visualization is so prolific, it’s also one of the most
useful professional skills to develop. The better you can convey your points visually, whether in a dashboard or a
slide deck, the better you can leverage that information. The concept of the citizen data scientist is on the rise.

Skill sets are changing to accommodate a data-driven world. It is increasingly valuable for professionals to be
able to use data to make decisions and use visuals to tell stories of when data informs the who, what, when,
where, and how.
While traditional education typically draws a distinct line between creative storytelling and technical analysis,
the modern professional world also values those who can cross between the two: data visualization sits right in
the middle of analysis and visual storytelling.

Data visualization and big data



As the “age of Big Data” kicks into high gear, visualization is an increasingly key tool to make sense of the
trillions of rows of data generated every day. Data visualization helps to tell stories by curating data into a form
easier to understand, highlighting the trends and outliers. A good visualization tells a story, removing the noise
from data and highlighting useful information.

However, it’s not simply as easy as just dressing up a graph to make it look better or slapping on the “info” part
of an infographic. Effective data visualization is a delicate balancing act between form and function. The plainest
graph could be too boring to catch any notice or it may tell a powerful point; the most stunning visualization
could utterly fail at conveying the right message or it could speak volumes. The data and the visuals need to
work together, and there’s an art to combining great analysis with great storytelling.


Table of Contents
1. What is Data Visualization?
2. Benefits of Good Data Visualization
3. Different Types of Analysis for Data Visualization
4. Univariate Analysis Techniques for Data Visualization
 Distribution Plot
 Box and Whisker Plot
 Violin Plot
5. Bivariate Analysis Techniques for Data Visualization
 Line Plot
 Bar Plot
 Scatter Plot

What is Data Visualization?


Data visualization is defined as a graphical representation that contains the information and the data.

By using visual elements like charts, graphs, and maps, data visualization techniques provide an accessible way to
see and understand trends, outliers, and patterns in data.

In modern days we have a lot of data in our hands i.e, in the world of Big Data, data visualization tools, and
technologies are crucial to analyze massive amounts of information and make data-driven decisions.

It is used in many areas such as:

 To model complex events.
 To visualize phenomena that cannot be observed directly, such as weather patterns, medical conditions,
or mathematical relationships.

Benefits of Good Data Visualization


Since our eyes can capture colors and patterns, we can quickly distinguish the red portion from the blue and a
square from a circle. Our culture is visual, including everything from art and advertisements to TV and movies.

So, Data visualization is another technique of visual art that grabs our interest and keeps our main focus on the
message captured with the help of eyes.

Whenever we visualize a chart, we quickly identify the trends and outliers present in the dataset.

The basic uses of the Data Visualization technique are as follows:

 It is a powerful technique to explore the data with presentable and interpretable results.

 In the data mining process, it acts as a primary step in the pre-processing portion.
 It supports the data cleaning process by finding incorrect data and corrupted or missing values.

 It also helps to construct and select variables, which means we have to determine which variable to
include and discard in the analysis.
 In the process of Data Reduction, it also plays a crucial role while combining the categories.



Different Types of Analysis for Data Visualization


Mainly, there are three different types of analysis for Data Visualization:

Univariate Analysis: In the univariate analysis, we will be using a single feature to analyze almost all of its
properties.

Bivariate Analysis: When we compare the data between exactly 2 features then it is known as bivariate analysis.

Multivariate Analysis: In the multivariate analysis, we will be comparing more than 2 variables.

NOTE:

In this article, our main goal is to understand the following concepts:

 How do we draw inferences from the data visualization techniques?
 In which conditions is one technique more useful than the others?

We are not going to deep dive into the coding/implementation part of different techniques on a particular dataset
but we try to find the answer to the above questions and understand only the snippet code with the help of
sample plots for each of the data visualization techniques.

Now, let's get started with the different data visualization techniques:



Univariate Analysis Techniques for Data Visualization


1. Distribution Plot
 It is one of the best univariate plots to learn about the distribution of the data.
 When we want to analyze the impact on the target variable (output) with respect to an independent
variable (input), we use distribution plots a lot.
 This plot gives us a combination of both the probability density function (pdf) and the histogram in a single plot.

Implementation:

 The distribution plot is present in the Seaborn package.

The code snippet is as follows:
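The original notes do not reproduce the snippet here; a minimal sketch of the intended Seaborn call, assuming the Haberman survival dataframe hb (with an 'Age' column and the 'SurvStat' class column used elsewhere in this section) and an illustrative file path, would be:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed: the Haberman survival dataset with these column names, matching the
# hb / 'SurvStat' / 'axil_nodes' / 'op_yr' names used later in this section.
hb = pd.read_csv('haberman.csv', names=['Age', 'op_yr', 'axil_nodes', 'SurvStat'])

# Distribution plot (histogram + PDF) of 'Age', coloured by the class to be predicted.
# (distplot is the classic API; newer Seaborn versions use histplot/kdeplot instead.)
sns.FacetGrid(hb, hue='SurvStat', height=5).map(sns.distplot, 'Age').add_legend()
plt.show()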

Some conclusions inferred from the above distribution plot:

From the above distribution plot we can conclude the following observations:

 We have observed that we created a distribution plot on the feature ‘Age’(input variable) and we used
different colors for the Survival status(output variable) as it is the class to be predicted.
 There is a huge overlapping area between the PDFs for different combinations.
 In this plot, the sharp block-like structures are called histograms, and the smoothed curve is known as
the Probability density function(PDF).

NOTE:

The probability density function (PDF) of a curve can help us capture the underlying distribution of that feature,
which is one major takeaway from data visualization or exploratory data analysis (EDA).

2. Box and Whisker Plot

 This plot can be used to obtain more statistical details about the data.
 The straight lines at the maximum and minimum are also called whiskers.
 Points that lie outside the whiskers are considered outliers.
 The box plot also gives us a description of the 25th, 50th, and 75th quartiles.
 With the help of a box plot, we can also determine the interquartile range (IQR), where most of the data lies.
Therefore, it can also give us a clear idea about the outliers in the dataset.

Fig. General Diagram for a Box-plot

Implementation:

 Boxplot is available in the Seaborn library.


 Here x is considered as the dependent variable and y is considered as the independent variable. These
box plots come under univariate analysis, which means that we are exploring data only with one variable.
 Here we are trying to check the impact of a feature named “axil_nodes” on the class named “Survival
status” and not between any two independent features.

The code snippet is as follows:

sns.boxplot(x='SurvStat',y='axil_nodes',data=hb)


Some conclusions inferred from the above box plot:



From the above box and whisker plot we can conclude the following observations:

 How much data is present in the 1st quartile and how many points are outliers, etc.
 For class 1, we can see that very little or no data is present between the median and the 1st quartile.
 There are more outliers for class 1 in the feature named axil_nodes.

NOTE:

We can get details about outliers that will help us prepare the data well before feeding it to a model, since
outliers influence many machine learning models.

3. Violin Plot

 The violin plots can be considered as a combination of Box plot at the middle and distribution plots(Kernel
Density Estimation) on both sides of the data.

 This can give us the description of the distribution of the dataset like whether the distribution
is multimodal, Skewness, etc.
 It also gives us useful information like a 95% confidence interval.

Fig. General Diagram for a Violin-plot

Implementation:

 The Violin plot is present in the Seaborn package.



The code snippet is as follows:



sns.violinplot(x='SurvStat',y='op_yr',data=hb,size=6)

Downloaded by Thasnim K ([email protected])


lOMoARcPSD|31939554

VIGNESH K NOTES
DS4015 BIG DATA ANALYTICS
Some conclusions inferred from the above violin plot:

From the above violin plot we can conclude the following observations:

 The median of both classes is close to 63.


 The maximum number of persons with class 2 has an op_yr value of 65 whereas, for persons in class1,
the maximum value is around 60.
 Also, the 3rd quartile to median has a lesser number of data points than the median to the 1st quartile.

Bivariate Analysis Techniques for Data Visualization
1. Line Plot

 This is the plot that you can see in the nook and corners of any sort of analysis between 2 variables.

 In a line plot, the values at a series of data points are simply connected with straight lines.
 The plot may seem very simple but it has more applications not only in machine learning but in many
other areas.

Implementation:
 The line plot is present in the Matplotlib package.

The code snippet is as follows:



plt.plot(x,y)

Some conclusions inferred from the above line plot:

From the above line plot we can conclude the following observations:

 These are used right from performing distribution Comparison using Q-Q plots to CV tuning using
the elbow method.
 Used to analyze the performance of a model using the ROC- AUC curve.

2. Bar Plot

 This is one of the widely used plots, that we would have seen multiple times not just in data analysis, but
we use this plot also wherever there is a trend analysis in many fields.

 Though it may seem simple it is powerful in analyzing data like sales figures every week, revenue from a
product, Number of visitors to a site on each day of a week, etc.

Implementation:

 The bar plot is present in the Matplotlib package.

The code snippet is as follows: N


plt.bar(x,y)

Some conclusions inferred from the above bar plot:

From the above bar plot we can conclude the following observations:

 We can visualize the data in a clean plot and convey the details straightforwardly to others.
 This plot may be simple and clear, but it is not very frequently used in data science applications.

3. Scatter Plot

 It is one of the most commonly used plots used for visualizing simple data in Machine learning and Data
Science.
 This plot gives us a representation in which each point in the entire dataset is plotted with respect
to any 2 to 3 features (columns).
 Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the common one, where
we will primarily try to find the patterns, clusters, and separability of the data.

Implementation:

 The scatter plot is present in the Matplotlib package.

The code snippet is as follows:

plt.scatter(x,y)


Some conclusions inferred from the above Scatter plot:



From the above Scatter plot we can conclude the following observations:



 The colors are assigned to different data points based on how they are labelled in the dataset, i.e., the target
column representation.
 We can color the data points as per their class label given in the dataset.

What is a data type?

A data type is an attribute of a piece of data that tells a device how the end-user might interact with the data.
You can also think of them as categorizations that different coding programs might combine in order to execute
certain functions. Most programming languages including C++ and Java use the same basic data types.


10 data types

Each programming language uses a different combination of data types. Some of these types include:

1. Integer

Integer data types often represent whole numbers in programming. An integer's value moves from one integer
to another without acknowledging fractional numbers in between. The number of digits can vary based on the
device, and some programming languages may allow negative values.

2. Character
In coding, alphabet letters denote characters. Programmers might represent these data types as (CHAR) or
(VARCHAR), and they can be single characters or a string of letters. Characters are usually fixed-length figures
that default to 1 octet—an 8-bit unit of digital information—but can increase to 65,000 octets.




3. Date

This data type stores a calendar date with other programming information. Dates are typically a combination of
integers or numerical figures. Since these are typically integer values, some programs can store basic
mathematical operations like days elapsed since certain events or days away from an upcoming event.

4. Floating point (real)



Floating-point data types represent fractional numbers in programming. There are two main floating-point data
types, which vary depending on the number of allowable values in the string:

 Float: A data type that typically allows up to seven points after a decimal.
 Double: A data type that allows up to 15 points after a decimal.

5. Long

Long data types are often 32- or 64-bit integers in code. Sometimes, these can represent integers with 20 digits
in either direction, positive or negative. Programmers use an ampersand to indicate the data type is a long
variable.

6. Short

Similar to the long data type, a short is a variable integer. Programmers represent these as whole numbers, and
they can be positive or negative. Sometimes a short data type is a single integer.

7. String

A string data type is a combination of characters that can be either constant or variable. This often incorporates
a sequence of character data types that result in specific commands depending on the programming language.
Strings can include both upper and lowercase letters, numbers and punctuation.

8. Boolean

Boolean data is what programmers use to show logic in code. It's typically one of two values—true or false—
intended to clarify conditional statements. These can be responses to "if/when" scenarios, where code indicates
if a user performs a certain action. When this happens, the Boolean data directs the program's response, which
determines the next code in the sequence.

9. Nothing

The nothing data type shows that a code has no value. This might indicate that a code is missing, the
programmer started the code incorrectly or that there were values that defy the intended logic. It's also called
the "nullable type."



10. Void

Similar to the nothing type, the void type contains a value that the code cannot process. Void data types tell a
user that the code can't return a response. Programmers might use or encounter the void data type in early
system testing when there are no responses programmed yet for future steps.

Data type examples

Data types can vary based on size, length and use depending on the coding language. Here are some examples
of the data types listed above that you might encounter when programming:

Integer

Integers are digits that account for whole numbers only. Some integer examples include:

 425
 65
 9

Character

Characters are letters or other figures that programmers might combine in a string. Examples of characters
include:

Date

Programmers can include individual dates, ranges or differences in their code. Some examples might be:

 2009-09-15


1998-11-30 09:45:87
 SYSDATETIME()
Long

Long data types are whole numbers, both positive and negative, that have many place values. Examples include:

 -398,741,129,664,271

 9,000,000,125,356,546

Short

Short data types can be up to several integers, but they are always less than long data. Examples include:



-27,400
 5,428
 17

Floating point (real)

Float data types might look like this:

 float num1 = 1.45E2


 float num2 = 9.34567

Similar but often longer in length, an example of the floating-point double might be:

 double num2 = 1.87358497267482791E+222


 double num2 = 3.198728764857268945

The floating-point double type can provide more accurate values, but it also may require additional memory to
process.

S
String

Strings are a combination of figures that includes letters and punctuation. In some code, this might look like this:

 String a = new String("Open")


 String b = new String("The door")

 String c = new String("Say Hello!")

These can be independent commands, or they can work together.

Boolean
K
Boolean data can help guide the logic in a code. Here are some examples of how you might use this:

 bool baseballIsBest = false;


 bool footballIsBest = true;

Depending on the program, the code may direct the end-user to different screens based on their selection.

Nothing
N

Nothing means a code has no value, but the programmer coded something other than the digit 0. This is often
"Null," "NaN" or "Nothing" in code. An example of this is:

 Dim option = Nothing


 Program.WriteWords(x Is Nothing)
V

Void

The void data type in coding functions as an indicator that code might not have a function or a response yet.
This might appear as:

 int function_name (void)
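As a small, hedged illustration of several of these data types in Python (one of the languages mentioned earlier in these notes), the values below mirror the examples above; the variable names are made up for illustration.

from datetime import date

age = 425                             # integer
letter = 'A'                          # character (a length-1 string in Python)
order_date = date(2009, 9, 15)        # date
price = 9.34567                       # floating point
big_number = 9_000_000_125_356_546    # long-style whole number (Python ints are unbounded)
name = "Say Hello!"                   # string
football_is_best = True               # boolean
nothing_here = None                   # "nothing"/null value

for value in (age, letter, order_date, price, big_number, name, football_is_best, nothing_here):
    print(type(value).__name__, value)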

What is Interactive Data Visualization?
Interactive data visualization supports exploratory thinking so that decision-makers can actively investigate
intriguing findings. Interactive visualization supports faster decision making, greater data access and stronger
user engagement along with desirable results in several other metrics. Some of the key findings include:

 70% of the interactive visualization adopters improve collaboration and knowledge sharing.

 64% of the interactive visualization adopters improve user trust in underlying data.

 Interactive Visualization users engage data more frequently.

 Interactive visualizers are more likely than static visualizers to be easily satisfied with the use of
analytical tools.
Examples of Interactive Data Visualization

 MailChimp (Interactive Annual Report)
 The New Yorker (Interactive Visual Content for Media)
 SAP Intouch Wall (Interactive Executive Presentations)
 Bloomberg (Interactive Financial Data)
 The Lowy Institute Poll (Interactive Polling Data)
 Pulitzer Centre (Interactive Data-Driven Campaigns)
Some other examples of interactive visuals as communication tools include:

 Sales Presentations

 Training Modules

 Product Collateral

 Shareholder Presentations

 Educational Content

 Press Releases and PR Content


Visuals are especially helpful when you're trying to find relationships among hundreds or thousands of variables
to determine their relative importance.

What are the benefits of Interactive Data Visualization?



The benefits of Interactive Data Visualization are listed below:

Identifying Causes and Trends Quickly


Today, 93% of human communication is visual, and the human eye processes images about 60,000 times
faster than text-based data.

Relationships Between Tasks and Business Operations

By interacting with data to put the focus on specific metrics, decision-makers are able to compare specific
metrics throughout definable timeframes.

Telling Story Through Data


By allowing users to interact with data present in a clear visual manner, a data-intensive story becomes visible.

Use Cases of Data Visualization

 History of Bruce Springsteen
 Apollo
 Keuzestress
 Marvel Cinematic Universe
 The Many Moons of Jupiter
 Newsmap
 The Big Mac Index
 CF Weather Charts
 Galaxy of Covers
 Red Bull Party Visualization
 Figures in the Sky
 The Women of Data Viz

Understanding Data Visualization Techniques

Data visualization is a graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and
patterns in data. This section on data visualization techniques will help you understand detailed techniques and
benefits.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive
amounts of information and make data-driven decisions.

Benefits of good data visualization

Our eyes are drawn to colours and patterns. We can quickly identify red from blue, and square from the circle.
Our culture is visual, including everything from art and advertisements to TV and movies.

Data visualization is another form of visual art that grabs our interest and keeps our eyes on the message. When
we see a chart, we quickly see trends and outliers. If we can see something, we internalize it quickly. It’s
storytelling with a purpose. If you’ve ever stared at a massive spreadsheet of data and couldn’t see a trend, you

know how much more effective a visualization can be. The uses of data visualization are as follows:

 Powerful way to explore data with presentable results.
 Primary use is the pre-processing portion of the data mining process.
 Supports the data cleaning process by finding incorrect and missing values.

 For variable derivation and selection, which means determining which variables to include and
which to discard in the analysis.

 Also plays a role in combining categories as part of the data reduction process.
Data Visualization Techniques
 Box plots
 Histograms
 Heat maps

 Charts
 Tree maps

 Word Cloud/Network diagram




Box Plots

A boxplot is a standardized way of displaying the distribution of data based on a
five-number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can tell
you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly
V

your data is grouped, and if and how your data is skewed.

A box plot is a graph that gives you a good indication of how the values in the data are spread out. Although box
plots may seem primitive in comparison to a histogram or density plot, they have the advantage of taking up
less space, which is useful when comparing distributions between many groups or datasets. For some
distributions/datasets, you will find that you need more information than the measures of central tendency
(median, mean, and mode). You need to have information on the variability or dispersion of the data.

List of Methods to Visualize Data

 Column Chart: It is also called a vertical bar chart where each category is represented by a
rectangle. The height of the rectangle is proportional to the values that are plotted.
 Bar Graph: It has rectangular bars in which the lengths are proportional to the values which
are represented.
 Stacked Bar Graph: It is a bar-style graph that has various components stacked together so
that, apart from the bar, the components can also be compared to each other.
 Stacked Column Chart: It is similar to a stacked bar; however, the data is stacked horizontally.
 Area Chart: It combines the line chart and bar chart to show how the numeric values of one or
more groups change over the progress of a viable area.
 Dual Axis Chart: It combines a column chart and a line chart and then compares the two
variables.
 Line Graph: The data points are connected through a straight line, therefore creating a
representation of the changing trend.
 Mekko Chart: It can be called a two-dimensional stacked chart with varying column widths.
 Pie Chart: It is a chart where various components of a data set are presented in the form of a
pie which represents their proportion in the entire data set.
 Waterfall Chart: With the help of this chart, the increasing effect of sequentially introduced
positive or negative values can be understood.
 Bubble Chart: It is a multi-variable graph that is a hybrid of a Scatter Plot and a Proportional
Area Chart.
 Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to denote
values for two different numeric variables.
 Bullet Graph: It is a variation of a bar graph. A bullet graph is used to swap dashboard gauges
and meters.
 Funnel Chart: The chart determines the flow of users with the help of a business or sales
process.
 Heat Map: It is a technique of data visualization that shows the level of instances as color in
two dimensions.

Five Number Summary of Box Plot

"Minimum": Q1 - 1.5*IQR
First quartile (Q1/25th Percentile): the middle number between the smallest number (not the "minimum") and the median of the dataset
Median (Q2/50th Percentile): the middle value of the dataset
Third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the "maximum") of the dataset
"Maximum": Q3 + 1.5*IQR
Interquartile range (IQR): the range from the 25th to the 75th percentile
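A minimal sketch of how the five-number summary and whisker limits can be computed with NumPy is shown below; the sample data is made up purely for illustration.

import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr   # the "minimum" in the box-plot sense
upper_whisker = q3 + 1.5 * iqr   # the "maximum" in the box-plot sense

print(q1, median, q3, iqr, lower_whisker, upper_whisker)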

Histograms
N
A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups
numbers into ranges. Taller bars show that more data falls in that range. A histogram displays the shape and
spread of continuous sample data.



It is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set
of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal
distribution), outliers, skewness, etc. It is an accurate representation of the distribution of numerical data, and it
relates to only one variable. It uses bins (or buckets), i.e., ranges of values that divide the entire range of values
into a series of intervals, and then counts how many values fall into each interval.

series of intervals and then count how many values fall into each interval.
IG

Bins are consecutive, non- overlapping intervals of a variable. As the adjacent bins leave no gaps, the rectangles
of histogram touch each other to indicate that the original value is continuous.
V

Histograms are based on area, not height of bars

In a histogram, the height of the bar does not necessarily indicate how many occurrences of scores there were
within each bin. It is the product of height multiplied by the width of the bin that indicates the frequency of
occurrences within that bin. One of the reasons that the height of the bars is often incorrectly assessed as
indicating the frequency and not the area of the bar is because a lot of histograms often have equally spaced
bars (bins), and under these circumstances, the height of the bin does reflect the frequency.


Histogram Vs Bar Chart

The major difference is that a histogram is only used to plot the frequency of score occurrences in a continuous
data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for a lot of
other types of variables including ordinal and nominal data sets.

Heat Maps

A heat map is data analysis software that uses colour the way a bar graph uses height and width: as a data
visualization tool. If you're looking at a web page and you want to know which areas get the most attention, a
heat map shows you in a visual way that's easy to assimilate and make decisions from. It is a graphical
representation of data where the individual values contained in a matrix are represented as colours. It is useful
for two purposes: visualizing correlation tables and visualizing missing values in the data. In both cases, the
information is conveyed in a two-dimensional table.

Note that heat maps are useful when examining a large number of values, but they are not a replacement for
more precise graphical displays, such as bar charts, because colour differences cannot be perceived accurately.
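As a minimal sketch of the correlation-table use of a heat map, the snippet below applies Seaborn to a made-up numeric dataframe; the column names and data are illustrative only.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative dataframe with made-up columns; any numeric dataframe works here.
df = pd.DataFrame(np.random.rand(100, 4), columns=['a', 'b', 'c', 'd'])

# Heat map of the correlation matrix: each cell's colour encodes a correlation value.
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()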

Charts

Line Chart

The simplest technique, a line plot is used to plot the relationship or dependence of one variable on another. To
plot the relationship between the two variables, we can simply call the plot function.

Bar Charts

Bar charts are used for comparing the quantities of different categories or groups. Values of a category are
represented with the help of bars and they can be configured with vertical or horizontal bars, with the length or
height of each bar representing the value.

Pie Chart

It is a circular statistical graph which is divided into slices to illustrate numerical proportion. Here the arc length of
each slice is proportional to the quantity it represents. As a rule, they are used to compare the parts of a whole
and are most effective when there are limited components and when text and percentages are included to

describe the content. However, they can be difficult to interpret because the human eye has a hard time
estimating areas and comparing visual angles.

Scatter Charts

Another common visualization technique is a scatter plot that is a two-dimensional plot representing the joint
variation of two data items. Each marker (symbols such as dots, squares and plus signs) represents an
observation. The marker position indicates the value for each observation. When you assign more than two

S
measures, a scatter plot matrix is produced that is a series scatter plot displaying every possible pairing of the

TE
measures that are assigned to the visualization. Scatter plots are used for examining the relationship, or
correlations, between X and Y variables.

Bubble Charts

O
It is a variation of scatter chart in which the data points are replaced with bubbles, and an additional dimension
of data is represented in the size of the bubbles.

Timeline Charts
N
K
Timeline charts illustrate events, in chronological order — for example the progress of a project, advertising
campaign, acquisition process — in whatever unit of time the data was recorded — for example week, month,
H

year, quarter. It shows the chronological sequence of past or future events on a timescale.
S

Tree Maps
E

A treemap is a visualization that displays hierarchically organized data as a set of nested rectangles, parent
elements being tiled with their child elements. The sizes and colours of rectangles are proportional to the values
N

of the data points they represent. A leaf node rectangle has an area proportional to the specified dimension of
the data. Depending on the choice, the leaf node is coloured, sized or both according to chosen attributes. They
IG

make efficient use of space, thus display thousands of items on the screen simultaneously.

Word Clouds and Network Diagrams for Unstructured Data



The variety of big data brings challenges because semi-structured, and unstructured data require new
visualization techniques. A word cloud visual represents the frequency of a word within a body of text with its
relative size in the cloud. This technique is used on unstructured data as a way to display high- or low-frequency
words.
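A small sketch of a word cloud using the third-party wordcloud package together with Matplotlib; the text is made up, and the package must be installed separately (it is not part of the standard library).

import matplotlib.pyplot as plt
from wordcloud import WordCloud   # pip install wordcloud

# Illustrative unstructured text; in practice this could be tweets, reviews, logs, etc.
text = ("big data stream analytics visualization hadoop spark stream data "
        "velocity volume variety real-time analytics data data stream")

# Word frequency is encoded as relative font size in the cloud.
cloud = WordCloud(width=600, height=400, background_color='white').generate(text)
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()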

Another visualization technique that can be used for semi-structured or unstructured data is the network
diagram. Network diagrams represent relationships as nodes (individual actors within the network) and ties
(relationships between the individuals). They are used in many applications, for example for analysis of social
networks or mapping product sales across geographic areas.


FAQs Related to Data Visualization

 What are the techniques of visualization?
A: The visualization techniques include Pie and Donut Charts, Histogram Plot, Scatter Plot, Kernel Density
Estimation for Non-Parametric Data, Box and Whisker Plot for Large Data, Word Clouds and Network Diagrams
for Unstructured Data, and Correlation Matrices.

 What are the types of visualization?
A: The various types of visualization include Column Chart, Line Graph, Bar Graph, Stacked Bar Graph, Dual-Axis
Chart, Pie Chart, Mekko Chart, Bubble Chart, Scatter Chart, and Bullet Graph.

 What are the various visualization techniques used in data analysis?
A: Various visualization techniques are used in data analysis. A few of them include Box and Whisker Plot for
Large Data, Histogram Plot, and Word Clouds and Network Diagrams for Unstructured Data.

 How do I start visualizing?
A: You need to have a basic understanding of data and present it without misleading the data. Once you
understand it, you can further take up an online course or tutorials.

 What are the two basic types of data visualization?
A: The two basic types of data visualization are exploration and explanation.

 Which is the best visualization tool?
A: Some of the best visualization tools include Visme, Tableau, Infogram, Whatagraph, Sisense, DataBox,
ChartBlocks, DataWrapper, etc.


UNIT III MINING DATA STREAMS

Introduction To Streams Concepts – Stream Data Model and Architecture - Stream Computing -
Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream –
Estimating Moments – Counting Oneness in a Window – Decaying Window - Real time Analytics
Platform(RTAP) Applications - Case Studies - Real Time Sentiment Analysis, Stock Market
Predictions

Stream in Data Analytics

Introduction to stream concepts :

A data stream is a continuous, ordered (implicitly by arrival time or explicitly by timestamp) chain of items. It is
infeasible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.
Data streams involve enormous volumes of data, and items arrive at a high rate.

Types of Data Streams :



 Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple is comprised of a set of attributes, similar to
a row in a database table.

 Transactional data stream –



It is a log of interactions between entities, for example:



1. Credit card – purchases by consumers from producer


2. Telecommunications – phone calls by callers to the dialed parties

3. Web – accesses by clients of information at servers


 Measurement data streams –
1. Sensor Networks – a physical natural phenomenon, road traffic
V

2. IP Network – traffic at router interfaces


3. Earth climate – temperature, humidity level at weather stations
Examples of Stream Sources-
1. Sensor Data –
In navigation systems, sensor data is used. Imagine a temperature sensor floating about in the
ocean, sending back to the base station a reading of the surface temperature each hour. The data
generated by this sensor is a stream of real numbers. We have 3.5 terabytes arriving every day and

we certainly need to think about what can be kept for continued processing and what can only be archived.

2. Image Data –
Satellites frequently send down-to-earth streams containing many terabytes of images per day.
Surveillance cameras generate images with lower resolution than satellites, but there can be
numerous of them, each producing a stream of images at a break of 1 second each.

S
3. Internet and Web Traffic –
A switching node in the middle of the internet receives streams of IP packets from many inputs and
routes them to its outputs. Websites receive streams of heterogeneous types. For example, Google
receives a hundred million search queries per day.
Characteristics of Data Streams :
1. Large volumes of continuous data, possibly infinite.

O
2. Steady changing and requires a fast, real-time response.
3. Data stream captures nicely our data processing needs of today.
4.
5.
6.
Random access is expensive and a single scan algorithm
Store only the summary of the data seen so far. N
Maximum stream data are at a pretty low level or multidimensional in creation, needs multilevel
K
and multidimensional treatment.
Applications of Data Streams :
1. Fraud perception
H

2. Real-time goods dealing


3. Consumer enterprise
S

4. Observing and describing on inside IT systems


E

Advantages of Data Streams:

 This data is helpful in upgrading sales.
 Helps in recognizing fallacies.
 Helps in minimizing costs.
 It provides details to react swiftly to risk.

Disadvantages of Data Streams:

 Lack of security of data in the cloud.
 Dependence on the cloud provider.
 Off-premises storage of data introduces the potential for disconnection.

What is streaming data architecture? Find out how to stream data model and architecture in big data
Before we get to streaming data architecture, it is vital that you first understand streaming data. Streaming data
is a general term used to describe data that is generated continuously at high velocity and in large volumes.
A stream data source is characterized by continuous time-stamped logs that document events in real time.
Examples include a sensor reporting the current temperature, or a user clicking a link on a web page. Stream
data sources include:

 Server and security logs

S
 Clickstream data from websites and apps

TE
 IoT sensors

 Real-time advertising platforms

O
N
K
H
S
E
N

Therefore, a streaming data architecture is a dedicated network of software components capable of ingesting

and processing copious amounts of stream data from many sources. Unlike conventional data architecture
solutions, which focus on batch reading and writing, a streaming data architecture ingests data as it is generated
in its raw form, stores it, and may incorporate different components for real-time data processing and

manipulation.

An effective streaming architecture must account for the distinctive characteristics of data streams which tend
to generate copious amounts of structured and semi-structured data that requires ETL and pre-processing to be
useful.

Due to its complexity, stream processing cannot be solved with one ETL tool or database. That’s why
organizations need to adopt solutions consisting of multiple building blocks that can be combined with data
pipelines within the organization’s data architecture.

Although stream processing was initially considered a niche technology, it is hard to find a modern business that
does not have an eCommerce site, an online advertising strategy, an app, or products enabled by IoT.

Each of these digital assets generates real-time event data streams, thus fueling the need to implement a

streaming data architecture capable of handling powerful, complex, and real-time analytics.

Batch processing vs. real-time stream processing
In batch data processing, data is downloaded in batches before being processed, stored, and analyzed. On the
other hand, stream processing ingests data continuously, allowing it to be processed simultaneously and in real time.


The complexity of the current business requirements has rendered legacy data processing methods obsolete
because they do not collect and analyze data in real-time. This doesn’t work for modern organizations as they
need to act on data in real-time before it becomes stale.

Benefits of stream data processing
The main benefit of stream processing is real-time insight. We live in an information age where new data is
constantly being created. Organizations that leverage streaming data analytics can take advantage of real-time
information from internal and external assets to inform their decisions, drive innovation and improve their
overall strategy. Here are a few other benefits of data stream processing:

Handle the never-ending stream of events natively

Batch processing tools need to gather batches of data and integrate the batches to gain a meaningful
conclusion. By reducing the overhead delays associated with batching events, organizations can gain instant

insights from huge amounts of stream data.

Real-time data analytics and insights

Stream processing processes and analyzes data in real-time to provide up-to-the-minute data analytics and
insights. This is very beneficial to companies that need real-time tracking and streaming data analytics on their

processes. It also comes in handy in other scenarios such as detection of fraud and data breaches and machine
performance analysis.
K
H
S
E
N
IG
V

Simplified data scalability

Batch processing systems may be overwhelmed by growing volumes of data, necessitating the addition of other
resources, or a complete redesign of the architecture. On the other hand, modern streaming data architectures
are hyper-scalable, with a single stream processing architecture capable of processing gigabytes of data per
second [4].

Detecting patterns in time-series data

Detection of patterns in time-series data, such as analyzing trends in website traffic statistics, requires data to
be continuously collected, processed, and analyzed. This process is considerably more complex in batch
processing as it divides data into batches, which may result in certain occurrences being split across different
batches.

Increased ROI

The ability to collect, analyze and act on real-time data gives organizations a competitive edge in their

respective marketplaces. Real-time analytics makes organizations more responsive to customer needs, market
trends, and business opportunities.

Improved customer satisfaction

O
Organizations rely on customer feedback to gauge what they are doing right and what they can improve on.

Organizations that respond to customer complaints and act on them promptly generally have a good reputation
[5].
Fast responsiveness to customer complaints, for example, pays dividends when it comes to online reviews and
word-of-mouth advertising, which can be deciding factor for attracting prospective customers and converting
them into actual customers.

Losses reduction

In addition to supporting customer retention, stream processing can prevent losses as well by providing

warnings of impending issues such as financial downturns, data breaches, system outages, and other issues that
negatively affect business outcomes. With real-time information, a business can mitigate, or even prevent the

impact of these events.



Streaming data architecture: Use cases


Traditional batch architectures may suffice in small-scale applications [6]. However, when it comes to streaming
sources like servers, sensors, clickstream data from apps, real-time advertising, and security logs, stream data
V

becomes a vital necessity as some of these processes may generate up to a gigabyte of data per second.

Stream processing is also becoming a vital component in many enterprise data infrastructures. For example,
organizations can use clickstream analytics to track website visitor behaviors and tailor their content
accordingly.

Likewise, historical data analytics can help retailers show relevant suggestions and prevent shopping cart
abandonment. Another common use case scenario is IoT data analysis, which typically involves analyzing large
streams of data from connected devices and sensors.

Streaming data architecture: Challenges


Streaming data architectures require new technologies and process bottlenecks. The intricate complexity of
these systems can lead to failure, especially when components and processes stall or become too slow [7]. Here

are some of the most common challenges in streaming data architecture, along with possible solutions.

Business Integration hiccups

Most organizations have many lines of business and applications teams, each working concurrently on its own
mission and challenges. For the most part, this works fairly seamlessly for a while until various teams need to
integrate and manipulate real-time event data streams.

Organizations can federate the events by multiple integration points so that the actions of one or more teams
don’t inadvertently disrupt the entire system.

Scalability bottlenecks

As an organization grows, so do its datasets. When the current system is unable to handle the growing datasets,
operations become a major problem. For example, backups take much longer and consume a significant
number of resources. Similarly, rebuilding indexes, reorganizing historical data, and defragmenting storage
becomes more time-consuming and resource-intensive operations.

To solve this, organizations can check the production environment loads. By test-running the expected load of
the system using past data before implementing it, they can find and fix problems [8].

Fault Tolerance and data guarantees

These are crucial considerations when working with stream processing or any other distributed system. Since
data comes from different sources in varying volumes and formats, an organization’s systems must be able to
stop disruptions from any point of failure and effectively store large streams of data.

Components of a streaming data architecture


Streaming data architectures are built on an assembly line of proprietary and open-source software solutions
that address specific problems such as data integration, stream processing, storage and real-time analysis. Here

S
are some of its components:

TE
Message broker (Stream Processor)

O
N
K
H
S
E

This message broker collects data from a source, also known as a producer, converts it to a standard message
format, and then streams it for consumption by other components such as data warehouses, and ETL tools,
N

among others.
IG

Despite their high throughput, stream processors don’t do any data transformation or task scheduling. First-
V

generation stream processors such as Apache ActiveMQ and RabbitMQ relied on the Message Oriented
Middleware (MOM) paradigm. These systems were later replaced by hyper-format messaging platforms (stream
processors), which are better suited for a streaming paradigm.

Unlike the legacy MOM brokers, message brokers hold up high-performance capabilities, have a huge capacity
for message traffic, and are highly focused on streaming with minimal support requirements for task scheduling
and data transformations.

Stream processors can act as a proxy between two applications whereby communication is achieved through
queues. In that case, we can refer to them as point-to-point brokers. Alternatively, if an application is broadcasting
a single message or dataset to multiple applications, we can say that the broker is acting as a Publish/Subscribe
model.

Batch and real-time ETL tools

Stream data processes are vital components of the big data architecture in data-intensive organizations. In most

S
cases, data from multiple message brokers must be transformed and structured before the data sets can be
analyzed, typically using SQL-based analytics tools

This can also be achieved using an ETL tool or other platform that receives queries from users, gathers events
from message queues, then generates results by applying the query. Other processes such as performing
additional joins, aggregations, and transformations can also run concurrently with the process. The result may

be an action, a visualization, an API call, an alert, or in other cases, a new data stream.

Streaming Data Storage


Due to the sheer volume and multi-structured nature of event streams, organizations typically store their data
in the cloud to serve as an operational data lake. Data lakes offer long-term and low-cost solutions for storing
massive amounts of event data. They also offer a flexible integration point where tools outside your streaming
data architecture can access data.

Data analytics/Serverless query engine


S

After the stream data is processed and stored, it should be analyzed to give actionable value. For this, you need
E

data analytics tools such as query engines, text search engines, and streaming data analytics tools like Amazon
Kinesis and Azure Stream Analytics.
N

Streaming architecture patterns


IG

Even with a robust streaming data architecture, you still need streaming architecture patterns to build reliable,
secure, scalable applications in the cloud. They include:
V

Idempotent Producer

A typical event streaming platform cannot deal with duplicate events in an event stream. That’s where the
idempotent producer pattern comes in. This pattern deals with duplicate events by assigning each producer a
producer ID (PID). Every time it sends a message to the broker, it includes its PID along with a monotonically
increasing sequence number.
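A toy Python sketch of the idempotent-producer idea is shown below. It is not the API of any particular platform (brokers such as Kafka implement this natively); the class and field names are invented purely for illustration.

import itertools

class IdempotentProducer:
    """Tags each message with (producer_id, sequence_number) so the broker can discard duplicates."""
    def __init__(self, producer_id):
        self.producer_id = producer_id
        self._seq = itertools.count()

    def send(self, broker, payload):
        broker.receive({'pid': self.producer_id,
                        'seq': next(self._seq),
                        'payload': payload})

class Broker:
    """Keeps the highest sequence number seen per producer and drops replays."""
    def __init__(self):
        self.last_seq = {}          # pid -> last accepted sequence number
        self.log = []

    def receive(self, msg):
        pid, seq = msg['pid'], msg['seq']
        if seq <= self.last_seq.get(pid, -1):
            return                  # duplicate (or stale retry): ignore it
        self.last_seq[pid] = seq
        self.log.append(msg['payload'])

broker = Broker()
producer = IdempotentProducer(producer_id='sensor-42')
producer.send(broker, {'temp': 21.5})
broker.receive({'pid': 'sensor-42', 'seq': 0, 'payload': {'temp': 21.5}})  # replayed message is dropped
print(broker.log)   # [{'temp': 21.5}]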


S
TE
O
Event Splitter

Data sources mostly produce messages with multiple elements. The event splitter works by splitting an event
into multiple events. For instance, it can split an eCommerce order event into multiple events per order item,
making it easy to perform streaming data analytics.
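
A minimal sketch of the splitter, using a made-up order layout (the field names are assumptions for illustration, not a prescribed schema):

# Illustrative event splitter: one order event becomes one event per order item.
def split_order_event(order_event):
    for item in order_event["items"]:
        yield {
            "order_id": order_event["order_id"],
            "customer": order_event["customer"],
            "sku": item["sku"],
            "quantity": item["quantity"],
        }

order = {
    "order_id": "o-1001",
    "customer": "c-42",
    "items": [{"sku": "book", "quantity": 1}, {"sku": "pen", "quantity": 3}],
}
for event in split_order_event(order):
    print(event)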

Event Grouper

In some cases, events only become significant after they happen several times. For instance, an eCommerce
business will attempt parcel delivery at least three times before asking a customer to collect their order from the
depot.

The business achieves this by grouping logically similar events, then counting the number of occurrences over a
given period.
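
A rough sketch of this grouping-and-counting idea, with a hypothetical failed-delivery counter (the threshold of three matches the example above):

# Illustrative event grouper: count failed-delivery events per order and emit a
# "collect from depot" event only after the third failure.
from collections import Counter

failed_deliveries = Counter()

def on_delivery_failed(order_id, threshold=3):
    failed_deliveries[order_id] += 1
    if failed_deliveries[order_id] >= threshold:
        return {"type": "collect_from_depot", "order_id": order_id}
    return None

notification = None
for attempt in range(3):
    notification = on_delivery_failed("o-1001")
print(notification)   # {'type': 'collect_from_depot', 'order_id': 'o-1001'}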

Claim-check pattern

Message-based architectures often have to send, receive and manipulate large messages, such as in video
processing and image recognition. Since it is not recommended to send such large messages directly to the
message bus, organizations can send the claim check to the messaging platform instead and store the message
on an external service.
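
A minimal sketch of the claim-check idea, with plain dictionaries standing in for the external store and the message bus (illustrative only):

# Claim-check pattern: store the heavy payload externally and send only a reference.
import uuid

external_store = {}   # stands in for blob/object storage
message_bus = []      # stands in for the messaging platform

def send_large_message(payload):
    claim_check = str(uuid.uuid4())
    external_store[claim_check] = payload              # store the payload externally
    message_bus.append({"claim_check": claim_check})   # send only the small reference

def receive_message():
    envelope = message_bus.pop(0)
    return external_store[envelope["claim_check"]]     # redeem the claim check

send_large_message(b"...many megabytes of video frames...")
print(len(receive_message()))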

Final thoughts on streaming data architecture and streaming data analytics

As stream data models and architecture in big data become a vital component in the development of modern
data platforms, organizations are shifting from legacy monolithic architectures to a more decentralized model to
promote flexibility and scalability. The resulting effect is the delivery of robust and expedient solutions that not
only improve service delivery but also give an organization a competitive edge.

Stream Computing N
K
H
S
E
N
IG
V

Downloaded by Thasnim K ([email protected])


lOMoARcPSD|31939554

VIGNESH K NOTES
DS4015 BIG DATA ANALYTICS
 Stream computing, the long-held dream of “high real-time computing” and “high-throughput
computing”, with programs that compute continuous data streams, has opened up a new era of
future computing because of big data, which refers to datasets that are so large, fast, dispersed, and
unstructured that they are beyond the ability of available hardware and software facilities to undertake
their acquisition, access, analytics, and application in a reasonable amount of time and space.
 Stream computing is a computing paradigm that reads data from collections of software or hardware
sensors in stream form and computes continuous data streams, where feedback results should be in
a real-time data stream as well. A data stream is a sequence of data sets, a continuous stream is an
infinite sequence of data sets, and parallel streams have more than one stream to be processed at the
same time.
 Stream computing is one effective way to support big data by providing extremely low-latency
velocities with massively parallel processing architectures, and is becoming the fastest and most
efficient way to obtain useful knowledge from big data, allowing organizations to react quickly when
problems appear or to predict new trends in the near future.
 A big data input stream has the characteristics of high speed, real time, and large volume for
applications such as sensor networks, network monitoring, micro blog, web exploring, social
networking, and so on.
 These data sources often take the form of continuous data streams, and timely analysis of such a data
stream is very important as the life cycle of most data is very short.
 Furthermore, the volume of data is so high that there is not enough space for storage, and not all data
need to be stored. Thus, the storing-then-computing batch computing model does not fit at all. Nearly
all data in big data environments have the feature of streams, and stream computing has appeared to
solve the dilemma of big data computing by computing data online within real-time constraints.
Consequently, the stream computing model will be a new trend for high-throughput computing in the
big data era.


Stream sampling is the process of collecting a representative sample
of the elements of a data stream. The sample is usually much smaller
than the entire stream, but can be designed to retain many important
characteristics of the stream, and can be used to estimate many
important aggregates on the stream.
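
The notes describe stream sampling only at a high level. One standard way to keep a fixed-size representative sample of a stream whose length is unknown is reservoir sampling, sketched below in plain Python (this is an illustration chosen here, not a method prescribed by the notes):

# Reservoir sampling: maintain a uniform random sample of size k from a stream.
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)      # fill the reservoir with the first k elements
        else:
            j = random.randint(0, i)       # keep the new element with probability k/(i+1)
            if j < k:
                reservoir[j] = element
    return reservoir

print(reservoir_sample(range(1_000_000), k=10))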


Counting Distinct Elements in a Stream


Another common problem is counting the number of distinct elements that we have seen in the stream so far.
Again the assumption here is that the universal set of all elements is too large to keep in memory, so we’ll need
to find another way to count how many distinct values we’ve seen. If we’re okay with simply getting an estimate
instead of the actual value, we can use the Flajolet-Martin (FM) algorithm.

Flajolet-Martin Algorithm
In the FM algorithm we hash the elements of a stream into a bit string. A bit string is a sequence of zeros and
ones, such as 1011000111010100100. A bit string of length L can hold 2^L possible combinations. For the FM
algorithm to work, we need L to be large enough such that the bit string will have more possible combinations
than there are elements in the universal set. This basically means that there should be no possible collisions
when we hash the elements into a bit string.

The idea behind the FM algorithm is that the more distinct elements we see, the higher the likelihood that one
of their hash values will be “unusual”. The specific “unusualness” we will exploit here is that the bit string ends
in many consecutive 0s.

For example, the bit string 1011000111010100100 ends with 2 consecutive zeros. We call this value of 2 the tail
length of the bit string. Now let R be the maximum tail length that we have seen over all hashed bit strings of
the stream. The estimate of the number of distinct elements using FM is simply 2^R.

#Library function for non-streams
#(assumes 'words', a large list of strings, has been defined earlier in the notebook)
import hashlib
import numpy as np

def flajoletMartin(iterator):
    max_tail_length = 0
    for val in iterator:
        #Hash the element and turn the hash into a bit string
        bit_string = bin(hash(hashlib.md5(val.encode('utf-8')).hexdigest()))
        #Count the trailing zeros (the "tail length") of the bit string
        i = len(bit_string) - 1
        tail_length = 0
        while i >= 0:
            if bit_string[i] == '0':
                tail_length += 1
            else:
                #neatly handles the '0b' prefix of the binary string too.
                #Just break when we see "b"
                break
            i -= 1
        if tail_length > max_tail_length:
            max_tail_length = tail_length
    #Estimate of the number of distinct elements: 2^(maximum tail length)
    return (2**max_tail_length)

testList = []
n = 0
while n < 100000:
    n += 1
    testList.append(np.random.choice(words))

print(flajoletMartin(iter(testList)))

65536

Using Multiple Hash Functions

We can improve the estimate further by using multiple hash functions. With multiple hash functions we first
split the hash functions into groups, get the maximum tail length for each hash function, then get the average for
each group. Lastly, we get the median over all the averages and that will be our estimate.

This way we can get estimates that aren’t just powers of 2. If the correct count is between two large powers of
2, for example 7000, it will be impossible to get a good estimate using just one hash function.
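
A rough sketch of this combination, reusing the single-hash estimator above. Different hash functions are simulated here by salting the MD5 input, and the group size of 4 is an arbitrary illustrative choice; neither detail is prescribed by the text.

# Combine many hash functions: max tail length per hash, average within groups,
# then take the median of the group averages.
import hashlib
import statistics

def tail_length(val, salt):
    bit_string = bin(hash(hashlib.md5((salt + val).encode('utf-8')).hexdigest()))
    return len(bit_string) - len(bit_string.rstrip('0'))

def fm_multi_hash(values, num_hashes=16, group_size=4):
    max_tails = [0] * num_hashes           # max tail length seen by each "hash function"
    for val in values:
        for h in range(num_hashes):
            max_tails[h] = max(max_tails[h], tail_length(val, salt=str(h)))
    estimates = [2 ** t for t in max_tails]
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return statistics.median(statistics.mean(g) for g in groups)

print(fm_multi_hash(['user%d' % i for i in range(5000)]))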

Decaying Windows

We have assumed that a sliding window held a certain tail of the stream, either the most recent N elements for
fixed N, or all the elements that arrived after some time in the past. Sometimes we do not want to make a
sharp distinction between recent elements and those in the distant past, but want to weight the recent
elements more heavily. In this section, we consider “exponentially decaying windows,” and an application
where they are quite useful: finding the most common “recent” elements.

4.7.1 The Problem of Most-Common Elements


Suppose we have a stream whose elements are the movie tickets purchased all over the world, with the name
of the movie as part of the element. We want to keep a summary of the stream that is the most popular movies
“currently.”

While the notion of “currently” is imprecise, intuitively, we want to discount the popularity of a movie like Star
Wars–Episode 4, which sold many tickets, but most of these were sold decades ago. On the other hand, a movie
that sold n tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n tickets last
week but nothing in previous weeks.

One solution would be to imagine a bit stream for each movie, and give it value 1 if the ticket is for that movie,
and 0 otherwise. Pick a window size N, which is the number of most recent tickets that would be considered in
evaluating popularity. Then, use the method of Section 4.6 to estimate the number of tickets for each movie, and
rank movies by their estimated counts.

This technique might work for movies, because there are only thousands of movies, but it would fail if we were
instead recording the popularity of items sold at Amazon, or the rate at which different Twitter users tweet,
because there are too many Amazon products and too many tweeters. Further, it only offers approximate
answers.

4.7.2 Definition of the Decaying Window

An alternative approach is to redefine the question so that we are not asking for a count of 1’s in a window.
Rather, let us compute a smooth aggregation of all the 1’s ever seen in the stream, with decaying weights, so
the further back in the stream, the less weight is given. Formally, let a stream currently consist of the elements
a_1, a_2, . . . , a_t, where a_1 is the first element to arrive and a_t is the current element. Let c be a small constant,
such as 10^−6 or 10^−9. Define the exponentially decaying window for this stream to be the sum

    Σ_{i=0}^{t−1} a_{t−i} (1 − c)^i

The effect of this definition is to spread out the weights of the stream elements as far back in time as the
stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would put equal weight 1 on
each of the most recent 1/c elements to arrive and weight 0 on all previous elements. The distinction is
suggested by Fig. 4.4.

Window of length 1/c

Figure 4.4: A decaying window and a fixed-length window of equal weight

It is much easier to adjust the sum in an exponentially decaying window than in a sliding window of fixed length.
In the sliding window, we have to worry about the element that falls out of the window each time a new element
arrives. That forces us to keep the exact elements along with the sum, or to use an approximation scheme such
as DGIM. However, when a new element a_{t+1} arrives at the stream input, all we need to do is:

1. Multiply the current sum by 1 − c.

2. Add a_{t+1}.

The reason this method works is that each of the previous elements has now moved one position further from
the current element, so its weight is multiplied by 1 − c. Further, the weight on the current element is (1 − c)^0 = 1,
so adding a_{t+1} is the correct way to include the new element’s contribution.
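
A minimal sketch of this two-step update (the value of c is arbitrary here):

# Exponentially decaying sum: decay the old sum, then add the new element.
c = 1e-6
decaying_sum = 0.0

def add_element(value):
    global decaying_sum
    decaying_sum = decaying_sum * (1 - c)   # step 1: multiply the current sum by 1 - c
    decaying_sum += value                   # step 2: add the new element a_{t+1}

for bit in [1, 0, 1, 1, 0, 1]:              # toy bit stream
    add_element(bit)
print(decaying_sum)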

4.7.3 Finding the Most Popular Elements

Let us return to the problem of finding the most popular movies in a stream of ticket sales. We shall use an
exponentially decaying window with a constant c, which you might think of as 10^−9. That is, we approximate a
sliding window holding the last one billion ticket sales. For each movie, we imagine a separate stream with a 1
each time a ticket for that movie appears in the stream, and a 0 each time a ticket for some other movie arrives.
The decaying sum of the 1’s measures the current popularity of the movie.

We imagine that the number of possible movies in the stream is huge, so we do not want to record values for
the unpopular movies. Therefore, we establish a threshold, say 1/2, so that if the popularity score for a movie
goes below this number, its score is dropped from the counting. For reasons that will become obvious, the
threshold must be less than 1, although it can be any number less than 1. When a new ticket arrives on the
stream, do the following:

1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).

2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that score. If there is no
score for M, create one and initialize it to 1.

3. If any score is below the threshold 1/2, drop that score.

It may not be obvious that the number of movies whose scores are maintained at any time is limited. However,
note that the sum of all scores is 1/c. There cannot be more than 2/c movies with a score of 1/2 or more, or else
the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of movies being counted at any time.
Of course in practice, the ticket sales would be concentrated on only a small number of movies at any time, so
the number of actively counted movies would be much less than 2/c.
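
A small sketch of the three steps above, with c deliberately large so the decay is visible on a toy stream (in practice c would be tiny, such as 10^−9):

# Per-movie decaying popularity scores with a threshold of 1/2.
c = 0.3
threshold = 0.5
scores = {}    # movie -> current decaying score

def new_ticket(movie):
    for m in list(scores):                       # step 1: decay every maintained score
        scores[m] *= (1 - c)
    scores[movie] = scores.get(movie, 0) + 1     # step 2: credit the movie on this ticket
    for m in list(scores):                       # step 3: drop scores below the threshold
        if scores[m] < threshold:
            del scores[m]

for ticket in ["StarWars", "Dune", "Dune", "Dune", "Barbie", "Dune"]:
    new_ticket(ticket)
print(scores)   # StarWars has decayed below 1/2 and been dropped; Dune dominates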

Real-time analytics on big data architecture



This solution idea describes how you can get insights from live streaming data. Capture data continuously from
any IoT device, or logs from website clickstreams, and process it in near-real time.

Architecture

Download a Visio file of this architecture.

Dataflow
1. Easily ingest live streaming data for an application, by using Azure Event Hubs.
2. Bring together all your structured data using Synapse Pipelines to Azure Blob Storage.
3. Take advantage of Apache Spark pools to clean, transform, and analyze the streaming data, and
combine it with structured data from operational databases or data warehouses.
4. Use scalable machine learning/deep learning techniques to derive deeper insights from this
data, using Python, Scala, or .NET, with notebook experiences in Apache Spark pools.
5. Apply Apache Spark pool and Synapse Pipelines in Azure Synapse Analytics to access and move
data at scale.
6. Build analytics dashboards and embedded reports in dedicated SQL pool to share insights within
your organization and use Azure Analysis Services to serve this data to thousands of users.
7. Take the insights from Apache Spark pools to Azure Cosmos DB to make them accessible
through real time apps.

Components

 Azure Synapse Analytics is the fast, flexible, and trusted cloud data warehouse that lets you
scale, compute, and store elastically and independently, with a massively parallel processing
architecture.
 Synapse Pipelines Documentation allows you to create, schedule, and orchestrate your ETL/ELT
workflows.
 Azure Data Lake Storage: Massively scalable, secure data lake functionality built on Azure Blob
Storage.
 Azure Synapse Analytics Spark pools is a fast, easy, and collaborative Apache Spark-based
analytics platform.
 Azure Event Hubs Documentation is a big data streaming platform and event ingestion
service.
 Azure Cosmos DB is a globally distributed, multi-model database service. Then learn how to
replicate your data across any number of Azure regions and scale your throughput independent
from your storage.
 Azure Synapse Link for Azure Cosmos DB enables you to run near real-time analytics over
operational data in Azure Cosmos DB, without any performance or cost impact on your
transactional workload, by using the two analytics engines available from your Azure Synapse
workspace: SQL Serverless and Spark Pools.
 Azure Analysis Services is an enterprise grade analytics as a service that lets you govern, deploy,
test, and deliver your BI solution with confidence.
 Power BI is a suite of business analytics tools that deliver insights throughout your organization.
Connect to hundreds of data sources, simplify data prep, and drive unplanned analysis. Produce
beautiful reports, then publish them for your organization to consume on the web and across
mobile devices.

Alternatives
N

 Synapse Link is the Microsoft preferred solution for analytics on top of Azure Cosmos DB data.
IG

 Azure IoT Hub can be used instead of Azure Event Hubs. IoT Hub is a managed service hosted in
the cloud that acts as a central message hub for communication between an IoT application and
its attached devices. You can connect millions of devices and their backend solutions reliably
V

and securely. Almost any device can be connected to an IoT hub.

Scenario details

This scenario illustrates how you can get insights from live streaming data. You can capture data continuously
from any IoT device, or logs from website clickstreams, and process it in near-real time.

Potential use cases

This solution is ideal for the media and entertainment industry. The scenario is for building analytics from live
streaming data.

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding
tenets that can be used to improve the quality of a workload. For more information, see Microsoft Azure Well-
Architected Framework.

Cost optimization

Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational
efficiencies. For more information, see Overview of the cost optimization pillar.

The Operation of Real-time Analytics


Real-time analytics tools can either pull or push data. Streaming requires the capacity to push huge amounts of
fast-moving data. If streaming would consume too many resources or is not practical, data can instead be pulled
at intervals ranging from a couple of seconds to hours. The pull schedule has to be worked out around business
requirements so that it does not interrupt the flow of work. The time to react for real-time analysis can vary from
nearly instantaneous to a few seconds or minutes. The key components of real-time analytics comprise the
following.

o Aggregator

o Broker

o Analytics engine

o Stream processor

Benefits of Real-time Analytics

Speed is the primary benefit of real-time analysis of data. The less time a business has to wait between the
moment data arrives and the moment it is processed, the sooner it can use the resulting insights to make changes
and act on crucial decisions.

In the same way, real-time analytics tools allow companies to see how users interact with a product right after its
release, so there is no delay in understanding user behaviour and making the necessary adjustments.

Advantages of Real-time Analytics:

Real-time analytics provides the following benefits over traditional analytics.

o Create our interactive analytics tools.

o Transparent dashboards allow users to share information.

o Monitor behaviour in a way that is customized.

o Perform immediate adjustments if necessary.

o Make use of machine learning.

Other Benefits:

Other advantages and benefits include managing data location, detecting irregularities, enhancing marketing and
sales, etc. The following benefits can be useful.

Real Time Sentiment Analysis

What Is Sentiment Analysis?

Sentiment analysis is a text analysis tool that uses machine learning with natural language processing (NLP) to
automatically read and classify text as positive, negative, neutral, and everywhere in between. It can read all
manner of text (online and elsewhere) for opinion and emotion – to understand the thoughts and feelings of
the writer.

See the example below from a pre-trained sentiment analyzer, which easily classifies a customer comment as
negative with near 100% confidence.

Tag: Negative, Confidence: 100.0%
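
To try this kind of classification locally rather than through the MonkeyLearn interface shown above, one option (an assumption of this note, not the tool in the screenshot) is NLTK's pre-trained VADER analyzer:

# Rough local stand-in for a pre-trained sentiment analyzer, using NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')          # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

comment = "The checkout process was confusing and support never replied."
scores = sia.polarity_scores(comment)   # dict with neg/neu/pos and a compound score
print(scores)
if scores["compound"] <= -0.05:
    print("Negative")
elif scores["compound"] >= 0.05:
    print("Positive")
else:
    print("Neutral")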

Sentiment analysis can be put to work on hundreds of pages and thousands of individual opinions in just
seconds, and constantly monitor Twitter, Facebook, emails, customer service tickets, etc., 24/7 and in real time.

Real-time sentiment gives you a window into what your customers and the public at large are expressing about
your brand “right now” for targeted, minute-by-minute analysis, and to follow brand sentiment over time.

Some of the benefits of performing real-time sentiment analysis:

S
Marketing campaign success analysis

TE
Target your analysis to follow marketing campaigns right as they launch and get a solid idea of how your
messaging is working with current and potentially new customers. Find out which demographics respond most
positively or negatively. Follow sentiment as it rises or falls, and compare current campaigns against previous
ones.

O
Prevention of business disasters

N
In 2017, United Airlines forcibly removed a passenger from an overbooked flight. Other passengers posted
videos of the incident to Facebook, one of which had been viewed 6.8 million times just 24 hours later. After
K
United’s CEO responded to the incident as "reaccommoda[ting] these customers," Twitter exploded in outrage
and public shaming of United.
H

Negative comments on social media can travel around the world in just minutes. Real-time sentiment analysis of
Twitter data, for example, will allow you to put out fires from negative comments before they grow out of
S

control, or use positive comments to your advantage. Oftentimes it’s helpful just to let them know you’re
listening:
E
N
IG
V


S
TE
O
N
K
H
S
E
N
IG

Instant product feedback

You can similarly follow feedback on new products right as they’re released. Influencers (and regular social
V

media users) are eager to be the first commenters upon the release of new products or updates. Follow social
media and online reviews to tweak products or beta releases right after release, or stimulate conversations with
your customers, so they always know they’re important to you.

You can even use social media sentiment analysis for market research to find out what’s missing from the
market or for competitive research to exploit the shortcomings of your competition and create new products.

Stock market predictions

Follow the real-time sentiment of any business as it rises and falls to get up-to-the-minute information on stock
price changes. If a new product release is met with enthusiasm across the board, you can expect the stock to
rise. While a social media PR crisis can bring even industry giants to their knees.

Tutorial: How to Perform Sentiment Analysis in Real Time

1. Set your goals


2. Gather your data

S
3. Clean your data
4. Analyze & visualize sentiments in real-time

TE
5. Act on your results

There are two options when it comes to performing sentiment analysis: build a model or invest in a SaaS tool.

O
Building a model can produce exceptional results, but it is time-consuming and costly.

N
SaaS tools, on the other hand, are generally ready to put into use right away, much less expensive, and you can
still train custom models to the specific language, needs, and criteria of your organization.
K
MonkeyLearn’s powerful SaaS platform offers immediate access to sentiment analysis tools and other text
analytics techniques, like the keyword extractor, survey feedback classifier, intent and email classifier, and many
many more.
H

And with MonkeyLearn Studio, you can analyze and visualize your results in real time.
S

Let’s take a look at how easy it can be to perform real-time sentiment analysis.
E

1. Set your goals


N

First, decide what you want to achieve. Do you want to compare sentiment toward your brand against that of
your competition? Do you want to regularly mine Twitter or perform social listening to extract brand mentions
IG

and follow your brand sentiment from minute to minute?

Maybe you need to automatically analyze email exchanges and customer support tickets to get an idea of how
V

well your customer service is working. The use cases for real-time sentiment analysis are practically endless
when you have the right tools in place.

2. Gather your data

There are a number of ways to get the data you need, from simply cutting and pasting, to using APIs. Below are
some of the most common and easiest to use.

Tools like Zapier easily integrate with MonkeyLearn to pull brand mentions from Twitter or other outlets of your
choice.

Web mining and web scraping tools, like Dexi, Content Grabber, and Pattern allow you to link APIs or extract
content directly from the web into CSV or Excel files, and more.

APIs:

S
 The Graph API is best for pulling data directly from Facebook.
 Twitter’s API allows users access to public Twitter data.

TE
 The Python Reddit API Wrapper scrapes data from subreddits, accesses comments from specific posts,
and more.

3. Clean your data

O
Website, social media, and email data often have quite a bit of “noise.” This can be repetitive text, banner ads,

N
non-text symbols and emojis, email signatures, etc. You need to first remove this unnecessary data, or it will
skew your results.
K
You can run spell check or scan documents for URLs and symbols, but you’re much better off automating this
process – especially for accurate real-time analysis – because time is of the essence, and manual data cleaning
will create an information bottleneck.
H

MonkeyLearn offers several models to make data cleaning quick and easy. The boilerplate extractor extracts
S

only the text you want from HTML, removing unneeded clutter, like templates, navigation bars, ads, etc.
E

The email cleaner automatically removes email signatures, legal notices, and previous replies to give you only
the most recent message in the chain:
N
IG
V

And the opinion units extractor breaks up sentences or entire pages of text into individual sentiments or
thoughts called “opinion units”:

S
TE
O
N
K
It can break down hundreds of pages and thousands of opinion units automatically to prep your data for
H

analysis.
S

4. Analyze & visualize sentiments in real time


E

MonkeyLearn Studio is an all-in-one real-time sentiment analysis and visualization tool. After a simple set-up,
you just upload your data and visualize the results for powerful insights.
N
IG
V


S
TE
4.1. Choose a MonkeyLearn Studio template

O
MonkeyLearn Studio allows you to chain together a number of text analysis techniques, like keyword

N
extraction, aspect classification, intent classification, and more, along with your real-time sentiment analysis, for
super fine-grained results.
K
If you want to learn how to build a custom sentiment analysis model to your specific criteria (and then use it
with MonkeyLearn Studio), take a look at this tutorial. You can do it in just a few steps.
H

Once you’re ready for MonkeyLearn Studio, you can choose an existing template or create your own:
S
E
N
IG
V

4.2. Upload your data

You can upload cleaned text from a CSV or Excel file, connect to integrations with Zendesk, SurveyMonkey, etc.,
or use simple, low-code APIs to extract directly from social media, websites, email, and more.

S
TE
O
N
K
4.3. Run your analysis
H

As you can see below, the model automatically tags the statement for Sentiment, Category, and Intent, all
working simultaneously.
S
E
N
IG
V

4.4. Automatically visualize your results with MonkeyLearn Studio

MonkeyLearn Studio’s deep learning models are able to chain together a number of text analysis techniques in a
seamless process, so you just set it up and let it do the work for you. Once your real-time sentiment analyzer is
trained to your criteria, it can perform analysis 24/7, with limitless accuracy.

Take a look at the MonkeyLearn Studio dashboard below. In this case we ran aspect-based sentiment
analysis on customer reviews of Zoom. Each opinion unit is categorized by “aspect” or category: Usability,

S
Support, Reliability, etc., then each category is run through sentiment analysis to show opinion from positive to
negative.

TE
O
N
K
H
S
E

You can see how individual reviews have been pulled by date and time for real-time analysis, and to follow
N

categories and sentiments as they change over time.

Another analysis for “intent,” shows the reason for the comment. This is more often used to analyze emails and
IG

customer service data. In this case, as this is an analysis of customer reviews, most are simply marked as
“opinion.”
V

5. Act on your results

The results are in! With sentiment analysis and MonkeyLearn Studio, you can be confident you’re making real-
time, data-driven decisions.

Imagine you release a new product. You can perform real-time aspect-based sentiment analysis on Twitter
mentions of your product, for example, to find out what aspect your customers are responding to most
favorably or unfavorably.

Play around with the public dashboard to see how it works: search by date, sentiment, category, etc. With
MonkeyLearn Studio you can perform new analyses and add or remove data directly in the dashboard. No more
uploading and downloading data between applications – it’s all right there.

S
TE
Stock Market Prediction

O
Introduction
Stock market prediction and analysis are some of the most difficult jobs to complete. There are numerous causes
for this, including market volatility and a variety of other dependent and independent variables that influence the
value of a certain stock in the market. These variables make it extremely difficult for any stock market expert to
anticipate the rise and fall of the market with great precision. Considered among the most potent tree-based
techniques, Random Forest can predict stock prices as it can also solve regression-based problems.

However, with the introduction of Data Science, Machine Learning, artificial intelligence and their
strong algorithms, the most recent market research and stock price prediction work have begun to
include such approaches in analyzing stock market data.
S

Source: moneycontrol.com
E
N
IG
V

In summary, Machine Learning Algorithms like regression, classifier, and support vector machine (SVM) are widely
utilized by many organizations in stock market prediction. This article will walk through a simple implementation
of analyzing and forecasting the stock prices of a Popular Worldwide Online Retail Store in Python using various
Machine Learning Algorithms.

Learning Objectives

 In this tutorial, we will learn about the best ways possible to predict stock prices using a long-short-term

S
memory (LSTM) for time series forecasting.
 We will learn everything about stock market prediction using LSTM.

TE
Problem Statement for Stock Market Prediction
Let us see the data on which we will be working before we begin implementing the software to anticipate stock

O
market values. In this section, we will examine the stock price of Microsoft Corporation (MSFT) as reported by the
National Association of Securities Dealers Automated Quotations (NASDAQ). The stock price data will be supplied

N
as a Comma Separated File (.csv) that may be opened and analyzed in Excel or a Spreadsheet.

MSFT’s stocks are listed on NASDAQ, and their value is updated every working day of the stock market. It should
K
be noted that the market does not allow trading on Saturdays and Sundays. Therefore, there is a gap between
the two dates. The Opening Value of the stock, the Highest and Lowest values of that stock on the same day, as
well as the Closing Value at the end of the day are all indicated for each date.
H

The Adjusted Close Value reflects the stock’s value after dividends have been declared (too technical!).
S

Furthermore, the total volume of the stocks in the market is provided. With this information, it is up to the job of
a Machine Learning/Data Scientist to look at the data and develop different algorithms that may extract patterns
E

from the historical data of the Microsoft Corporation stock.


N

Stock Market Prediction Using the Long Short-Term Memory Method


We will use the Long Short-Term Memory(LSTM) method to create a Machine Learning model to forecast
IG

Microsoft Corporation stock values. They are used to make minor changes to the information by multiplying and
adding. Long short-term memory (LSTM) is a deep learning artificial recurrent neural network (RNN) architecture.
V

Unlike traditional feed-forward neural networks, LSTM has feedback connections. It can handle single data points
(such as pictures) as well as full data sequences (such as speech or video).

Program Implementation
We will now go to the section where we will utilize Machine Learning techniques in Python to estimate the stock
value using the LSTM.

Step 1: Importing the Libraries

As we all know, the first step is to import the libraries required to preprocess Microsoft Corporation stock data
and the other libraries required for constructing and visualizing the LSTM model outputs. We’ll be using the Keras
library from the TensorFlow framework for this. All modules are imported from the Keras library.

#Importing the Libraries
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
import matplotlib.dates as mdates
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout
import keras.backend as K
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.utils.vis_utils import plot_model

Step 2: Getting to Visualising the Stock Market Prediction Data


IG

Using the Pandas Data Reader library, we will upload the stock data from the local system as a Comma Separated
Value (.csv) file and save it to a pandas DataFrame. Finally, we will examine the data.
V

#Get the Dataset
df = pd.read_csv("MicrosoftStockData.csv", na_values=['null'], index_col='Date',
                 parse_dates=True, infer_datetime_format=True)
df.head()
Step 3: Checking for Null Values by Printing the DataFrame Shape

In this step, firstly, we will print the structure of the dataset. We’ll then check for null values in the data frame to
ensure that there are none. The existence of null values in the dataset causes issues during training since they
function as outliers, creating a wide variance in the training process.

#Print the shape of Dataframe and Check for Null Values
print("Dataframe Shape: ", df.shape)
print("Null Value Present: ", df.isnull().values.any())
Output:
>> Dataframe Shape: (7334, 6)
>> Null Value Present: False

TE
Date Open High Low Close Adj Close

1990-01-02 0.605903 0.616319 0.598090 0.616319 0.447268

O
1990-01-03 0.621528 0.626736 0.614583 0.619792 0.449788

1990-01-04 0.619792 0.638889 0.616319 0.638021 0.463017

1990-01-05 0.635417 0.638889 N


0.621528 0.622396 0.451678
K
1990-01-08 0.621528 0.631944 0.614583 0.631944 0.458607

Step 4: Plotting the True Adjusted Close Value


H

The Adjusted Close Value is the final output value that will be forecasted using the Machine Learning model. This
figure indicates the stock’s closing price on that particular day of stock market trading.
S

#Plot the True Adj Close Value
df['Adj Close'].plot()
N
IG
V

Step 5: Setting the Target Variable and Selecting the Features

The output column is then assigned to the target variable in the following step. It is the adjusted relative value of
Microsoft Stock in this situation. Furthermore, we pick the features that serve as the independent variable to the
target variable (dependent variable). We choose four characteristics to account for training purposes:

 Open
 High

S
 Low
 Volume

TE
#Set Target Variable
output_var = pd.DataFrame(df['Adj Close'])

#Selecting the Features
features = ['Open', 'High', 'Low', 'Volume']
Step 6: Scaling

N
To decrease the computational cost of the data in the table, we will scale the stock values to values between 0
and 1. As a result, all of the data in large numbers is reduced, and therefore memory consumption is decreased.
K
Also, because the data is not spread out in huge values, we can achieve greater precision by scaling down. To
perform this, we will be using the MinMaxScaler class of the sci-kit-learn library.
H

#Scaling
scaler = MinMaxScaler()
S

feature_transform = scaler.fit_transform(df[features])
feature_transform= pd.DataFrame(columns=features, data=feature_transform, index=df.index)
E

feature_transform.head()
N

Date Open High Low Volume

1990-01-02 0.000129 0.000105 0.000129 0.064837


IG

1990-01-03 0.000265 0.000195 0.000273 0.144673

1990-01-04 0.000249 0.000300 0.000288 0.160404


V

1990-01-05 0.000386 0.000300 0.000334 0.086566

1990-01-08 0.000265 0.000240 0.000273 0.072656

As shown in the above table, the values of the feature variables are scaled down to lower values when compared
to the real values given above.

Step 7: Creating a Training Set and a Test Set for Stock Market Prediction

We must divide the entire dataset into training and test sets before feeding it into the training model. The
Machine Learning LSTM model will be trained on the data in the training set and tested for accuracy and
backpropagation on the test set.

The sci-kit-learn library’s TimeSeriesSplit class will be used for this. We set the number of splits to 10, indicating
that 10% of the data will be used as the test set and 90% of the data will be used to train the LSTM model. The

S
advantage of utilizing this Time Series split is that the split time series data samples are examined at regular time
intervals.

TE
#Splitting to Training set and Test set
timesplit= TimeSeriesSplit(n_splits=10)
for train_index, test_index in timesplit.split(feature_transform):

O
X_train, X_test = feature_transform[:len(train_index)], feature_transform[len(train_index):
(len(train_index)+len(test_index))]

(len(train_index)+len(test_index))].values.ravel()
Step 8: Data Processing For LSTM
N
y_train, y_test = output_var[:len(train_index)].values.ravel(), output_var[len(train_index):
K
Once the training and test sets are finalized, we will input the data into the LSTM model. Before we can do that,
we must transform the training and test set data into a format that the LSTM model can interpret. As the LSTM
H

needs that the data to be provided in the 3D form, we first transform the training and test data to NumPy arrays
and then restructure them to match the format (Number of Samples, 1, Number of Features). Now, 6667 are the
S

number of samples in the training set, which is 90% of 7334, and the number of features is 4. Therefore, the
E

training set is reshaped to reflect this (6667, 1, 4). Likewise, the test set is reshaped.

#Process the data for LSTM


N

trainX =np.array(X_train)
testX =np.array(X_test)
IG

X_train = trainX.reshape(X_train.shape[0], 1, X_train.shape[1])


X_test = testX.reshape(X_test.shape[0], 1, X_test.shape[1])
V

Step 9: Building the LSTM Model for Stock Market Prediction

Finally, we arrive at the point when we construct the LSTM Model. In this step, we’ll build a Sequential Keras
model with one LSTM layer. The LSTM layer has 32 units and is followed by one Dense Layer of one neuron.

We compile the model using Adam Optimizer and the Mean Squared Error as the loss function. For an LSTM
model, this is the most preferred combination. The model is plotted and presented below.

#Building the LSTM Model
lstm = Sequential()
lstm.add(LSTM(32, input_shape=(1, trainX.shape[1]), activation='relu', return_sequences=False))
lstm.add(Dense(1))
lstm.compile(loss='mean_squared_error', optimizer='adam')
plot_model(lstm, show_shapes=True, show_layer_names=True)

S
TE
O
N
K
Step 10: Training the Stock Market Prediction Model
H

Finally, we use the fit function to train the LSTM model created above on the training data for 100 epochs with a
batch size of 8.
S

#Model Training
E

history=lstm.fit(X_train, y_train, epochs=100, batch_size=8, verbose=1, shuffle=False)


Eросh 1/100
N

834/834 [==============================] – 3s 2ms/steр – lоss: 67.1211


Eросh 2/100
IG

834/834 [==============================] – 1s 2ms/steр – lоss: 70.4911


Eросh 3/100
834/834 [==============================] – 1s 2ms/steр – lоss: 48.8155
V

Eросh 4/100
834/834 [==============================] – 1s 2ms/steр – lоss: 21.5447
Eросh 5/100
834/834 [==============================] – 1s 2ms/steр – lоss: 6.1709
Eросh 6/100
834/834 [==============================] – 1s 2ms/steр – lоss: 1.8726
Eросh 7/100

834/834 [==============================] – 1s 2ms/steр – lоss: 0.9380
Eросh 8/100
834/834 [==============================] – 2s 2ms/steр – lоss: 0.6566
Eросh 9/100
834/834 [==============================] – 1s 2ms/steр – lоss: 0.5369
Eросh 10/100
834/834 [==============================] – 2s 2ms/steр – lоss: 0.4761

S
.
.

TE
.
.
Eросh 95/100
834/834 [==============================] – 1s 2ms/steр – lоss: 0.4542

O
Eросh 96/100
834/834 [==============================] – 2s 2ms/steр – lоss: 0.4553
Eросh 97/100
834/834 [==============================] – 1s 2ms/steр – lоss: 0.4565
Eросh 98/100
N
K
834/834 [==============================] – 1s 2ms/steр – lоss: 0.4576
Eросh 99/100
834/834 [==============================] – 1s 2ms/steр – lоss: 0.4588
H

Eросh 100/100
834/834 [==============================] – 1s 2ms/steр – lоss: 0.4599
S

Finally, we can observe that the loss value has dropped exponentially over time over the 100-epoch training
E

procedure, reaching a value of 0.4599.

Step 11: Making the LSTM Prediction


N

Now that we have our model ready, we can use it to forecast the Adjacent Close Value of the Microsoft stock by
IG

using a model trained using the LSTM network on the test set. This is accomplished by employing the simple
predict function on the LSTM model that has been created.
V

#LSTM Prediction
y_pred= lstm.predict(X_test)
Step 12: Comparing Predicted vs True Adjusted Close Value – LSTM

Finally, now that we’ve projected the values for the test set, we can display the graph to compare both Adj Close’s
true values and Adj Close’s predicted value using the LSTM Machine Learning model.

#Predicted vs True Adj Close Value – LSTM
plt.plot(y_test, label='True Value')
plt.plot(y_pred, label='LSTM Value')
plt.title("Prediction by LSTM")
plt.xlabel('Time Scale')
plt.ylabel('Scaled USD')
plt.legend()

S
plt.show()

TE
O
N
K
H

The graph above demonstrates that the extremely basic single LSTM network model created above detects some
patterns. We may get a more accurate depiction of every specific company’s stock value by fine-tuning many
S

parameters and adding more LSTM layers to the model.


E
N
IG
V
