Data Analytics
B. Tech. Semester V
The data that is collected is known as raw data. Raw data is not useful on its own, but once the impurities are cleaned out and the data is used for further analysis it becomes information, and the insight obtained from that information is known as "knowledge". Knowledge can take many forms, such as business knowledge, knowledge about sales of enterprise products, disease treatment, etc. The main goal of data collection is to collect information-rich data.
Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is. Most of the data collected is of two types: "qualitative data", which is non-numerical data such as words and sentences and mostly focuses on the behaviour and actions of a group, and "quantitative data", which is in numerical form and can be calculated using different scientific tools and sampling techniques.
The actual data is then further divided mainly into two types known as:
Primary data
Secondary data
1. Primary Data:
Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.
a. Interview method:
In this method, data is collected by interviewing the target audience: the person who conducts the interview is called the interviewer and the person who answers is the interviewee. Some basic business or product related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing. Interviews can be both structured and unstructured, for example personal interviews or formal interviews conducted over the telephone, face to face, by email, etc.
b. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video. Surveys can be conducted in both online and offline mode, for example through website forms and email. The survey answers are then stored for data analysis. Examples are online surveys or surveys through social media polls.
c. Observation method:
The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collection tool and stores the observed data in the form of text, audio, video, or other raw formats. In this method, data may also be collected by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards a product. The data obtained is then sent for processing.
d. Experimental method:
The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD (completely randomized design), RBD (randomized block design), LSD (Latin square design), and FD (factorial design).
2. Secondary data:
Secondary data is data that has already been collected and is reused again for some valid purpose. This type of data is previously derived from primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
This type of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumed in obtaining data from internal sources are low.
External source:
Data that cannot be found within the organization and has to be obtained through external, third-party resources is external source data. The cost and time consumption are higher because such sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
Sensor data: With the advancement of IoT devices, the sensors of these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
Satellite data: Satellites and their surveillance cameras collect images and data running into terabytes on a daily basis, which can be processed to extract useful information.
Web traffic: Due to fast and cheap internet facilities, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the keywords and queries that are searched most often.
Classification of data (structured, semi-structured,
unstructured)
Big Data includes huge volume, high velocity, and an extensible variety of data. It comes in three types: structured data, semi-structured data, and unstructured data.
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It concerns all data which can be stored in a SQL database in a table with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most processed form in application development and the simplest kind of information to manage. Example: relational data.
Examples Of Structured Data
An 'Employee' table in a database is an example of Structured Data
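As a small sketch of how such structured data is handled in practice (the table, column names, and values below are illustrative and not taken from the original figure), an 'Employee' table can be created and queried with SQL, here through Python's built-in sqlite3 module:

import sqlite3

# Illustrative only: a hypothetical 'Employee' table showing rows, columns and pre-designed fields.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (id INTEGER PRIMARY KEY, name TEXT, gender TEXT, salary REAL)")
cur.executemany(
    "INSERT INTO Employee (id, name, gender, salary) VALUES (?, ?, ?, ?)",
    [(1, "A. Kumar", "M", 52000.0), (2, "S. Rao", "F", 61000.0)],
)
for row in cur.execute("SELECT name, salary FROM Employee WHERE salary > 55000"):
    print(row)   # each row maps cleanly onto the table's fixed schema
conn.close()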
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (which can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space. Example: XML data.
Examples Of Semi-structured Data
Personal data stored in an XML file-
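The XML content of the original figure is not reproduced here; the hypothetical personal-data record below illustrates the idea, since the tags give the data partial structure that can still be parsed even though no rigid table schema exists:

import xml.etree.ElementTree as ET

# Hypothetical personal-data record; tag names are made up for illustration.
xml_data = """<person>
    <name>Asha</name>
    <age>23</age>
    <email>asha@example.com</email>
</person>"""

root = ET.fromstring(xml_data)
print(root.find("name").text, root.find("age").text)   # fields located via tags, not columns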
3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. For unstructured data there are alternative platforms for storing and managing it; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media logs.
Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot have any
erroneous elements and must convey the correct message without being misleading. This accuracy
and precision have a component that relates to its intended use. Without understanding how the data
will be consumed, ensuring accuracy and precision could be off-target or more costly than
necessary. For example, accuracy in healthcare might be more important than in another industry
(which is to say, inaccurate data in healthcare could have more serious consequences) and,
therefore, justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to a
set of options and open answers are not permitted. Any answers other than these would not be
considered valid or legitimate based on the survey’s requirement. This is the case for most data and
must be carefully considered when determining its quality. The people in each department in an
organization understand what data is valid or not to them, so the requirements must be leveraged
when evaluating data quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the same
source data. Regardless of what source collected the data or where it resides, it cannot contradict a
value residing in a different source or collected by a different system. There must be a stable and
steady mechanism that collects and stores the data without contradiction or unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.
Availability and Accessibility: This characteristic can be tricky at times due to legal and regulatory
constraints. Regardless of the challenge, though, individuals need the right level of access to the
data in order to perform their jobs. This presumes that the data exists and is available for access to
be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important, because
confusion and inaccurate decisions can otherwise occur. Aggregated, summarized and manipulated
collections of data could offer a different meaning than the data implied at a lower level. An
appropriate level of granularity must be defined to provide sufficient uniqueness and distinctive
properties to become visible. This is a requirement for operations to function effectively.
Big Data is a collection of data that is huge in volume and yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store it or process it efficiently. In short, big data is still data, but of enormous size.
Volume
Variety
Velocity
Variability
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a
very crucial role in determining value out of data. Also, whether a particular data can actually be
considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one
characteristic which needs to be considered while dealing with Big Data.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most of
the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. are also being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
What Is Data Analytics?
Data analytics is the science of analyzing raw data in order to draw conclusions from that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.
Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass of
information. This information can then be used to optimize processes to increase the overall
efficiency of a business or system.
1. The first step is to determine the data requirements or how the data is grouped. Data may be
separated by age, demographic, income, or gender. Data values may be numerical or be
divided by category.
2. The second step in data analytics is the process of collecting it. This can be done through a
variety of sources such as computers, online sources, cameras, environmental sources, or
through personnel.
3. Once the data is collected, it must be organized so it can be analyzed. Organization may take
place on a spreadsheet or other form of software that can take statistical data.
4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure
there is no duplication or error, and that it is not incomplete. This step helps correct any
errors before it goes on to a data analyst to be analyzed.
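As a rough illustration of steps 3 and 4, the short sketch below uses the pandas library (with made-up values) to organize a small data set and then scrub it of duplicates and incomplete rows before analysis:

import pandas as pd

# Hypothetical raw data: organize it in a tabular form, then clean it before analysis.
raw = pd.DataFrame({
    "age":    [23, 23, None, 41],
    "income": [32000, 32000, 54000, None],
    "gender": ["F", "F", "M", "M"],
})
clean = (raw.drop_duplicates()        # remove duplicated records
            .dropna()                 # drop incomplete rows (imputation is an alternative)
            .reset_index(drop=True))
print(clean)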
As the decades have passed, data has moved far beyond the scale that people can handle manually. The amount of data has grown at least as fast as the computing power of the machines that process it. It may not be necessary to personally break a sweat and get a headache computing things by hand, but it is still very easy to cause computer and storage systems to start steaming as they struggle to process the data fed to them.
Traditional Analytic Architecture
MAPREDUCE
MapReduce is a parallel programming framework. It is neither a database nor a direct competitor to databases. This has not stopped some people from claiming it is going to replace databases and everything else under the sun. The reality is that MapReduce is complementary to existing technologies. There are a lot of tasks that can be done in a MapReduce environment that can also be done in a relational database. What it comes down to is identifying which environment is better for the problem at hand. Being able to do something with a tool or technology is not the same as that being the best way to do it. By focusing on what MapReduce is best for, instead of what theoretically can be done with it, it is possible to maximize the benefits received.
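The toy sketch below shows the map and reduce idea on a word-count task in plain Python; an actual MapReduce framework would distribute the map calls and the grouping of keys across many machines, but the logical structure is the same:

from collections import defaultdict

# Toy word count illustrating the two phases that MapReduce parallelizes.
documents = ["big data tools", "data stream data"]

def map_phase(doc):
    # emit (key, 1) pairs for every word
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # aggregate all values that share the same key
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))   # {'big': 1, 'data': 3, 'tools': 1, 'stream': 1}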
Data Analysis Process consists of the following phases that are iterative in nature −
Big data is the storage and analysis of large data sets. These are complex data sets which can be
both structured or unstructured. They are so large that it is not possible to work on them with
traditional analytical tools. These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring more efficiency in their
work environment.
There are many big data tools and processes being utilised by companies these days. These are used
in the processes of discovering insights and supporting decision making. The top big data tools used
these days are open source data tools, data visualization tools, sentiment tools, data extraction tools
and databases. Some of the best used big data tools are mentioned below –
1. R-Programming
R is a free open source software programming language and a software environment for statistical
computing and graphics. It is used by data miners for developing statistical software and data
analysis. It has become a highly popular tool for big data in recent years.
2. Datawrapper
It is an online data visualization tool for making interactive charts. You upload your data as a CSV, PDF, or Excel file, or paste it directly into the field, and Datawrapper then generates a visualization in the form of a bar chart, line chart, map, etc. The chart can be embedded into any other website as well. It is easy to use and produces visually effective charts.
3. Tableau Public
Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates the
insights of the data through data visualisation. Through Tableau, an analyst can check a hypothesis
and explore the data before starting to work on it extensively.
4. Content Grabber
Content Grabber is a data extraction tool. It is suitable for people with advanced programming
skills. It is a web crawling software. Businesses can use it to extract content and save it in a
structured format. It offers editing and debugging facility among many others for analysis later.
Analysis vs reporting
“Analytics” means raw data analysis. Typical analytics requests usually imply a one-off data investigation, whereas “reporting” means data delivered to inform decisions. Typical reporting requests usually imply repeatable access to the information, which could be monthly, weekly, daily, or even real-time.
3. Python: Python is an object-oriented scripting language which is easy to read, write, and maintain, and it is a free open source tool. It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
Python is easy to learn as it is in some ways similar to JavaScript, Ruby, and PHP. Also, Python has very good machine learning libraries, e.g. scikit-learn, Theano, TensorFlow, and Keras. Another important feature of Python is that it can work with data from almost any source, such as SQL Server, a MongoDB database, or JSON. Python can also handle text data very well.
4. SAS: SAS is a programming environment and language for data manipulation and a leader in analytics, developed by the SAS Institute in 1966 and further developed in the 1980s and 1990s. SAS is easily accessible and manageable and can analyze data from any source. In 2011 SAS introduced a large set of products for customer intelligence and numerous SAS modules for web, social media, and marketing analytics that are widely used for profiling customers and prospects. It can also predict their behaviour and manage and optimize communications.
5. Apache Spark: Apache Spark is a fast, large-scale data processing engine that executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. Spark is built with data science in mind and its concepts make data science effortless. Spark is also popular for building data pipelines and developing machine learning models.
Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques such as classification, regression, collaborative filtering, clustering, etc.
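A minimal sketch of using MLlib through PySpark is given below; it assumes a local Spark installation, and the data and column names are invented purely for illustration:

# Minimal sketch, assuming PySpark is installed; data and column names are made up.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("mllib-demo").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1), (0.1, 0.9, 0)],
    ["x1", "x2", "label"],
)
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
spark.stop()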
6. Excel: Excel is a basic, popular and widely used analytical tool almost in all industries. Whether
you are an expert in SAS, R, or Tableau, you will still need to use Excel. Excel becomes important when analytics is required on the client's internal data. It analyzes complex tasks and summarizes the data with a preview of pivot tables that helps in filtering the data as per client requirements. Excel has an advanced business analytics option which assists with modelling capabilities through prebuilt options like automatic relationship detection, creation of DAX measures, and time grouping.
7. RapidMiner: RapidMiner is a powerful integrated data science platform, developed by the company of the same name, that performs predictive analysis and other advanced analytics like data mining, text analytics, machine learning, and visual analytics without any programming. RapidMiner can incorporate almost any data source type, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc. The tool is very powerful and can generate analytics based on real-life data transformation settings.
Applications of data analytics
1. Security
Data analytics applications or, more specifically, predictive analysis has also helped in dropping
crime rates in certain areas. In a few major cities like Los Angeles and Chicago, historical and
geographical data has been used to isolate specific areas where crime rates could surge. On that
basis, while arrests could not be made on a whim, police patrols could be increased. Thus, using
applications of data analytics, crime rates dropped in these areas.
2. Transportation
Data analytics can be used to revolutionize transportation. It can be used especially in areas where
you need to transport a large number of people to a specific area and require seamless
transportation. This data analytical technique was applied in the London Olympics a few years ago.
For this event, around 18 million journeys had to be made. So, the train operators and TFL were
able to use data from similar events, predict the number of people who would travel, and then
ensure that the transportation was kept smooth.
3. Risk detection
One of the first data analytics applications may have been in the discovery of fraud. Many
organizations were struggling under debt, and they wanted a solution to this problem. They already
had enough customer data in their hands, and so, they applied data analytics. They used ‘divide and
conquer’ policy with the data, analyzing recent expenditure, profiles, and any other important
information to understand any probability of a customer defaulting. Eventually, it led to lower risks
and fraud.
4. Risk Management
Risk management is an essential aspect in the world of insurance. While a person is being insured,
there is a lot of data analytics that goes on during the process. The risk involved while insuring the
person is based on several data like actuarial data and claims data, and the analysis of them helps
insurance companies to realize the risk.
5. Delivery
Several top logistic companies like DHL and FedEx are using data analysis to examine collected
data and improve their overall efficiency. Using data analytics applications, the companies were
able to find the best shipping routes, delivery time, as well as the most cost-efficient transport
means. Using GPS and accumulating data from the GPS gives them a huge advantage in data
analytics.
6. Fast Internet Allocation
While it might seem that allocating fast internet in every area makes a city ‘Smart’, in reality it is more important to engage in smart allocation. Smart allocation means understanding how bandwidth is being used in specific areas and for the right causes.
7. Reasonable Expenditure
When one is building smart cities, it becomes difficult to plan them out in the right way. Remodelling a landmark or making any change could incur large amounts of expenditure, which might eventually turn out to be a waste.
8. Interaction with Customers
In insurance, there should be a healthy relationship between the claims handlers and customers.
Hence, to improve their services, many insurance companies often use customer surveys to collect
data. Since insurance companies target a diverse group of people, each demographic has their own
preference when it comes to communication.
9. Planning of cities
One of the untapped disciplines where data analysis can really grow is city planning. While many city planners might be hesitant to use data analysis in their favour, avoiding it only results in faulty cities riddled with congestion. Using data analysis would help improve accessibility and minimize overloading in the city.
10. Healthcare
While medicine has come a long way since ancient times and is ever-improving, it remains a costly
affair. Many hospitals are struggling with the cost pressures that modern healthcare has come with,
which includes the use of sophisticated machinery, medicines, etc.
But now, with the help of data analytics applications, healthcare facilities can track the treatment of patients and patient flow, as well as how equipment is being used in hospitals.
Need of Data Analytics Life Cycle
Data analytics is important because it helps businesses optimize their performance. ... A company can also use data analytics to make better business decisions and help analyze customer trends and satisfaction, which can lead to new and better products and services.
1. Business User :
The business user is the one who understands the main domain area of the project and is also the one who typically benefits from the results.
This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be used in operations.
A business manager, line manager, or deep subject matter expert in the project domain usually fulfils this role.
2. Project Sponsor :
The project sponsor is the one who is responsible for initiating the project. The project sponsor provides the actual requirements for the project and presents the basic business issue.
He or she generally provides the funds and measures the degree of value from the final output of the team working on the project.
This person sets out the prime concern and shapes the desired output.
3. Project Manager :
This person ensures that the key milestones and objectives of the project are met on time and with the expected quality.
5. Database Administrator (DBA) :
The DBA facilitates and arranges the database environment to support the analytics needs of the team working on the project.
6. Data Engineer :
The data engineer has deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
The data engineer works jointly with the data scientist to help shape the data in the correct ways for analysis.
7. Data Scientist :
The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to a given business issue.
He or she ensures the overall analytical objectives are met.
Data scientists outline and apply analytical methods and work with the data available for the concerned project.
Phase 1: Discovery –
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates initial hypotheses that can later be tested with data.
Phase 3: Model Planning –
Team explores data to learn about relationships between variables and subsequently selects key variables and the most suitable models.
Phase 4: Model Building –
In this phase, the data science team develops data sets for training, testing, and production purposes.
Team builds and executes models based on the work done in the model planning phase.
Several tools commonly used for this phase are Matlab and STATISTICA.
Phase 5: Communicate Results –
After executing the model, the team needs to compare the outcomes of modeling to the criteria established for success and failure.
Team considers how best to articulate findings and outcomes to various team members and
stakeholders, taking into account caveats and assumptions.
Team should identify key findings, quantify business value, and develop narrative to
summarize and convey findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and to make adjustments before full deployment.
The team delivers final reports, briefings, codes.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Unit-II
Data Analysis
Data Analysis: Data analysis is defined as a process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
Regression Modeling: Regression is a method to mathematically formulate relationship between
variables that in due course can be used to estimate, interpolate and extrapolate. Suppose we want to
estimate the weight of individuals, which is influenced by height, diet, workout, etc. Here, Weight is
the predicted variable. Height, Diet, Workout are predictor variables.
The predicted variable is a dependent variable in the sense that it depends on the predictors. Predictors are also called independent variables. Regression reveals to what extent the predicted variable is affected by the predictors, in other words, what amount of variation in the predictors will result in variation of the predicted variable. The predicted variable is mathematically represented as Y. The predictor variables are represented as X1, X2, X3, etc. This mathematical relationship is often called the regression model.
Regression models are widely used in analytics, in general being among the most easy to
understand and interpret type of analytics techniques. Regression techniques allow the identification
and estimation of possible relationships between a pattern or variable of interest, and factors that
influence that pattern. For example, a company may be interested in understanding the
effectiveness of its marketing strategies. It may deploy a variety of marketing activities in a given
time period, perhaps TV advertising, and print advertising, social media campaigns, radio
advertising and so on. A regression model can be used to understand and quantify which of its
marketing activities actually drive sales, and to what extent. The advantage of regression over
simple correlations is that it allows you to control for the simultaneous impact of multiple other
factors that influence your variable of interest, or the “target” variable. That is, in this example,
things like pricing changes or competitive activities also influence sales of the brand of interest, and
the regression model allows you to account for the impacts of these factors when you estimate the true impact of, say, each type of marketing activity on sales.
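The sketch below mirrors the marketing example with synthetic data: several predictors are fitted jointly, so the estimated effect of each activity controls for the others. The library call and the numbers are illustrative and not taken from the text:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic illustration: sales (Y) driven by TV spend, print spend and price (X1, X2, X3).
rng = np.random.default_rng(0)
n = 200
tv, print_ads, price = rng.uniform(0, 100, n), rng.uniform(0, 50, n), rng.uniform(5, 15, n)
sales = 3.0 * tv + 1.5 * print_ads - 20.0 * price + rng.normal(0, 25, n)

X = np.column_stack([tv, print_ads, price])     # predictor variables
model = LinearRegression().fit(X, sales)        # predicted variable Y = sales
print("estimated effects:", model.coef_)        # each coefficient controls for the other factors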
Multivariate analysis: It is a set of statistical techniques used for analysis of data that contain
more than one variable. Multivariate analysis is typically used for:
Bayesian modeling: Bayesian analysis is a statistical paradigm that answers research questions
about unknown parameters using probability statements. For example,
What is the probability that the average male height is between 70 and 80 inches or that the
average female height is between 60 and 70 inches?
What is the probability that people in a particular state vote Republican or vote
Democratic?
What is the probability that treatment A is more cost effective than treatment B for a
specific health care provider?
What is the probability that a patient's blood pressure decreases if he or she is prescribed
drug A?
What is the probability that the odds ratio is between 0.3 and 0.5?
What is the probability that three out of five quiz questions will be answered correctly by
students?
What is the probability that children with ADHD underperform relative to other children
on a standardized test?
What is the probability that there is a positive effect of schooling on wage?
Such probabilistic statements are natural to Bayesian analysis because of the underlying assumption
that all parameters are random quantities. In Bayesian analysis, a parameter is summarized by an
entire distribution of values instead of one fixed value as in classical frequentist analysis.
Estimating this distribution, a posterior distribution of a parameter of interest, is at the heart of
Bayesian analysis.
Bayesian inference: It uses the posterior distribution to form various summaries for the model
parameters, including point estimates such as posterior means, medians, percentiles, and interval
estimates known as credible intervals. Moreover, all statistical tests about model parameters can be
expressed as probability statements based on the estimated posterior distribution.
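As a small, concrete illustration of these ideas, the sketch below computes a posterior distribution for an unknown proportion using a conjugate Beta prior; the prior and the observed counts are assumptions chosen only for the example:

from scipy import stats

# Toy Beta-Binomial example: posterior for an unknown proportion after 7 successes in 10 trials.
prior_a, prior_b = 1.0, 1.0          # uniform Beta(1, 1) prior (an assumption)
successes, trials = 7, 10
posterior = stats.beta(prior_a + successes, prior_b + trials - successes)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
print("P(theta > 0.5):", 1 - posterior.cdf(0.5))   # a probability statement about a parameter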
Bayesian Network: These are a type of probabilistic graphical model that uses Bayesian inference
for probability computations. Bayesian networks aim to model conditional dependence, and
therefore causation, by representing conditional dependence by edges in a directed graph. Through
these relationships, one can efficiently conduct inference on the random variables in the graph
through the use of factors.
Using the relationships specified by our Bayesian network, we can obtain a compact, factorized
representation of the joint probability distribution by taking advantage of conditional independence.
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency and each node corresponds to a unique random variable. Formally, if an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the joint probability distribution, so we must know P(B|A) for all values of B and A in order to conduct inference. In the classic Cloudy/Sprinkler/Rain/WetGrass example, since Rain has an edge going into WetGrass, it means that P(WetGrass|Rain) will be a factor, whose probability values are specified next to the WetGrass node in a conditional probability table.
Bayesian networks satisfy the local Markov property, which states that a node is conditionally
independent of its non-descendants given its parents. In the above example, this means that
P(Sprinkler|Cloudy, Rain) = P(Sprinkler|Cloudy) since Sprinkler is conditionally independent of its
non-descendant, Rain, given Cloudy. This property allows us to simplify the joint distribution,
obtained in the previous section using the chain rule, to a smaller form. After simplification, the joint distribution for a Bayesian network is equal to the product of P(node|parents(node)) for all nodes:
P(X1, X2, ..., Xn) = P(X1|parents(X1)) × P(X2|parents(X2)) × ... × P(Xn|parents(Xn)).
Inference
The first is simply evaluating the joint probability of a particular assignment of values for each
variable (or a subset) in the network. For this, we already have a factorized form of the joint
distribution, so we simply evaluate that product using the provided conditional probabilities. If we
only care about a subset of variables, we will need to marginalize out the ones we are not interested
in. In many cases, this may result in underflow, so it is common to take the logarithm of that
product, which is equivalent to adding up the individual logarithms of each term in the product.
The second, more interesting inference task, is to find P(x|e), or, to find the probability of some
assignment of a subset of the variables (x) given assignments of other variables (our evidence, e). In
the above example, an example of this could be to find P(Sprinkler, WetGrass | Cloudy), where
{Sprinkler, WetGrass} is our x, and {Cloudy} is our e. In order to calculate this, we use the fact that
P(x|e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant that we will calculate at the
end such that P(x|e) + P(¬x | e) = 1. In order to calculate P(x, e), we must marginalize the joint
probability distribution over the variables that do not appear in x or e, which we will denote as Y.
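The sketch below ties these pieces together for the Cloudy/Sprinkler/Rain/WetGrass network: the joint distribution is the product of P(node|parents(node)), and P(x|e) is obtained by marginalizing over the remaining variables and normalizing by the evidence. The conditional probability values are illustrative assumptions, since the conditional probability tables from the original figure are not reproduced here:

from itertools import product

# Assumed (illustrative) CPT values for the classic Cloudy/Sprinkler/Rain/WetGrass network.
P_C = {True: 0.5, False: 0.5}                              # P(Cloudy)
P_S_given_C = {True: 0.1, False: 0.5}                      # P(Sprinkler=T | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}                      # P(Rain=T | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(WetGrass=T | Sprinkler, Rain)

def joint(c, s, r, w):
    # Product of P(node | parents(node)) for all nodes.
    p = P_C[c]
    p *= P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p *= P_R_given_C[c] if r else 1 - P_R_given_C[c]
    p *= P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return p

# P(Sprinkler=T, WetGrass=T | Cloudy=T): marginalize Rain, then normalize by the evidence.
numerator = sum(joint(True, True, r, True) for r in (True, False))
evidence = sum(joint(True, s, r, w) for s, r, w in product((True, False), repeat=3))
print("P(Sprinkler, WetGrass | Cloudy) =", numerator / evidence)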
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used in classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes very well.
Working of SVM:
1. Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
Now, identify the right hyper-plane to classify star and circle.
“Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane “B”
has excellently performed this job.
Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and
C) and all are segregating the classes well. Now, How can we identify the right hyper-plane?
Here, maximizing the distances between nearest data point (either class) and hyper-plane will help
us to decide the right hyper-plane. This distance is called as Margin. Let’s look at the below
snapshot:
Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3):Hint: Use the rules as discussed in previous section to
identify the right hyper-plane
Some of you may have selected the hyper-plane B as it has higher margin compared to A. But, here
is the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.
SVM Kernel: In the SVM classifier, it is easy to have a linear hyper-plane between two linearly separable classes. But another burning question which arises is: do we need to add such a feature manually to obtain a hyper-plane when the data are not linearly separable? No, the SVM algorithm has a technique called the kernel trick. The SVM kernel is a function that takes a low dimensional input space and transforms it into a higher dimensional space, i.e. it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, it does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you have defined.
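A minimal sketch of the kernel trick in practice, using scikit-learn's SVC on a synthetic, non-linearly separable problem (the data and parameters are illustrative):

import numpy as np
from sklearn.svm import SVC

# Toy problem with a circular class boundary, which no linear hyper-plane can separate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

clf = SVC(kernel="rbf", C=1.0)      # the RBF kernel implicitly maps to a higher dimensional space
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("prediction for (0, 0):", clf.predict([[0.0, 0.0]])[0])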
4. Intervention analysis: how does a single event change the time series?
A time series model is a tool used to predict future values of a series by analyzing the relationship
between the values observed in the series and the time of their occurrence. Time series models can
be developed using a variety of time series statistical techniques. If there has been any trend and/or
seasonal variation present in the data in the past then time series models can detect this variation,
use this information in order to fit the historical data as closely as possible, and in doing so improve
the precision of future forecasts. There are many traditional techniques used in time series analysis.
Some of these include:
■Exponential Smoothing
■Autoregression
■Intervention Analysis
■Seasonal Decomposition
ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models
is that the variation accounted for in the series variable can be divided into three components:
■Autoregressive (AR)
■Integrated (I)
■Moving average (MA)
An ARIMA model can have any component, or combination of components, at both the
nonseasonal and seasonal levels. There are many different types of ARIMA models and the general
form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:
■p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA
model (and P the order of the seasonal autoregressive process)
■d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal
integration or differencing)
■q refers to the order of the nonseasonal moving average process incorporated in the model (and Q
the order of the seasonal moving average process).
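A minimal sketch of fitting an ARIMA(p,d,q) model is shown below, assuming the statsmodels library is available; the series is synthetic and the chosen orders are illustrative (a seasonal_order=(P,D,Q,s) argument can be added for the seasonal part):

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Synthetic integrated series with drift, so one order of differencing (d=1) is sensible.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(0.5, 1.0, 120))

model = ARIMA(series, order=(1, 1, 1))    # p=1 AR term, d=1 differencing, q=1 MA term
fitted = model.fit()
print(fitted.params)                      # estimated AR, MA and variance parameters
print("next 5 forecasts:", fitted.forecast(steps=5))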
Detecting trends and patterns in financial data is of great interest to the business world to support
the decision-making process. A new generation of methodologies, including neural networks,
knowledge-based systems and genetic algorithms, has attracted attention for analysis of trends and
patterns. In particular, neural networks are being used extensively for financial forecasting with
stock markets, foreign exchange trading, commodity future trading and bond yields. The application
of neural networks in time series forecasting is based on the ability of neural networks to
approximate nonlinear functions. In fact, neural networks offer a novel technique that doesn’t
require a pre-specification during the modelling process because they independently learn the
relationship inherent in the variables. The term neural network applies to a loosely related family of
models, characterized by a large parameter space and flexible structure, descending from studies such as the study of brain function.
Rule Induction:
Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic
decision rules explain an inherent relationship between the attributes and class labels in the data set.
Many real-life experiences are based on intuitive rule induction. For example, we can proclaim a
rule that states “if it is 8 a.m. on a weekday, then highway traffic will be heavy” and “if it is 8 p.m.
on a Sunday, then the traffic will be light.” These rules are not necessarily right all the time. 8 a.m.
weekday traffic may be light during a holiday season. But, in general, these rules hold true and are
deduced from real-life experience based on our everyday observations. Rule induction provides a powerful classification approach.
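One common way to induce such if-then rules from data is to fit a shallow decision tree and read each branch as a rule; the sketch below uses scikit-learn and its bundled iris data purely as an illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Induce readable if-then rules by fitting a shallow decision tree.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))   # prints the branches as rules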
Learning:
The principal reason why neural networks have attracted such interest, is the existence of learning
algorithms for neural networks: algorithms that use data to estimate the optimal weights in a
network to perform some task. There are three basic approaches to learning in neural networks.
Supervised learning: It uses a training set that consists of a set of pattern pairs: an input pattern and the corresponding desired (or target) output pattern. The desired output may be regarded as the network's "teacher" for that input. The basic approach in supervised learning is for the network to compute the output its current weights produce for a given input, and to compare this network output with the desired output. The aim of the learning algorithm is to adjust the weights so as to minimize the difference between the network output and the desired output.
Reinforcement learning: It uses much less supervision. If a network aims to perform some task, then the reinforcement signal is a simple "yes" or "no" at the end of the task to indicate whether the task has been performed satisfactorily.
Unsupervised learning: It uses only input data; there is no training signal, unlike the previous two approaches. The aim of unsupervised learning is to make sense of some data set, for example by clustering similar patterns together or compressing the data.
Generalization of Neural Network: One of the major advantages of neural nets is their ability to
generalize. This means that a trained net could classify data from the same class as the learning data
that it has never seen before. In real world applications developers normally have only a small part
of all possible patterns for the generation of a neural net. To reach the best generalization, the
dataset should be split into three parts:
The training set is used to train a neural net. The error of this dataset is minimized during
training.
The validation set is used to determine the performance of a neural network on patterns that
are not trained during learning.
A test set for finally checking the overall performance of a neural net.
The learning should be stopped in the minimum of the validation set error. At this point the net
generalizes best. When learning is not stopped, overtraining occurs and the performance of the net
on the whole data decreases, despite the fact that the error on the training data still gets smaller.
After finishing the learning phase, the net should be finally checked with the third data set, the test
set.
In a competitive learning model, there are hierarchical sets of units in the network with inhibitory
and excitatory connections. The excitatory connections are between individual layers and the
inhibitory connections are between units in layered clusters. Units in a cluster are either active or
inactive.
Principal Component Analysis: Principal components analysis (PCA) is a statistical technique that allows identifying underlying linear patterns in a data set so it can be expressed in terms of another data set of a significantly lower dimension without much loss of information.
The final data set should explain most of the variance of the original data set by reducing the
number of variables. The final variables will be named as principal components.
1. Subtract mean: The first step in the principal component analysis is to subtract the mean for
each variable of the data set.
2. Calculate the covariance matrix: The covariance of two random variables measures the
degree of variation from their means for each other. The sign of the covariance provides us
with information about the relation between them:
• If the covariance is positive, then the two variables increase and decrease together.
• If the covariance is negative, then when one variable increases, the other decreases,
and vice versa.
These values determine the linear dependencies between the variables, which will be used to reduce
the data set's dimension.
3. Calculate eigenvectors and eigenvalues: Eigenvectors are defined as those vectors whose
directions remain unchanged after a linear transformation has been applied. However, their length may not remain the same after the transformation, i.e., the result of this transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it. The number of eigenvectors or components that we can calculate for each data set is equal to the data set's dimension. Since they are calculated
from the covariance matrix described before, eigenvectors represent the directions in which
the data have a higher variance. On the other hand, their respective eigenvalues determine
the amount of variance that the data set has in that direction.
4. Select principal components: Among the eigenvectors previously calculated, we must select those onto which we project the data. The selected eigenvectors will be
called principal components. To establish a criterion to select the eigenvectors, we must first
define the relative variance of each and the total variance of a data set. The relative variance
of an eigenvector measures how much information can be attributed to it. The total variance
of a data set is the sum of the variance of all the variables. These two concepts are
determined by the eigenvalues.
5. Reduce the data dimension: Once we have selected the principal components, the data
must be projected onto them. Although this projection can explain most of the variance of
the original data, we have lost the information about the variance along with the second
component. In general, this process is irreversible, which means that we cannot recover the
original data from the projection.
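The sketch below walks through the five steps above with NumPy on a small synthetic data set; the data and the choice of keeping two components are illustrative:

import numpy as np

# Synthetic data with correlated variables, so a lower-dimensional description exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.3, 0.1],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.1]])

X_centered = X - X.mean(axis=0)                 # 1. subtract the mean
cov = np.cov(X_centered, rowvar=False)          # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # 3. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]               #    sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

relative_variance = eigvals / eigvals.sum()
k = 2                                           # 4. select the first k principal components
projected = X_centered @ eigvecs[:, :k]         # 5. project: dimension reduced from 3 to 2
print("relative variance:", np.round(relative_variance, 3))
print("reduced shape:", projected.shape)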
Fuzzy logic is defined as a control logic that attempts to use degrees of input and output to approximate human reasoning through a rule-based implementation. It is a technique used for handling imprecise information or facts that involve some degree of uncertainty.
In the context of intelligent data analysis it is of great interest how such fuzzy models can automatically be derived from example data. Since, besides prediction, understandability is of prime concern, the resulting fuzzy model should offer insights into the underlying system. To achieve this, different approaches exist that construct grid-based rule sets defining a global granulation of the input space, as well as fuzzy graph based structures.
Grid-based rule sets model each input variable through a usually small set of linguistic values. The resulting rule base uses all or a subset of all possible combinations of these linguistic values for each variable, resulting in a global granulation of the feature space into "tiles". Extracting grid-based fuzzy models from data is straightforward when the input granulation is fixed, that is, when the antecedents of all rules are predefined. Then only a matching consequent for each rule needs to be found.
In high-dimensional feature spaces a global granulation results in a large number of rules. For such tasks a fuzzy graph based approach is more suitable. A possible disadvantage of the individual membership functions is the potential loss of interpretability: projecting all membership functions onto one variable will usually not lead to meaningful linguistic values. In many data analysis applications, however, such a meaningful granulation of all attributes is either not available or hard to determine automatically.
The primary function of a fuzzy graph is to serve as a representation of an imprecisely defined
dependency. Fuzzy graphs do not have a natural linguistic interpretation of the granulation of their
input space. The main advantage is the low dimensionality of the individual rules. The algorithm
only introduces restriction on few of the available input variables, thus making the extracted rules
easier to interpret. The final set of fuzzy points forms a fuzzy graph, where each fuzzy point is
associated with one output region and is used to compute a membership value for a certain input
pattern. The maximum degree of membership of all fuzzy points for one region determines the
overall degree of membership. Fuzzy inference then produces a soft value for the output and using
the well-known center-of-gravity method a final crisp output value can be obtained, if so desired.
An extension to decision trees based on fuzzy logic can be derived. Different branches of the tree are then distinguished by fuzzy queries. The introduction of fuzzy set theory in Zadeh (1965) offered a general methodology that allows notions of vagueness and imprecision to be considered. Moreover, Zadeh's work allowed the possibility for previously defined techniques to be considered within a fuzzy environment. It was over ten years later that the area of decision trees benefited from this fuzzy environment opportunity.
Decision trees based on fuzzy set theory combine the advantages of the good comprehensibility of decision trees and the ability of fuzzy representations to deal with inexact and uncertain information.
In fuzzy set theory (Zadeh, 1965), the grade of membership of a value x in a set S is defined through a membership function μ(x) that can take a value in the range [0, 1]. The accompanying numerical attribute domain can be described by a finite series of MFs, each of which offers a grade of membership to describe x, and which collectively form its concomitant fuzzy number. In this article, MFs are used to formulate linguistic variables for the considered attributes. These linguistic variables are made up of sets of linguistic terms which are defined by the MFs.
The small data set considered, consists of five objects, described by three condition attributes T1,
T2 and T3, and classified by a single decision attribute C, see Table 1.
If these values are considered imprecise, or fuzzy, there is the option to transform the data values into fuzzy values. Here, each attribute is transformed into a linguistic variable, each described by two linguistic terms.
In Figure 2, the decision attribute C is shown to be described by the linguistic terms CL and CH (possibly denoting the terms low and high). These linguistic terms are themselves defined by the MFs μCL(·) and μCH(·). The hypothetical MFs shown have the respective defining points μCL(·): [-∞, -∞, 9, 25, 32] and μCH(·): [9, 25, 32, ∞, ∞]. To demonstrate their use, for the object u2, with a value C = 17, its fuzzification creates the two values μCL(17) = 0.750 and μCH(17) = 0.250, the larger of which determines the linguistic term associated with the object.
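A small sketch of this fuzzification step is given below, using piecewise-linear membership functions built from the defining points read above; the exact shape of the MFs in the original figure is an assumption, chosen so that it reproduces the quoted values for C = 17:

import numpy as np

# Assumed piecewise-linear MFs for C: mu_CL falls from 1 at 9 to 0.5 at 25 and 0 at 32,
# while mu_CH rises over the same points.
def mu_CL(x):
    return float(np.interp(x, [9, 25, 32], [1.0, 0.5, 0.0]))

def mu_CH(x):
    return float(np.interp(x, [9, 25, 32], [0.0, 0.5, 1.0]))

x = 17                              # decision attribute value of object u2
print("mu_CL(17) =", mu_CL(x))      # 0.75
print("mu_CH(17) =", mu_CH(x))      # 0.25; the larger grade decides the linguistic term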
A similar series of membership functions can be constructed for the three condition attributes, T1,
T2 and T3, Figure 3.
In Figure 3, the linguistic variable version of each condition attribute is described by two linguistic terms (possibly termed low and high), themselves defined by MFs. Using these series of MFs, the example data set can be fuzzified, see Table 2.
In Table 2, each object is described by a series of fuzzy values, two fuzzy values for each
attribute. Also shown in Table 2, in bold, are the larger of the values in each pair of fuzzy values,
with the respective linguistic term this larger value is associated with. Beyond the fuzzification of
the data set, attention turns to the construction of the concomitant fuzzy decision tree for this data.
Prior to this construction process, a threshold value of p = 0.75 for the minimum required truth level
was used throughout.
The construction process starts with the condition attribute that is the root node. For this, it is
necessary to calculate the classification ambiguity G(E) of each condition attribute. The evaluation
of a G(E) value is shown for the first attribute T1 (i.e. g(n(C| T1))), where it is broken down to the
fuzzy labels L and H, for L;
The subsethood values in this case are, for T1: S(T1L, CL) = 0.574 and S(T1L, CH) = 0.426, and S(T1H, CL) = 0.452 and S(T1H, CH) = 0.548. For T1L and T1H, the larger subsethood value (in bold) defines the possible classification for that path. In both cases these values are less than the threshold truth value of 0.75 employed, so neither of these paths can be terminated at a leaf node; instead further augmentation of them is considered.
With three condition attributes included in the example data set, the possible augmentation to
T1L is with either T2 or T3. Concentrating on T2, where G(T1L) = 0.514, the ambiguity with partition evaluated for T2 (G(T1L and T2| C)) has to be less than this value, where;
Starting with the weight values, in the case of T1L and T2L, it follows;
A concomitant value is G(T1L and T3| C) = 0.487; the lower of these, G(T1L and T2| C), is lower than the concomitant G(T1L) = 0.514, so less ambiguity would be found if the T2 attribute were augmented to the path T1 = L. For the subsequent subsethood values in each suggested classification path, the largest subsethood value is above the truth level threshold, therefore they are both leaf nodes leading from the T1 = L path. The construction process continues
in a similar vein for the path T1 = H, with the resultant fuzzy decision tree in this case presented in
Figure 4.
The fuzzy decision tree shows five rules (leaf nodes), R1, R2, …, R5, have been constructed.
There are a maximum of four levels to the tree shown, indicating a maximum of three condition
attributes are used in the rules constructed. In each non-root node shown the subsethood levels to
the decision attribute terms C = L and C = H are shown. On the occasions when the larger of the
subsethood values is above the defined threshold value of 0.75 then they are shown in bold and
accompany the node becoming a leaf node.
Stochastic search is the method of choice for solving many hard combinatorial problems. Stochastic search algorithms are designed for problems with inherent random noise or deterministic problems solved by injected randomness. In structural optimization, these are problems with uncertainties in the design variables, or those where adding random perturbations to deterministic design variables is the method used to perform the search. The search favors designs with better performance. An important
feature of stochastic search algorithms is that they can carry out broad search of the design space
and thus avoid local optima. Also, stochastic search algorithms do not require gradients to guide the
search, making them a good fit for discrete problems. However, there is no necessary condition for
an optimum solution and the algorithm must run multiple times to make sure the attained solutions
are robust. To handle constraints, penalties can also be applied on designs that violate constraints.
For constraints that are difficult to be formulated explicitly, a true/false check is straightforward to
implement. Randomly perturbed designs are checked against constraints, and only those passing the
check will enter the stage of performance evaluation. Stochastic search can be applied on one
design or on a population of them (Leng, 2015), using for example SA or GA, respectively. Arora
(2004) depicts the logic of SA and GA for convenience of application. A monograph devoted to
stochastic search and optimization (Spall, 2003) provides further details on a broad scope, including
mathematical theory, algorithm design, and applications in simulation and control.
The simulated annealing approach is in the realm of problem solving methods that make use of
paradigms found in nature. Annealing denotes the process of cooling a molten substance and is, for
example, used to harden steel. One major effect of "careful" annealing is the condensing of matter
into a crystalline solid. The hardening of steel is achieved by first raising the temperature close to the transition to its liquid phase, then cooling the steel slowly to allow the molecules to arrange in an ordered lattice pattern. Hence, annealing can be regarded as an adaptation process optimizing the stability of the final crystalline solid. Whether a state of minimum free energy is reached depends very much on the actual speed with which the temperature is decreased.
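The toy sketch below mimics this annealing idea on a simple one-dimensional cost function with several local minima; the function, cooling schedule, and perturbation size are illustrative choices rather than a prescribed recipe:

import math
import random

# Toy simulated annealing run on a function with several local minima.
def cost(x):
    return x * x + 10 * math.sin(3 * x)

random.seed(0)
x, temperature, cooling = 5.0, 10.0, 0.99
best_x = x
for _ in range(5000):
    candidate = x + random.gauss(0.0, 0.5)          # random perturbation of the current design
    delta = cost(candidate) - cost(x)
    # Always accept improvements; accept worse designs with probability exp(-delta / T).
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        x = candidate
        if cost(x) < cost(best_x):
            best_x = x
    temperature *= cooling                          # slow "cooling" helps escape local optima early on
print("approximate minimum near x =", round(best_x, 3), "with cost", round(cost(best_x), 3))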
Unit-III
Stream
•Data stream transmits from a source and receives at the processing end in a network
•A continuous stream of data flows between the source and receiver ends, and which is processed in
real time
Data Streams: In many data mining situations, we do not know the entire data set in advance
• Google queries
Imagine a factory with 500 sensors capturing 10 KB of information every second: in one hour nearly 18 GB of information is captured, and around 432 GB daily. This massive amount of information needs to be analysed in real time (or in the shortest time possible) to detect irregularities or deviations in the system and react quickly. Stream mining enables the analysis of massive quantities of data in real time using bounded resources.
Data Stream Mining is the process of extracting knowledge from continuous, rapid data records that come to the system in a stream. A data stream is an ordered sequence of instances in time.
Data stream mining has the following characteristics:
• Continuous stream of data. A high amount of data arrives in an infinite stream; we do not know the entire dataset.
• Concept drifting. The data change or evolve over time.
• Volatility of data. The system does not store the data received (limited resources). When data is analysed it is discarded or summarised.
Streams may be archived in a large archival store, but we assume it is not possible to answer queries
from the archival store. It could be examined only under special circumstances using time-
consuming retrieval processes. There is also a working store, into which summaries or parts of
streams may be placed, and which can be used for answering queries. The working store might be
disk, or it might be main memory, depending on how fast we need to process queries.
But either way, it is of sufficiently limited capacity that it cannot store all the data from all the
streams
Input elements enter at a rapid rate, at one or more input ports (i.e., streams). We call elements of
the stream tuples. The system cannot store the entire stream accessibly. The question arises:
Q: How do you make critical calculations about the stream using a limited amount of (secondary)
memory?
A streaming data architecture is an information technology framework that puts the focus on
processing data in motion and treats extract-transform-load (ETL) batch processing as just one more
event in a continuous stream of events. This type of architecture has three basic components -- an
aggregator that gathers event streams and batch files from a variety of data sources, a broker that
makes data available for consumption and an analytics engine that analyzes the data, correlates
values and blends streams together.
The system that receives and sends data streams and executes the application and real-time analytics
logic is called the stream processor. Because a streaming data architecture supports the concept of
event sourcing, it reduces the need for developers to create and maintain shared databases. Instead,
all changes to an application’s state are stored as a sequence of event-driven processing (ESP)
triggers that can be reconstructed or queried when necessary. Upon receiving an event, the stream
processor reacts in real- or near real-time and triggers an action, such as remembering the event for
future reference.
The growing popularity of streaming data architectures reflects a shift in the development of
services and products from a monolithic architecture to a decentralized one built with
microservices. This type of architecture is usually more flexible and scalable than a classic
database-centric application architecture because it co-locates data processing with storage to lower
application response times (latency) and improve throughput. Another advantage of using a streaming data architecture is that it takes into account the time at which an event occurs, which makes it easier for an application's state and processing to be partitioned and distributed across many instances.
Streaming data architectures enable developers to develop applications that use both bound and
unbound data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data
architecture powered by Apache Flink to update product detail and inventory information in real-
time. Netflix also uses Flink to support its recommendation engines and ING, the global bank based
in The Netherlands, uses the architecture to prevent identity theft and provide better fraud
protection. Other platforms that can accommodate both stream and batch processing include Apache
Spark, Apache Storm, Google Cloud Dataflow and AWS Kinesis.
Stream Computing
A high-performance computer system that analyzes multiple data streams from many sources live. The word stream in stream computing is used to mean pulling in streams of data, processing the data, and streaming it back out as a single flow. Stream computing uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.
In June 2007, IBM announced its stream computing system, called System S. This system runs on
800 microprocessors and the System S software enables software applications to split up tasks and
then reassemble the data into an answer. ATI Technologies also announced a stream computing technology that enables graphics processors (GPUs) to work in conjunction with high-performance, low-latency CPUs to solve complex computational problems. ATI’s stream computing technology is derived from a class of applications that run on the GPU instead of the CPU.
Stream computing enables continuous analysis of massive volumes of streaming data with sub-millisecond response times. The term is also used for the single-instruction-multiple-data (SIMD) computing paradigm applied to certain computational problems, for example in high-performance computing for the theoretical study of nanoscale and molecular interconnects.
Sampling Data in a Stream
Stream sampling is the process of collecting a representative sample of the elements of a data
stream. The sample is usually much smaller than the entire stream, but can be designed to retain
many important characteristics of the stream, and can be used to estimate many important
aggregates on the stream.
As the stream grows, the sample also grows. Since we cannot store the entire stream, one obvious approach is to store a sample. Two different sampling problems arise:
(1) Sample a fixed proportion of the elements of the stream (e.g., 1 in 10).
(2) Maintain a random sample of fixed size over a potentially infinite stream.
What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has an equal probability of being sampled.
Simple question: what fraction of queries by an average search-engine user are duplicates?
Suppose each user issues x queries once and d queries twice (a total of x + 2d queries). If we naively sample 1/10 of the queries, the sample will contain x/10 of the singleton queries and about 2d/10 of the duplicate queries at least once, but only d/100 of the duplicate queries will appear twice in the sample, so the fraction of duplicates computed from the sample is biased.
Solution:
• Pick 1/10th of the users and take all their searches into the sample.
• Use a hash function that hashes the user name or user id uniformly into 10 buckets, and keep the queries of users who hash to one designated bucket.
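A minimal sketch of this idea in Python is shown below; the hash function, the bucket count, and the stream of (user, query) pairs are illustrative assumptions, not a prescribed implementation.

import hashlib

NUM_BUCKETS = 10          # sample roughly 1/10 of the users
SAMPLE_BUCKET = 0         # keep users whose hash falls in this bucket

def in_sample(user_id: str) -> bool:
    """Decide membership by hashing the user id into one of 10 buckets."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS == SAMPLE_BUCKET

# Process a stream of (user, query) pairs, keeping all queries of sampled users.
stream = [("alice", "flights"), ("bob", "news"), ("alice", "flights")]
sample = [(user, query) for user, query in stream if in_sample(user)]
print(sample)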
Filtering Streams
Another common process on streams is selection, or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while other
tuples are dropped. If the selection criterion is a property of the tuple that can be calculated (e.g., the
first component is less than 10), then the selection is easy to do. The problem becomes harder when
the criterion involves a lookup for membership in a set. It is especially hard when that set is too large to store in main memory.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while
rejecting most of the stream elements whose keys are not in S.
To initialize the bit array, begin with all bits 0. Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is h_i(K) for some hash function h_i and some key value K in S.
To test a key K that appears in the stream, hash it with each of the k hash functions and check whether all of the bits h_1(K), h_2(K), . . . , h_k(K) are 1's in the bit-array. If all are 1's, then let the stream element through. If one or more of these bits are 0, then K could not be in S, so reject the stream element.
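The following is a minimal Bloom filter sketch in Python, intended only to illustrate the bit array and the k hash functions described above; the array size, the number of hash functions, and the use of salted MD5 digests are arbitrary assumptions.

import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1024, k: int = 3):
        self.n_bits = n_bits
        self.k = k
        self.bits = [0] * n_bits

    def _positions(self, key: str):
        # Derive k "independent" hash functions by salting one digest.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False => definitely not in S; True => probably in S (false positives possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for email in ["alice@example.com", "bob@example.com"]:   # the set S of allowed keys
    bf.add(email)
print(bf.might_contain("alice@example.com"), bf.might_contain("eve@example.com"))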
Suppose stream elements are chosen from some universal set. We would like to know how many
different elements have appeared in the stream, counting either from the beginning of the stream or
from some known time in the past.
A similar problem arises for a Web site like Google that does not require login to issue a search query, and that may be able to identify users only by the IP address from which they send the query. There are about 2^32 (roughly 4 billion) IP addresses; sequences of four 8-bit bytes will serve as the universal set in this case.
The obvious way to solve the problem is to keep in main memory a list of all the elements seen so
far in the stream. Keep them in an efficient search structure such as a hash table or search tree, so
one can quickly add new elements and check whether or not the element that just arrived on the
stream was already seen. As long as the number of distinct elements is not too great, this structure
can fit in main memory and there is little problem obtaining an exact answer to the question how
many distinct elements appear in the stream.
However, if the number of distinct elements is too great, or if there are too many streams that need
to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its
pages in a month), then we cannot store the needed data in main memory. There are several options.
We could use more machines, each machine handling only one or several of the streams. We could
store most of the data structure in secondary memory and batch stream elements so whenever we
brought a disk block to main memory there would be many tests and updates to be performed on the
data in that block.
It is possible to estimate the number of distinct elements by hashing the elements of the universal
set to a bit-string that is sufficiently long. The length of the bit-string must be sufficient that there
are more possible results of the hash function than there are elements of the universal set. For
example, 64 bits is sufficient to hash URL’s. We shall pick many different hash functions and hash
each element of the stream using these hash functions. The important property of a hash function is
that when applied to the same element, it always produces the same result.
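This is the idea behind the Flajolet–Martin approach: hash each element and look at the number of trailing zeros in the resulting bit-string; if R is the largest number of trailing zeros seen, then 2^R is an estimate of the number of distinct elements. The sketch below is a simplified single-hash illustration in Python; a practical estimator would combine many hash functions and average or take medians of their estimates, and the small example stream is made up.

import hashlib

def trailing_zeros(n: int) -> int:
    """Number of trailing zero bits in the binary representation of n."""
    if n == 0:
        return 0
    count = 0
    while n & 1 == 0:
        n >>= 1
        count += 1
    return count

def fm_estimate(stream):
    """Estimate the number of distinct elements as 2^R (single hash function)."""
    max_r = 0
    for element in stream:
        h = int(hashlib.md5(str(element).encode("utf-8")).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r

stream = ["a", "b", "a", "c", "b", "d", "a"]
print(fm_estimate(stream))   # rough estimate of the 4 distinct elements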
Estimating Moments
The problem, called computing “moments,” involves the distribution of frequencies of different
elements in the stream.
Definition of Moments
Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the ith element for any i. Let m_i be the number of occurrences of the ith element. Then the kth-order moment (or just kth moment) of the stream is the sum over all i of (m_i)^k.
The 1st moment is the sum of the m_i's, which must be the length of the stream. Thus, first moments are especially easy to compute: just count the length of the stream seen so far.
The second moment is the sum of the squares of the m_i's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.
To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.
The Alon–Matias–Szegedy (AMS) method estimates the second moment using randomly chosen variables. Each variable X consists of two components:
1. A particular element of the universal set, which we refer to as X.element, and
2. An integer X.value, which is the value of the variable. To determine the value of a variable X, we choose a position in the stream between 1 and n, uniformly and at random. Set X.element to be the element found there, and initialize X.value to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of X.element.
For example, suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b, of length n = 15, and we choose positions 3, 8, and 13 for the three variables X_1, X_2, and X_3. Position 3 holds c, so X_1.element = c and X_1.value = 1. At position 8 we find d, and so set X_2.element = d and X_2.value = 1. Positions 9 and 10 hold a and b, so they do not affect X_1 or X_2. Position 11 holds d, so we set X_2.value = 2, and position 12 holds c, so we set X_1.value = 3. At position 13, we find element a, and so set X_3.element = a and X_3.value = 1. Then, at position 14 we see another a and so set X_3.value = 2. Position 15, with element b, does not affect any of the variables, so we are done, with final values X_1.value = 3 and X_2.value = X_3.value = 2. We can derive an estimate of the second moment from any variable X. This estimate is n(2X.value − 1).
Consider the three variables from the above example. From X_1 we derive the estimate n(2X_1.value − 1) = 15 × (2 × 3 − 1) = 75. The other two variables, X_2 and X_3, each have value 2 at the end, so their estimates are 15 × (2 × 2 − 1) = 45. Recall that the true value of the second moment for this stream is 59. On the other hand, the average of the three estimates is 55, a fairly close approximation.
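A minimal Python sketch of this estimator is given below; it fixes the variable positions in advance and uses the small example stream above, purely to illustrate the n(2X.value − 1) estimate and the averaging of several variables.

def ams_second_moment(stream, positions):
    """Estimate the 2nd moment using one AMS variable per chosen position."""
    n = len(stream)
    estimates = []
    for p in positions:                      # positions are 1-based
        element = stream[p - 1]
        # X.value counts occurrences of X.element from position p onward.
        value = sum(1 for x in stream[p - 1:] if x == element)
        estimates.append(n * (2 * value - 1))
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")             # the example stream of length 15
print(ams_second_moment(stream, positions=[3, 8, 13]))   # prints 55.0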
Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N . As in
previous sections, we focus on the situation where we cannot afford to store the entire window.
After showing an approximate algorithm for the binary case, we discuss how this idea can be
extended to summing numbers.
To begin, suppose we want to be able to count exactly the number of 1’s in the last k bits for any k ≤ N. Then we claim it is necessary to store all N bits of the window, as any representation that used fewer than N bits could not work. In proof, suppose we have a representation that uses fewer than N bits to represent the N bits in the window. Since there are 2^N sequences of N bits, but fewer than 2^N representations, there must be two different bit strings w and x that have the same representation. Since w ≠ x, they must differ in at least one bit. Let the last k − 1 bits of w and x agree, but let them differ on the kth bit from the right end.
Example : If w = 0101 and x = 1010, then k = 1, since scanning from the right, they first disagree at
position 1. If w = 1001 and x = 0101, then k = 3, because they first disagree at the third position
from the right.
Suppose the data representing the contents of the window is whatever sequence of bits represents
both w and x. Ask the query “how many 1’s are in the last k bits?” The query-answering algorithm
will produce the same answer, whether the window contains w or x, because the algorithm can only
see their representation. But the correct answers are surely different for these two bit-strings. Thus,
we have proved that we must use at least N bits to answer queries about the last k bits for any k.
In fact, we need N bits, even if the only query we can ask is “how many 1’s are in the entire window
of length N ?” The argument is similar to that used above. Suppose we use fewer than N bits to
represent the window, and therefore we can find w, x, and k as above. It might be that w and x have
the same number of 1’s, as they did in both cases of the example above. However, if we follow the
current window by any N − k bits, we will have a situation where the true window contents
resulting from w and x are identical except for the leftmost bit, and therefore, their counts of 1’s are
unequal. However, since the representations of w and x are the same, the representation of the
window must still be the same if we feed the same bit sequence to these representations.
We shall present the simplest case of an algorithm called DGIM. This version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%. Later, we shall discuss an improvement of the method that limits the error to any fraction ε > 0, and still uses only O(log^2 N) bits (although with a constant factor that grows as ε shrinks).
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by log_2 N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is.
We divide the window into buckets, each represented by:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
To represent a bucket, we need log_2 N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1’s we only need log_2 log_2 N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log_2 N, it requires log_2 log_2 N bits. Thus, O(log N) bits suffice to represent a bucket.
There are six rules that must be followed when representing a stream by buckets:
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• No position is in more than one bucket.
• There are one or two buckets of any given size, up to some maximum size.
• All bucket sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
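The following Python sketch is an illustrative simplification of DGIM, not the full algorithm: it keeps buckets as (timestamp, size) pairs, merges the two oldest buckets whenever three buckets of the same size exist, and estimates the count of 1's in the last k bits by counting only half of the oldest bucket that overlaps the query range. The window size, the example bit stream, and the query length are made-up values.

class DGIM:
    """Simplified DGIM: approximate count of 1's in the last k bits of a stream."""

    def __init__(self, window_size: int):
        self.N = window_size
        self.t = 0                    # current timestamp
        self.buckets = []             # list of (right_end_timestamp, size), newest first

    def add(self, bit: int):
        self.t += 1
        # Drop buckets that have slid entirely out of the window.
        self.buckets = [(ts, sz) for ts, sz in self.buckets if ts > self.t - self.N]
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # Whenever three buckets share a size, merge the two oldest of them.
            size = 1
            while sum(1 for _, sz in self.buckets if sz == size) == 3:
                idxs = [i for i, (_, sz) in enumerate(self.buckets) if sz == size]
                i, j = idxs[-2], idxs[-1]                 # the two oldest of that size
                merged = (self.buckets[i][0], 2 * size)   # keep the newer right end
                self.buckets[j] = merged
                del self.buckets[i]
                size *= 2

    def count_ones(self, k: int) -> int:
        """Estimate the number of 1's among the last k bits (k <= N)."""
        total, last_size = 0, 0
        for ts, sz in self.buckets:
            if ts > self.t - k:
                total += sz
                last_size = sz
        # The oldest overlapping bucket may extend beyond the range: count half of it.
        return total - last_size // 2

d = DGIM(window_size=100)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    d.add(bit)
print(d.count_ones(10))   # approximate number of 1's in the last 10 bits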
Decaying Windows
We have assumed that a sliding window held a certain tail of the stream, either the most recent N
elements for fixed N , or all the elements that arrived after some time in the past. Sometimes we do
not want to make a sharp distinction between recent elements and those in the distant past, but want
to weight the recent elements more heavily. In this section, we consider “exponentially decaying
windows,” and an application where they are quite useful: finding the most common “recent”
elements.
Suppose we have a stream whose elements are the movie tickets purchased all over the world, with
the name of the movie as part of the element. We want to keep a summary of the stream that is the
most popular movies “currently.” While the notion of “currently” is imprecise, intuitively, we want
to discount the popularity of a movie like Star Wars–Episode 4, which sold many tickets, but most
of these were sold decades ago. On the other hand, a movie that sold n tickets in each of the last 10
weeks is probably more popular than a movie that sold 2n tickets last week but nothing in previous
weeks. One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the
ith ticket is for that movie, and 0 otherwise. Pick a window size N , which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of Section
4.6 to estimate the number of tickets for each movie, and rank movies by their estimated counts.
This technique might work for movies, because there are only thousands of movies, but it would fail
if we were instead recording the popularity of items sold at Amazon, or the rate at which different
Twitter-users tweet, because there are too many Amazon products and too many tweeters.
An alternative approach is to redefine the question so that we are not asking for a count of 1’s in a
window. Rather, let us compute a smooth aggregation of all the 1’s ever seen in the stream, with
decaying weights, so the further back in the stream, the less weight is given. Formally, let a stream currently consist of the elements a_1, a_2, . . . , a_t, where a_1 is the first element to arrive and a_t is the current element. Let c be a small constant, such as 10^-6 or 10^-9. The exponentially decaying window for this stream is the sum, over i = 0, 1, . . . , t − 1, of a_{t−i}(1 − c)^i.
It is much easier to adjust the sum in an exponentially decaying window than in a sliding window of
fixed length. In the sliding window, we have to worry about the element that falls out of the window
each time a new element arrives. That forces us to keep the exact elements along with the sum, or to
use an approximation scheme such as DGIM. However, when a new element a_{t+1} arrives at the stream input, all we need to do is:
1. Multiply the current sum by (1 − c).
2. Add a_{t+1}.
The reason this method works is that each of the previous elements has now moved one position further from the current element, so its weight is multiplied by 1 − c. Further, the weight on the current element is (1 − c)^0 = 1, so adding a_{t+1} is the correct way to include the new element’s contribution.
Let us return to the problem of finding the most popular movies in a stream of ticket sales. We shall use an exponentially decaying window with a constant c, which you might think of as 10^-9. That is, we approximate a sliding window holding the last one billion ticket sales. For each movie, we imagine a separate stream with a 1 each time a ticket for that movie appears in the stream, and a 0 each time a ticket for some other movie arrives. The decaying sum of the 1’s measures the current popularity of the movie. We imagine that the number of possible movies in the stream is huge, so we do not want to record values for the unpopular movies. Therefore, we establish a threshold, say 1/2, so that if the popularity score for a movie goes below this number, its score is dropped from the counting. For reasons that will become obvious, the threshold must be less than 1, although it can be any number less than 1. When a new ticket arrives on the stream, do the following:
1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).
2. Suppose the new ticket is for movie M. If there is currently a score for M, add 1 to that score. If there is no score for M, create one and initialize it to 1.
3. If the score for any movie falls below the threshold 1/2, drop that score.
It may not be obvious that the number of movies whose scores are maintained at any time is limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies with a score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the number of movies being counted at any time. Of course in practice, the ticket sales would be concentrated on only a small number of movies at any time, so the number of actively counted movies would be much less than 2/c.
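A minimal Python sketch of this procedure is shown below; the decay constant and the threshold are the illustrative values used above, and the tiny ticket stream is a made-up example.

def update_scores(scores: dict, movie: str, c: float = 1e-9, threshold: float = 0.5):
    """Process one ticket: decay all scores, credit the movie, drop tiny scores."""
    for m in scores:
        scores[m] *= (1 - c)                         # step 1: decay every maintained score
    scores[movie] = scores.get(movie, 0.0) + 1.0     # step 2: credit the movie on this ticket
    for m in [m for m, s in scores.items() if s < threshold]:
        del scores[m]                                # step 3: drop scores below the threshold
    return scores

scores = {}
for ticket in ["Star Wars", "Dune", "Dune", "Barbie", "Dune"]:
    update_scores(scores, ticket)
print(scores)    # decayed popularity score per movie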
Real-time analytics
• A Real-Time Analytics Platform (RTAP) analyses the data, correlates it, and predicts outcomes in real time.
RTAP
• Apache Spark Streaming — a Big Data platform for data stream analytics in real time.
• Cisco Connected Streaming Analytics (CSA)—a platform that delivers insights from high-
velocity streams of live data from multiple sources and enables immediate action.
RTAP Applications: detecting positive/negative sentiment in live text streams, using cues such as:
1. Negation
2. Positive smiley
3. Negative smiley
5. Laugh
Data science is being used to provide a unique understanding of the stock market and financial data.
Securities, commodities, and stocks follow some basic principles for trading. We can either sell,
buy, or hold. The goal is to make the largest profit possible.
Trading platforms have become very popular in the last two decades, but each platform offers different options, tools, and fees. You need to understand how a platform works, and compare what each one offers its users, in order to pick what is best for you.
There are a lot of phrases used in data science that a person would have to be a scientist to know. At
its most basic level, data science is math that is sprinkled with an understanding of programming
and statistics.
There are certain concepts in data science that are used when analyzing the market. In this context,
we are using the term “analyze” to determine whether it is worth it to invest in a stock. There are
some basic data science concepts that are good to be familiar with.
Algorithms are used extensively in data science. Basically, an algorithm is a group of rules needed
to perform a task. You have likely heard about algorithms being used when buying and selling
stocks. Algorithmic trading is where algorithms set rules for things like when to buy a stock or
when to sell a stock.
For example, an algorithm could be set to purchase a stock once it drops by eight percent over the
course of the day or to sell the stock if it loses 10 percent of its value compared to when it was first
purchased. Algorithms are designed to function without human intervention. You may have heard of
them referred to as bots. Like robots, they make calculated decisions devoid of emotions.
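As a toy illustration of such a rule-based bot, the sketch below (with made-up prices and the 8% / 10% thresholds from the example above) checks one simple buy condition and one simple sell condition; it is not a real trading strategy.

from typing import Optional

def decide(day_open: float, current: float, purchase_price: Optional[float]) -> str:
    """Apply two toy rules: buy on an 8% intraday drop, sell on a 10% loss."""
    if purchase_price is None:
        # Rule 1: buy if the price has dropped 8% or more since the open.
        return "BUY" if current <= day_open * 0.92 else "HOLD"
    # Rule 2: sell if the position has lost 10% or more of its purchase value.
    return "SELL" if current <= purchase_price * 0.90 else "HOLD"

print(decide(day_open=100.0, current=91.5, purchase_price=None))    # BUY
print(decide(day_open=100.0, current=88.0, purchase_price=99.0))    # SELL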
We are not talking about preparing to run a 50 meter race. In machine learning and data science,
training is where data is used to train a machine on how to respond. We can create a learning model.
This machine learning model makes it possible for a computer to make accurate predictions based
on the information it learned from the past. If you want to teach a machine to predict the future of
stock prices, it would need a model of the stock prices of the previous year to use as a base to
predict what will happen.
Suppose we have the data for stock prices for the last year. The training set would be the data from January to October, and we will use November and December as our testing set. Our machine should have learned by evaluating how the stocks behaved from January through October. Now, we ask it to predict what should have happened in November and December of that year. The predictions the machine makes are compared to the real prices. The amount of variation between what the model predicts and the real data is what we try to minimize as we adjust our training model.
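The sketch below shows this train/test split idea on a tiny synthetic daily price series, using a naive "predict the previous day's price" baseline rather than any particular learning algorithm; the data and the error metric (mean absolute error) are illustrative assumptions.

# Tiny synthetic daily closing prices for one year (values are made up).
prices = [100 + 0.1 * day + (2 if day % 7 == 0 else 0) for day in range(365)]

train, test = prices[:304], prices[304:]     # roughly Jan-Oct vs Nov-Dec

# Naive model: tomorrow's price equals today's price (a simple baseline).
predictions = [train[-1]] + test[:-1]

mae = sum(abs(p - t) for p, t in zip(predictions, test)) / len(test)
print(f"Mean absolute error on the Nov-Dec test set: {mae:.3f}")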
Data science relies heavily on modeling. This is an approach that uses math to examine past
behaviors with the goal of forecasting future outcomes. In the stock market, a time series model is
used. A time series is data, which in this case refers to the value of a stock, that is indexed over a
period of time. This period of time could be divided hourly, daily, monthly, or even by the minute. A
time series model is created by using machine learning and/or deep learning models to accumulate
the price data. The data needs to be analyzed and then fitted to match the model. This is what makes
it possible to predict future stock prices over a set timetable.
A second type of modeling that is used in machine learning and in data science is referred to as a
classification model. These models are given data points and then they strive to classify or predict
what is represented by those data points.
When discussing the stock market or stocks in general, a machine learning model can be given
financial data like the P/E ratio, total debt, volume, etc. and then determine if a stock is a sound
investment. Depending on the financials we give, a model can determine if now is the time to sell,
hold, or buy a stock.
A model could fit the training data with so much complexity that it captures noise instead of the true relationship between the features and the target variable. This is referred to as overfitting. Underfitting is where a model does not sufficiently match the data, so the results are predictions that are too simple. Overfitting is a problem because the model memorizes past quirks of the market, so it struggles to identify genuine stock market trends and cannot adapt to future events. Underfitting, at the other extreme, is like a model that predicts only the simple average price based on the stock's entire history. Both overfitting and underfitting lead to poor forecasts and predictions.
We have barely scratched the surface when discussing the link between machine learning concepts
and stock market investments. However, it is important to understand the basic concepts we have
discussed today as they serve as a basis for comprehending how machine learning is used to predict
what the stock market can do. There are more concepts that can be learned by those who want to get
to the nitty-gritty of data science and how it relates to the stock market.
Unit-IV
Frequent itemset mining, a precursor to association rule mining, typically requires significant processing power since the process involves multiple passes through a database, and this can be a challenge for large streaming datasets. Although there has been a great deal of progress in finding frequent itemsets and association rules in static or permanent databases, doing the same over data streams remains difficult.
Market Basket Analysis is a technique which identifies the strength of association between pairs of
products purchased together and identify patterns of co-occurrence. A co-occurrence is when two or
more things take place together.
Market Basket Analysis creates If-Then scenario rules, for example, if item A is purchased then
item B is likely to be purchased. The rules are probabilistic in nature or, in other words, they are
derived from the frequencies of co-occurrence in the observations. Frequency is the proportion of
baskets that contain the items of interest. The rules can be used in pricing strategies, product
placement, and various types of cross-selling strategies.
In order to make it easier to understand, think of Market Basket Analysis in terms of shopping at a
supermarket. Market Basket Analysis takes data at transaction level, which lists all items bought by
a customer in a single purchase. The technique determines relationships of what products were
purchased with which other product(s). These relationships are then used to build profiles
containing If-Then rules of the items purchased.
Association rules have the form If {A} Then {B}. The If part of the rule (the {A}) is known as the antecedent and the Then part of the rule ({B}) is known as the consequent. The antecedent is the condition and the consequent is the result. An association rule has three measures that express the degree of confidence in the rule: Support, Confidence, and Lift.
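A minimal Python sketch for computing these three measures from a list of transaction baskets is shown below; the tiny basket data and the candidate rule {milk} -> {bread} are made-up examples.

def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)              # baskets containing A
    b = sum(1 for t in transactions if consequent <= t)              # baskets containing B
    ab = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = ab / n
    confidence = ab / a if a else 0.0
    lift = confidence / (b / n) if b else 0.0
    return support, confidence, lift

baskets = [{"milk", "bread", "butter"},
           {"milk", "bread"},
           {"bread", "jam"},
           {"milk", "butter"}]
print(rule_metrics(baskets, {"milk"}, {"bread"}))   # (0.5, 0.666..., 0.888...)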
For example, you are in a supermarket to buy milk. Based on the analysis, are you more likely to
buy apples or cheese in the same transaction than somebody who did not buy milk?
When one hears Market Basket Analysis, one thinks of shopping carts and supermarket shoppers. It
is important to realize that there are many other areas in which Market Basket Analysis can be
applied. An example of Market Basket Analysis for a majority of Internet users is a list of
potentially interesting products for Amazon. Amazon informs the customer that people who bought
the item being purchased by them, also reviewed or bought another list of items. A list of
applications of Market Basket Analysis in various industries is listed below:
• Retail. In Retail, Market Basket Analysis can help determine what items are purchased
together, purchased sequentially, and purchased by season. This can assist retailers to
determine product placement and promotion optimization (for instance, combining product
incentives). Does it make sense to sell soda and chips or soda and crackers?
• Telecommunications. In Telecommunications, where high churn rates continue to be a
growing concern, Market Basket Analysis can be used to determine what services are being
utilized and what packages customers are purchasing. They can use that knowledge to direct
marketing efforts at customers who are more likely to follow the same path.
For instance, Telecommunications these days is also offering TV and Internet. Creating
bundles for purchases can be determined from an analysis of what customers purchase,
thereby giving the company an idea of how to price the bundles. This analysis might also
lead to determining the capacity requirements.
• Banks. In Financial (banking for instance), Market Basket Analysis can be used to analyze
credit card purchases of customers to build profiles for fraud detection purposes and cross-
selling opportunities.
• Insurance. In Insurance, Market Basket Analysis can be used to build profiles to detect
medical insurance claim fraud. By building profiles of claims, you are able to then use the
profiles to determine if more than 1 claim belongs to a particular claimee within a specified
period of time.
• Medical. In Healthcare or Medical, Market Basket Analysis can be used for comorbid
conditions and symptom analysis, with which a profile of illness can be better identified. It
can also be used to reveal biologically relevant associations between different genes or
between environmental effects and gene expression.
Apriori Algorithm:
Apriori is an algorithm used for Association Rule Mining. It searches for frequent sets of items in the dataset and builds on associations and correlations between the itemsets. It is the algorithm behind the “You may also like” suggestions you commonly see on recommendation platforms.
ARM (Association Rule Mining) is one of the important techniques in data science. In ARM, the frequency of patterns and associations among the item sets in the dataset is identified and then used to predict the next relevant item in the set. This ARM technique is mostly used for business decisions based on customer purchases.
Example: In Walmart, if Ashok buys Milk and Bread, the chances of him buying Butter are predicted by the Association Rule Mining technique.
• Support(A->B) = Support_count(A ∪ B) / Total number of transactions
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search in which the frequent k-itemsets are used to find the (k+1)-itemsets.
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure: Apriori assumes that all subsets of a frequent itemset must be frequent (the Apriori property), and conversely, if an itemset is infrequent, all its supersets will be infrequent.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset – called C1 (the candidate set).
(II) Compare each candidate item’s support count with the minimum support count (here min_support = 2; if the support_count of a candidate item is less than min_support then remove that item). This gives us itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, which for K=2 means no element needs to match.
• Check whether all subsets of each itemset are frequent or not, and if not frequent, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, and they are frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
(II) Compare the candidate (C2) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support then remove that itemset). This gives us itemset L2.
Step-3:
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent or not, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.
(II) Compare the candidate (C3) support counts with the minimum support count (here min_support = 2; if the support_count of a candidate itemset is less than min_support then remove that itemset). This gives us itemset L3.
Step-4:
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first 2 elements (items) should match.
• Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, whose subsets include {I1, I3, I5}, which is not frequent.) So no itemset survives in C4.
• We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence –
Confidence(A->B) = Support_count(A ∪ B) / Support_count(A)
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
So here, by taking any frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
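Below is a compact, self-contained Python sketch of the Apriori idea (level-wise candidate generation, support counting with a minimum support of 2, and rule confidence). It is meant only to mirror the steps above, not to be an efficient implementation; the small transaction database is an assumed example chosen to be consistent with the support counts used in the worked rules (for instance sup(I1) = 6, sup(I2) = 7, sup(I1 ∪ I2 ∪ I3) = 2).

from itertools import combinations

def apriori(transactions, min_support=2):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    k_sets = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while k_sets:
        # Count support of each candidate and keep those meeting min_support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in k_sets}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: build (k+1)-candidates; prune any with an infrequent k-subset.
        prev = list(level)
        k_sets = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
        k_sets = {c for c in k_sets
                  if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
    return frequent

def confidence(frequent, antecedent, consequent):
    ab = frequent[frozenset(antecedent) | frozenset(consequent)]
    return ab / frequent[frozenset(antecedent)]

db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
freq = apriori(db, min_support=2)
print(confidence(freq, {"I1", "I2"}, {"I3"}))   # 0.5, as in the worked example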
The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets; that is, it is not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which in turn must be tested and their counts accumulated. Furthermore, to detect a frequent pattern of size 100, i.e. {v1, v2, ..., v100}, it has to generate on the order of 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. The algorithm checks many candidate itemsets and scans the database repeatedly to count them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large. When the dataset itself is too large to fit in memory, several practical workarounds can be used.
1. Allocate More Memory
Some machine learning tools or libraries may be limited by a default memory configuration.
Check if you can re-configure your tool or library to allocate more memory.
A good example is Weka, where you can increase the memory as a parameter when starting the
application.
2. Work with a Smaller Sample
Take a random sample of your data, such as the first 1,000 or 100,000 rows. Use this smaller sample
to work through your problem before fitting a final model on all of your data (using progressive
data loading techniques).
3. Use a Computer with More Memory
Perhaps you can get access to a much larger computer with an order of magnitude more memory.
For example, a good option is to rent compute time on a cloud service like Amazon Web Services
that offers machines with tens of gigabytes of RAM for less than a US dollar per hour.
4. Change the Data Format
Perhaps you can speed up data loading and use less memory by using another data format. A good
example is a binary format like GRIB, NetCDF, or HDF.
There are many command line tools that you can use to transform one data format into another that
do not require the entire dataset to be loaded into memory.
Using another format may allow you to store the data in a more compact form that saves memory,
such as 2-byte integers, or 4-byte floats.
5. Stream Data or Use Progressive Loading
Perhaps you can use code or a library to stream or progressively load data as-needed into memory
for training.
This may require algorithms that can learn iteratively using optimization techniques such as
stochastic gradient descent, instead of algorithms that require all data in memory to perform matrix
operations such as some implementations of linear and logistic regression.
6. Use a Relational Database
Relational databases provide a standard way of storing and accessing very large datasets.
Internally, the data is stored on disk can be progressively loaded in batches and can be queried using
a standard query language (SQL).
Free open source database tools like MySQL or Postgres can be used and most (all?) programming
languages and many machine learning tools can connect directly to relational databases. You can
also use a lightweight approach, such as SQLite.
7. Use a Big Data Platform
That is, a platform designed for handling very large datasets, which allows you to use data transforms and machine learning algorithms on top of it.
Two good examples are Hadoop with the Mahout machine learning library and Spark with the MLlib library.
◆ Next: algorithms that find all or most frequent itemsets using at most 2 passes over the data
♦ The Frequent Items Problem (aka Heavy Hitters): given a stream of N items, find those that occur most frequently
♦ Many practical applications – search log mining, network data analysis, DBMS optimization
• Given a stream of items, the problem is simply to find those items which occur most
frequently.
• Formalized as finding all items whose frequency exceeds a specified fraction of the total
number of items.
• Variations arise when the items are given weights, and further when these weights can also
be negative.
• The problem is important both in itself and as a subroutine in more advanced computations.
• For example, It can help in routing decisions, for in-network caching etc (if items represent
packets on the Internet).
• Can help in finding popular terms if items represent queries made to an Internet search
engine.
• Mining frequent itemsets inherently builds on this problem as a basic building block.
• Algorithms for the problem have been applied by large corporations: AT&T and Google.
Solutions
Two main classes of algorithms:
• Counter-based algorithms: track a subset of items from the input and monitor their counts.
• Sketch algorithms: maintain a compact summary (sketch) from which item frequencies can be estimated.
Other solutions:
• Sampling- and quantile-based methods: based on various notions of randomly sampling items from the input and summarizing the distribution of items; these are less effective and have attracted less interest.
Frequent Algorithm
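The classic counter-based method here is the Misra–Gries "Frequent" algorithm, which keeps at most k − 1 counters and is guaranteed to retain every item that occurs more than N/k times (its counters may also include some items below that threshold, which a second pass over the data can remove). A minimal Python sketch is shown below; the example stream and the choice k = 3 are illustrative.

def misra_gries(stream, k):
    """Return candidate heavy hitters: items that may occur more than len(stream)/k times."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1          # start tracking a new item
        else:
            # No free counter: decrement every counter and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=3))   # 'a' (which occurs 5 of 9 times) survives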
Clustering Techniques
Clustering is the task of grouping data points so that points in the same group (cluster) are more similar to each other than to points in other groups. In other words, the clusters are regions where the density of similar data points is high. It is generally used for the analysis of a data set, to find insightful information among huge data sets and to draw inferences from them. Generally, the clusters are pictured as spherical, but this is not necessary, as the clusters can be of any shape.
It depends on the type of algorithm we use which decides how the clusters will be created. The
inferences that need to be drawn from the data sets also depend upon the user as there is no criterion
for good clustering.
In this method, the clusters are created based upon the density of the data points which are
represented in the data space. The regions that become dense due to the huge number of data points
residing in that region are considered as clusters.
The data points in the sparse region (the region where there are very few data points) are considered as noise or outliers. The clusters created by these methods can be of arbitrary shape. The following are examples of density-based clustering algorithms:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups data points together based on a distance metric and a criterion for a minimum number of data points. It takes two parameters – eps and minimum points. Eps indicates how close the data points should be to each other to be considered neighbors, and the minimum-points criterion must be met for a region to be considered dense.
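If scikit-learn is available, a DBSCAN run looks roughly like the sketch below; the toy points and the eps / min_samples values are arbitrary assumptions chosen only to show the two parameters described above.

from sklearn.cluster import DBSCAN
import numpy as np

# Two small dense groups plus one isolated point (which should become noise).
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],
              [4.0, 15.0]])

model = DBSCAN(eps=0.5, min_samples=2)   # eps: neighborhood radius; min_samples: density criterion
labels = model.fit_predict(X)
print(labels)                            # noise points are labelled -1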
OPTICS (Ordering Points To Identify the Clustering Structure) is similar in process to DBSCAN, but it addresses one of the drawbacks of the former algorithm, i.e. its inability to form clusters from data of varying density. It considers two more parameters, which are core distance and reachability distance. Core distance indicates whether the data point being considered is a core point, by setting a minimum value for it. Reachability distance is the maximum of the core distance and the value of the distance metric used for calculating the distance between two data points. One thing to note about reachability distance is that its value remains undefined unless the reference data point is a core point.
Hierarchical Clustering
o Single Linkage: – In single linkage the distance between the two clusters is the shortest
distance between points in those two clusters.
o Complete Linkage: – In complete linkage, the distance between the two clusters is the farthest
distance between points in those two clusters.
o Average Linkage: – In average linkage the distance between the two clusters is the average
distance of every point in the cluster with every point in another cluster.
Fuzzy Clustering
In fuzzy clustering, the assignment of the data points in any of the clusters is not decisive. Here, one
data point can belong to more than one cluster. It provides the outcome as the probability of the data
point belonging to each of the clusters. One of the algorithms used in fuzzy clustering is Fuzzy c-
means clustering.
This algorithm is similar in process to the K-Means clustering and it differs in the parameters that
are involved in the computation like fuzzifier and membership values.
Partitioning Clustering
This method is one of the most popular choices for analysts to create clusters. In partitioning
clustering, the clusters are partitioned based upon the characteristics of the data points. We need to
specify the number of clusters to be created for this clustering method. These clustering algorithms
follow an iterative process to reassign the data points between clusters based upon the distance. The
algorithms that fall into this category are as follows: –
o K-Means Clustering: – K-Means clustering is one of the most widely used algorithms. It
partitions the data points into k clusters based upon the distance metric used for the clustering. The
value of ‘k’ is to be defined by the user. The distance is calculated between the data points and the
centroids of the clusters.
The data point which is closest to the centroid of the cluster gets assigned to that cluster. After an
iteration, it computes the centroids of those clusters again and the process continues until a pre-
defined number of iterations are completed or when the centroids of the clusters do not change after
an iteration.
It is a very computationally expensive algorithm as it computes the distance of every data point with
the centroids of all the clusters at each iteration. This makes it difficult for implementing the same
for huge data sets.
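A typical scikit-learn K-Means call is sketched below; the sample points, the choice of k = 2, and the fixed random_state are illustrative assumptions only.

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # one group of points
              [8.0, 8.0], [8.3, 7.7], [7.8, 8.2]])   # a second group

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)                  # cluster index assigned to each point
print(model.cluster_centers_)  # final centroids after the iterations converge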
o PAM (Partitioning Around Medoids): – This algorithm is also called the k-medoid algorithm. It is similar in process to the K-Means clustering algorithm, with the difference being in the assignment of the center of the cluster. In PAM, the medoid of the cluster has to be an actual input data point, while this is not true for K-Means clustering, as the average of all the data points in a cluster may not itself be an input data point.
Grid-Based Clustering
In grid-based clustering, the data set is represented as a grid structure which comprises grids (also called cells). The overall approach in the algorithms of this method differs from the rest of the
algorithms.
They are more concerned with the value space surrounding the data points rather than the data
points themselves. One of the greatest advantages of these algorithms is its reduction in
computational complexity. This makes it appropriate for dealing with humongous data sets.
After partitioning the data sets into cells, it computes the density of the cells which helps in
identifying the clusters. A few algorithms based on grid-based clustering are as follows: –
o STING (Statistical Information Grid Approach): – In STING, the data set is divided
recursively in a hierarchical manner. Each cell is further sub-divided into a different number of
cells. It captures the statistical measures of the cells which helps in answering the queries in a small
amount of time.
o WaveCluster: – In this algorithm, the data space is represented in the form of wavelets. The data space composes an n-dimensional signal which helps in identifying the clusters. The parts of the signal with a lower frequency and high amplitude indicate that the data points are concentrated; these regions are identified as clusters by the algorithm. The parts of the signal where the frequency is high represent the boundaries of the clusters.
o PROCLUS:- The PROCLUS algorithm uses a top-down approach which creates clusters that
are partitions of the data sets, where each data point is assigned to only one cluster which is highly
suitable for customer segmentation and trend analysis where a partition of points is required.
Typical examples of frequent pattern–based cluster analysis include the clustering of text
documents that contain thousands of distinct keywords, and the analysis of microarray data that
contain tens of thousands of measured values or “features.”Discovering clusters in subspaces, or
subspace clustering and related clustering paradigms, is a research field where we find many
frequent pattern mining related influences. In fact, as the first algorithms for subspace clustering
were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining was
at the cradle of subspace clustering—yet, it quickly developed into an independent research field.
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Applications – basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis, among other broad applications
Main problem with non-Euclidean data: we use distance measures such as those mentioned at the beginning, so we cannot base distances on the location of points. The problem arises when we need to represent a cluster, because we cannot summarise it by a centroid. Instead we pick an actual data point of the cluster as its representative, for example the point that minimises the sum of the squares of the distances to the other points in the cluster.
Stopping criterion:
• Use criteria that do not rely directly on centroids, except the radius, which is valid also for non-Euclidean spaces.
In recent years, the management and processing of so-called data streams has become a topic of active research in several fields of computer science, such as distributed systems, database systems, and data mining. A data stream can roughly be thought of as a transient, continuously increasing sequence of time-stamped data. Here, we consider the problem of clustering parallel streams of real-valued data, that is to say, continuously evolving time series. In other words, we are interested in grouping data streams whose evolution over time is similar in a specific sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming data in an online manner, tolerating not more than a constant time delay.
Unit-V
MapReduce
MapReduce is a framework using which we can write applications to process huge amounts of data,
in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes
a set of data and converts it into another set of data, where individual elements are broken down
into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an input
and combines those data tuples into a smaller set of tuples. As the sequence of the name
MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.
The Algorithm
• Generally MapReduce paradigm is based on sending the computer to where the data resides!
• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
• Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce
stage. The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
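To make the map and reduce stages concrete, here is a small pure-Python simulation of the classic word-count job (no Hadoop involved): the map step emits (word, 1) pairs, a shuffle step groups the pairs by key as the framework would between the stages, and the reduce step sums each group. The input lines are a made-up example.

from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) key/value pair for every word in the input line."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reducer: combine all counts for one word into a single total."""
    return (word, sum(counts))

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Shuffle: group every mapped value by its key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

results = [reduce_phase(word, counts) for word, counts in groups.items()]
print(sorted(results))   # [('bear', 2), ('car', 3), ('deer', 2), ('river', 2)]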
Hadoop
Hadoop is an open-source framework that allows to store and process big data in a distributed
environment across clusters of computers using simple programming models. It is designed to scale
up from single servers to thousands of machines, each offering local computation and storage.
Hadoop is an open source, Java based framework used for storing and processing big data. The data
is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables
concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella,
Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its
nodes. The framework is managed by Apache Software Foundation and is licensed under the
Apache License 2.0.
For years, while the processing power of application servers has been increasing manifold,
databases have lagged behind due to their limited capacity and speed. However, today, as many
applications are generating big data to be processed, Hadoop plays a significant role in providing a
much-needed makeover to the database world.
The 4 Modules of Hadoop
Hadoop is made up of "modules", each of which carries out a particular task essential for a
computer system designed for big data analytics.
1. Distributed File-System
The most important two are the Distributed File System, which allows data to be stored in an easily
accessible format, across a large number of linked storage devices, and the MapReduce - which
provides the basic tools for poking around in the data.
(A "file system" is the method used by a computer to store data, so it can be found and used.
Normally this is determined by the computer's operating system, however a Hadoop system uses its
own file system which sits "above" the file system of the host computer - meaning it can be
accessed using any computer running any supported OS).
2. MapReduce
MapReduce is named after the two basic operations this module carries out - reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations, e.g. counting the number of males aged 30+ in a customer database (reduce).
3. Hadoop Common
The other module is Hadoop Common, which provides the tools (in Java) needed for the user's
computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.
4. YARN
The final module is YARN, which manages resources of the systems storing the data and running
the analysis.
Various other procedures, libraries or features have come to be considered part of the Hadoop "framework" over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.
How Hadoop Came About
Development of Hadoop began when forward-thinking software engineers realised that it was
quickly becoming useful for anybody to be able to store and analyze datasets far larger than can
practically be stored and accessed on one physical storage device (such as a hard disk).
This is partly because as physical storage devices become bigger it takes longer for the component
that reads the data from the disk (which in a hard disk, would be the "head") to move to a specified
segment. Instead, many smaller devices working in parallel are more efficient than one large one.
It was released in 2005 by the Apache Software Foundation, a non-profit organization which
produces open source software which powers much of the Internet behind the scenes. And if you're
wondering where the odd name came from, it was the name given to a toy elephant belonging to the
son of one of the original creators!
The Usage of Hadoop
The flexible nature of a Hadoop system means companies can add to or modify their data system as
their needs change, using cheap and readily-available parts from any IT vendor.
Today, it is the most widely used system for providing data storage and processing across
"commodity" hardware - relatively inexpensive, off-the-shelf systems linked together, as opposed to
expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than
half of the companies in the Fortune 500 make use of it.
Just about all of the big online names use it, and as anyone is free to alter it for their own purposes,
modifications made to the software by expert engineers at, for example, Amazon and Google, are
fed back to the development community, where they are often used to improve the "official"
product. This form of collaborative development between volunteer and commercial users is a key
feature of open source software.
In its "raw" state - using the basic modules supplied here http://hadoop.apache.org/ by Apache, it
can be very complex, even for IT professionals - which is why various commercial versions have
been developed such as Cloudera which simplify the task of installing and running a Hadoop
system, as well as offering training and support services.
So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system,
companies can expand and adjust their data analysis operations as their business expands. And the
support and enthusiasm of the open source community behind it has led to great strides towards
making big data analysis more accessible for everyone.
Pig:
Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to
write complex data transformations without knowing Java. Pig works with data from many
sources, including structured and unstructured data, and stores the results in the Hadoop
Distributed File System (HDFS).
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large
sets of data by representing them as data flows. Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators with which programmers can develop their own functions for
reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All
these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known
as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce
jobs.
• Pig Latin is a SQL-like language, and it is easy to learn Apache Pig when you are familiar
with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters,
ordering, etc. It also provides nested data types like tuples, bags, and maps that
are missing from MapReduce.
Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations like join, sort,
filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so programmers need to focus only on the semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
• UDFs − Pig provides the facility to create User Defined Functions in other programming
languages such as Java and to invoke or embed them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well
as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce
Apache Pig is a high-level data flow language, whereas MapReduce is a low-level data processing
paradigm. Operations such as joins are simple to express in Pig Latin but difficult to implement
directly in MapReduce, and a Pig script typically needs far fewer lines of code than the equivalent
MapReduce program written in Java.
Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP.
• It provides a SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
The architecture of Hive comprises a user interface (such as the Hive Web UI or the Hive command
line), a metastore that holds the schemas of tables and databases, a HiveQL process engine, an
execution engine that converts queries into MapReduce jobs, and HDFS or HBase as the underlying
data storage.
Hbase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.
HBase is a data model similar to Google's Bigtable, designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop
Distributed File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store data in HDFS either directly or through HBase. Data consumers read/access the
data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides
read and write access.
MapR
MapR was a business software company headquartered in Santa Clara, California. MapR software
provides access to a variety of data sources from a single computer cluster, including big data
workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model
database management system, and event stream processing, combining analytics in real-time with
operational applications. Its technology runs on both commodity hardware and public cloud
computing services.
Sharding
Sharding is the process of breaking up large tables into smaller chunks called shards that are spread
across multiple servers. A shard is essentially a horizontal data partition that contains a subset of
the total data set, and hence is responsible for serving a portion of the overall workload.
Business applications that rely on a monolithic RDBMS hit bottlenecks as they grow. With limited
CPU, storage capacity, and memory, query throughput and response times are bound to suffer.
When it comes to adding resources to support database operations, vertical scaling (aka scaling up)
has its own set of limits and eventually reaches a point of diminishing returns.
On the other hand, horizontally partitioning a table means more compute capacity to serve incoming
queries, and therefore you end up with faster query response times and index builds. By
continuously balancing the load and data set over additional nodes, sharding also enables usage of
additional capacity. Moreover, a network of smaller, cheaper servers may be more cost effective in
the long term than maintaining one big server.
Besides resolving scaling challenges, sharding can potentially alleviate the impact of unplanned
outages. During downtime, all the data in an unsharded database is inaccessible, which can be
disruptive or downright disastrous. When done right, sharding can ensure high availability: even if
one or two nodes hosting a few shards are down, the rest of the database is still available for
read/write operations as long as the other nodes (hosting the remaining shards) run in different
failure domains. Overall, sharding can increase total cluster storage capacity, speed up processing,
and offer higher availability at a lower cost than vertical scaling.
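To make the idea of hash-based sharding concrete, here is a minimal, illustrative sketch in R (not a production shard router); the table, key column and shard count are made up for illustration.
# Assign each row of a table to one of N shards by hashing its key.
shard_count <- 4

# A toy "hash" based on the character codes of the key (a real system would use
# a proper hash function).
toy_hash <- function(key) {
  sum(utf8ToInt(as.character(key)))
}

assign_shard <- function(key, n_shards = shard_count) {
  (toy_hash(key) %% n_shards) + 1   # shard ids 1..n_shards
}

orders <- data.frame(order_id = c("A1001", "B2002", "C3003", "D4004"),
                     amount   = c(120, 75, 310, 42))
orders$shard <- sapply(orders$order_id, assign_shard)
print(orders)

# Each shard then holds only its subset of the rows (a horizontal partition):
split(orders, orders$shard)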
NOSQL
A NoSQL database provides a mechanism for storing and retrieving data that is modeled in ways
other than those used in relational databases (RDBMS). Such databases have existed since the
late 1960s, but were not called NoSQL until a surge in popularity during the early 2000s, triggered
by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.
NoSQL databases are increasingly used in big data and real-time web applications. Hadoop enables
certain types of NoSQL distributed databases (such as HBase), which allow data to be spread across
thousands of servers with little reduction in performance. Modern non-relational and cloud
databases are reported to make up around 70 percent of data sources for analytics. It is becoming
common for companies to gain a deeper understanding of their customers by querying NoSQL data
and combining it with other sources, including unstructured data residing in Salesforce and web
transaction data in Hadoop.
NoSQL is an approach to database design that can accommodate a wide variety of data models,
including key-value, document, columnar and graph formats. NoSQL, which stands for “not only
SQL,” is an alternative to traditional relational databases in which data is placed in tables and data
schema is carefully designed before the database is built. NoSQL databases are especially useful for
working with large sets of distributed data.
The NoSQL term can be applied to some databases that predated the relational database
management system (RDBMS), but it more commonly refers to the databases built in the early
2000s for the purpose of large-scale database clustering in cloud and web applications. In these
applications, requirements for performance and scalability outweighed the need for the immediate,
rigid data consistency that the RDBMS provided to transactional enterprise applications.
NoSQL helps deal with the volume, variety, and velocity requirements of big data:
• Volume: Maintaining the ACID properties (Atomicity, Consistency, Isolation, Durability) is
expensive and not always necessary. Sometimes we can tolerate minor inconsistencies in our
results; we therefore want to be able to partition our data across multiple sites.
• Variety: One single fixed data model makes it harder to incorporate varying data.
Sometimes, when we pull from external sources, we don’t know the schema! Furthermore,
changing a schema in a relational database can be expensive.
• Velocity: Storing everything durably to disk all the time can be prohibitively expensive.
Sometimes it is okay if we have a low probability of losing data. Memory is much cheaper
now, and much faster than always going to disk.
There is no single accepted definition of NoSQL, but here are its main characteristics:
• It has quite a flexible schema, unlike the relational model. Different rows may have
different attributes or structure, and the database often has no understanding of the schema. It is
up to the applications to maintain consistency in the schema, including any denormalization
(a small sketch follows this list).
• It also is often better at handling really big data tasks. This is because NoSQL databases
follow the BASE (Basically Available, Soft state, Eventual consistency) approach instead of
ACID.
• In NoSQL, consistency is only guaranteed after some period of time when writes stop. This
means it is possible that queries will not see the latest data. This is commonly implemented
by storing data in memory and then lazily sending it to other machines.
• Finally, there is this notion known as the CAP theorem — pick 2 out of 3 things:
Consistency, Availability, and Partition tolerance. ACID databases are usually CP systems,
while BASE databases are usually AP. This distinction is blurry and often systems can be
reconfigured to change these tradeoffs.
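As a tiny illustration of the flexible-schema point above, documents in the same collection may carry different attributes; the collection and field names below are invented for illustration.
# Unlike rows of a relational table, these "documents" have different fields.
collection <- list(
  list(id = 1, name = "Alice", email = "alice@example.com"),
  list(id = 2, name = "Bob", phone = "555-0100", tags = c("vip", "beta")),
  list(id = 3, name = "Carol")          # no contact details at all
)

# The "database" enforces no schema; the application decides how to handle
# missing attributes when it reads the documents back.
emails <- sapply(collection, function(doc) if (!is.null(doc$email)) doc$email else NA)
print(emails)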
S3
By using Amazon S3 analytics Storage Class Analysis you can analyze storage access patterns to
help you decide when to transition the right data to the right storage class. This new Amazon S3
analytics feature observes data access patterns to help you determine when to transition less
frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access)
storage class.
After storage class analysis observes the infrequent access patterns of a filtered set of data over a
period of time, you can use the analysis results to help you improve your lifecycle policies. You can
configure storage class analysis to analyze all the objects in a bucket. Or, you can configure filters
to group objects together for analysis by common prefix (that is, objects that have names that begin
with a common string), by object tags, or by both prefix and tags. You'll most likely find that
filtering by object groups is the best way to benefit from storage class analysis.
You can have multiple storage class analysis filters per bucket, up to 1,000, and will receive a
separate analysis for each filter. Multiple filter configurations allow you to analyze specific groups of
objects to improve your lifecycle policies that transition objects to STANDARD_IA.
Storage class analysis provides storage usage visualizations in the Amazon S3 console that are
updated daily. You can also export this daily usage data to an S3 bucket and view it in a
spreadsheet application, or with business intelligence tools like Amazon QuickSight.
HADOOP Distributed File System
Hadoop File System was developed using a distributed file system design. It runs on commodity
hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-
cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the files
are stored across multiple machines. These files are stored in a redundant fashion to rescue the system
from possible data loss in case of failure. HDFS also makes applications available for parallel
processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status of the
cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
Given below is the architecture of the Hadoop Distributed File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the
namenode software. It is a software that can be run on commodity hardware. The system having the
namenode acts as the master server and it does the following tasks −
• Manages the file system namespace.
• It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into
one or more segments, which are stored in individual data nodes. These file segments are called
blocks. In other words, the minimum amount of data that HDFS can read or write is called a block.
The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be increased as needed
by changing the HDFS configuration.
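As a small worked example (the file size is made up for illustration), the number of blocks a file occupies is the file size divided by the block size, rounded up:
# How many 128 MB blocks does a 300 MB file occupy?
file_size_mb  <- 300
block_size_mb <- 128
n_blocks <- ceiling(file_size_mb / block_size_mb)
print(n_blocks) # 3 blocks: 128 MB + 128 MB + 44 MB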
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large amount of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and
automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.
Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from bar
and line charts that show the relationship between elements over time to pie charts that demonstrate
the components or proportions between the elements of one whole.
Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to show the relationship
between these sets and the parameters on the plot. Plots also vary: scatter and bubble plots are some
of the most widely used visualizations. When it comes to big data, analysts often use more complex
box plots that help visualize the relationship between large volumes of data.
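As a small illustration in base R (using the built-in iris data set purely for convenience), a scatter plot and a box plot can be drawn as follows:
# Scatter plot: relationship between two numeric variables.
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length",
     main = "Scatter plot of the iris data")

# Box plot: distribution of a numeric variable across groups.
boxplot(Sepal.Length ~ Species, data = iris,
        xlab = "Species", ylab = "Sepal length",
        main = "Box plot of sepal length by species")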
Maps
Maps are a popular way to visualize data in different industries. They allow you to locate elements
on relevant objects and areas, such as geographical maps, building plans, and website layouts. Among
the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links and include various
types of data on one visualization. They can be hierarchical, multidimensional, tree-like.
A matrix is one of the more advanced data visualization techniques; it helps determine the correlation
between multiple constantly updating (streaming) data sets.
Interaction Techniques
Interaction techniques essentially involve data entry and manipulation, and thus place greater
emphasis on input than output. Output is merely used to convey affordances and provide user
feedback. The use of the term input technique further reinforces the central role of input.
Interactive data visualization refers to the use of modern data analysis software that enables users to
directly manipulate and explore graphical representations of data. Data visualization uses visual
aids to help analysts efficiently and effectively understand the significance of data. Interactive data
visualization software improves upon this concept by incorporating interaction tools that facilitate
the modification of the parameters of a data visualization, enabling the user to see more detail,
create new insights, generate compelling questions, and capture the full value of the data.
Deciding what the best interactive data visualization will be for your project depends on your end
goal and the data available. Some common data visualization interactions that will help users
explore their data visualizations include:
• Brushing: Brushing is an interaction in which the mouse controls a paintbrush that directly
changes the color of a plot, either by drawing an outline around points or by using the brush
itself as a pointer. Brushing scatterplots can either be persistent, in which the new
appearance is retained once the brush has been removed, or transient, in which changes only
remain visible while the active plot is enclosed or intersected by the brush. Brushing is
typically used when multiple plots are visible and a linking mechanism exists between the
plots.
• Painting: Painting refers to the use of persistent brushing, followed by subsequent
operations such as touring to compare the groups.
• Identification: Identification, also known as label brushing or mouseover, refers to the
automatic appearance of an identifying label when the cursor hovers over a particular plot
element (a small sketch follows this list).
• Scaling: Scaling can be used to change a plot’s aspect ratio, revealing different data features.
Scaling is also commonly used to zoom in on dense regions of a scatter plot.
• Linking: Linking connects selected elements on different plots. One-to-one linking entails
the projection of data on two different plots, in which a point in one plot corresponds to
exactly one point in the other. Elements may also be categorical variables, in which all data
values corresponding to that category are highlighted in all the visible plots. Brushing an
area in one plot will brush all cases in the corresponding category on another plot.
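A minimal base-R sketch of the identification idea described above (the other techniques generally require interactive visualization software): identify() lets you click points on an existing plot to label them. This is only an illustration using the built-in iris data and must be run in an interactive R session.
# Draw a scatter plot first.
plot(iris$Sepal.Length, iris$Petal.Length,
     xlab = "Sepal length", ylab = "Petal length")
# Then click on points to label them with the species name
# (right-click or press Esc to finish); uncomment in an interactive session:
# identify(iris$Sepal.Length, iris$Petal.Length, labels = iris$Species)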
System and application in data visualization
A data visualization application lets you quickly create insightful data visualizations in minutes. It
allows users to visualize data using drag & drop, create interactive dashboards and customize them
with a few clicks. Data visualization tools allow anyone to organize and present information
intuitively, and they enable users to share data visualizations with others. People can create interactive
data visualizations to understand data, ask business questions and find answers quickly.
Introduction to R
R is a command line driven program. The user enters commands at the prompt ( > by default ) and
each command is executed one at a time.
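For example, entering an expression at the prompt and pressing Enter evaluates it immediately:
> 2 + 2
[1] 4
> x <- c(1, 2, 3)
> mean(x)
[1] 2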
There have been a number of attempts to create a more graphical interface, ranging from code
editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
data import and export in R
library("readr")
# Read tab separated values
read_tsv(file.choose())
# Read comma (",") separated values
read_csv(file.choose())
# Read semicolon (";") separated values
read_csv2(file.choose())
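readr also provides matching write_* functions for exporting data; a small sketch (the output file names are made up for illustration):
# Write comma separated values
write_csv(mtcars, "mtcars.csv")
# Write tab separated values
write_tsv(mtcars, "mtcars.tsv")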
library("xlsx")
# Write the first data set in a new workbook
write.xlsx(USArrests, file = "myworkbook.xlsx",
sheetName = "USA-ARRESTS", append = FALSE)
# Add a second data set in a new worksheet
write.xlsx(mtcars, file = "myworkbook.xlsx",
sheetName="MTCARS", append=TRUE)
Attribute and data types in R
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors.
Vectors
When you want to create a vector with more than one element, you should use the c() function,
which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3), 21.3, sin)
print(list1)
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The
array function takes a dim attribute which creates the required number of dimensions. In the
example below we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
Factors
Factors are R objects created using a vector. They store the vector along with the distinct
values of the elements in the vector as labels. The labels are always characters, irrespective of
whether the input vector is numeric, character, Boolean, etc. Factors are useful in statistical
modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
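To complete the example, the vector can be converted to a factor and its levels counted (the variable name factor_apple is just for illustration):
# Create a factor object from the vector and count its levels.
factor_apple <- factor(apple_colors)
print(factor_apple)
print(nlevels(factor_apple))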
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain
different modes of data. The first column can be numeric, the second column can be character,
and the third column can be logical. A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
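Once created, the structure and a quick summary of the data frame can be inspected with str() and summary():
# Get the structure and summary statistics of the data frame.
str(BMI)
summary(BMI)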
Descriptive Statistics in R
dat <- iris # load the iris dataset and rename it dat
Minimum and maximum
The minimum and maximum can be computed with the min() and max() functions:
min(dat$Sepal.Length)
## [1] 4.3
max(dat$Sepal.Length)
## [1] 7.9
Range
The range can then be easily computed, as you have guessed, by subtracting the minimum from the
maximum:
max(dat$Sepal.Length) - min(dat$Sepal.Length)
## [1] 3.6
Mean
The mean can be computed with the mean() function:
mean(dat$Sepal.Length)
## [1] 5.843333
Median
The median can be computed thanks to the median() function:
median(dat$Sepal.Length)
## [1] 5.8
First and third quartile
As with the median, the first and third quartiles can be computed with the quantile() function
by setting the second argument to 0.25 or 0.75:
quantile(dat$Sepal.Length, 0.25) # first quartile
## 25%
## 5.1
quantile(dat$Sepal.Length, 0.75) # third quartile
## 75%
## 6.4
You may have seen that the results above are slightly different from the results you would have
found if you computed the first and third quartiles by hand. This is normal; there are many methods
to compute them (R actually has 7 methods to compute the quantiles!). However, the methods
presented here and in the article “descriptive statistics by hand” are the easiest and most “standard”
ones. Furthermore, the results do not change dramatically between the two methods.
Other quantiles
As you have guessed, any quantile can also be computed with the quantile() function. For
instance, the 4th decile or the 98th percentile:
quantile(dat$Sepal.Length, 0.4) # 4th decile
## 40%
## 5.6
quantile(dat$Sepal.Length, 0.98) # 98th percentile
## 98%
## 7.7
Interquartile range
The interquartile range (i.e., the difference between the first and third quartile) can be computed
with the IQR() function:
IQR(dat$Sepal.Length)
## [1] 1.3
or alternatively with the quantile() function again:
quantile(dat$Sepal.Length, 0.75) - quantile(dat$Sepal.Length,
0.25)
## 75%
## 1.3
As mentioned earlier, when possible it is usually recommended to use the shortest piece of code to
arrive at the result. For this reason, the IQR() function is preferred to compute the interquartile
range.
Standard deviation and variance
The standard deviation and the variance are computed with the sd() and var() functions:
sd(dat$Sepal.Length) # standard deviation
## [1] 0.8280661
var(dat$Sepal.Length) # variance
## [1] 0.6856935
Note that the standard deviation and the variance are different depending on whether we compute
them for a sample or a population (see the difference between sample and population). In R, the
standard deviation and the variance are computed as if the data represent a sample (so the
denominator is n − 1, where n is the number of observations). To my knowledge, there is no function
by default in R that computes the standard deviation or variance for a population.
Tip: to compute the standard deviation (or variance) of multiple variables at the same time, use
lapply() with the appropriate statistics as second argument:
lapply(dat[, 1:4], sd)
## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377
There are two problems that we can spot immediately: the last column is ‘factor’ rather than
‘numeric’, as we would like, and the first column, ‘Country name’, is encoded differently from the
raw dataset.
str(df.raw2)
What about the last scenario?
df.raw <- read.csv(file = 'Pisa scores 2013 - 2015 Data.csv',
                   fileEncoding = "UTF-8-BOM", na.strings = '..')
str(df.raw)
If we use the dataset above, we will not be able to draw a boxplot directly. This is because a boxplot
needs only two variables, x and y, but the cleaned data contains many more variables, so we need to
gather them into two columns. We name the result df2.
df2 = df[, c(1,3,4,6,7,9,10)] %>%   # select relevant columns
  pivot_longer(c(2,3,4,5,6,7), names_to = 'Score')
view(df2)
# The ggplot() call below is a reconstruction: the aesthetic mapping is assumed,
# plotting the pivoted 'Score' column against the 'value' column created by
# pivot_longer().
ggplot(df2, aes(x = Score, y = value, color = Score)) +
  geom_boxplot() +
  scale_color_brewer(palette = "Dark2") +
  geom_jitter(shape = 16, position = position_jitter(0.2)) +
  labs(title = 'Did males perform better than females?',
       y = 'Scores', x = 'Test Type')
Correlation Plot
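As a minimal illustration (using the iris data loaded earlier as dat rather than the Pisa scores), a correlation matrix can be computed with cor() and visualized as a pairwise scatter-plot matrix with pairs():
# Correlation matrix of the numeric columns, rounded to 2 decimals.
round(cor(dat[, 1:4]), 2)
# A simple pairwise plot of the same columns.
pairs(dat[, 1:4], main = "Pairwise relationships in the iris data")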
Unstructured Data
The vast majority of data that businesses deal with these days is unstructured. In fact, IDG Research
estimates that 85% of all data will be unstructured by 2025. There are huge insights to be gathered
from this data, but they’re hard to draw out.
Once you learn how to break down unstructured data and analyze it, however, you can perform
unstructured data analytics automatically, with little need for human input.
Unstructured data has no set framework or regular design. It is usually qualitative data, like images,
audio, and video, but most of it is unstructured text data: documents, social media data, emails,
open-ended surveys, etc.
Unstructured text data goes beyond just numerical values and facts, into thoughts, opinions, and
emotions. It can be analyzed to provide both quantitative and qualitative results: follow market
trends, monitor brand reputation, understand the voice of the customer (VoC), and more.