Data Analytics
Big data is a term used to describe data of great variety, huge volume, and high velocity. Apart from its significant volume, big data is also so complex that no conventional data management tool can effectively store or process it. The data can be structured or unstructured. Examples include:
Health records
Transactional data
Web searches
Financial documents
Weather information.
Big data can be generated by users (emails, images, transactional data, etc.) or by machines (IoT devices, ML algorithms, etc.). Depending on the owner, the data can be made commercially available to the public through an API or FTP; in some instances, a subscription may be required before you are granted access to it.
Big data is a term used to describe large volumes of data that are hard to manage. Due to its large
size and complexity, traditional data management tools cannot store or process it efficiently.
There are three types of big data:
Structured
Unstructured
Semi-structured
1. Structured:
Structured data is any data that can be stored, accessed, and processed in a fixed format. Although recent advancements in computer science have made it easier to process such data, experts agree that issues can still arise when the data grows to a huge extent.
2. Unstructured:
Unstructured data is data whose form and structure are undefined. In addition to being large,
unstructured data also poses multiple challenges in terms of processing. Large organizations
have data sources containing a combination of text, video, and image files. Despite having such
an abundance of data, they still struggle to derive value from it due to its intricate format.
3. Semi-structured:
Semi-structured data contains both structured and unstructured elements. At its essence, we can view semi-structured data in a structured form, but the structure is not rigidly defined, as in an XML file (see the sketch below).
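As a small illustration (the file contents and field names below are invented for this sketch), the following Python snippet parses a short XML fragment, showing how semi-structured data carries structure through tags and nesting but does not force every record into an identical, fixed schema:

```python
# Parsing a small, hypothetical XML fragment: the tags give structure, but
# records need not all carry the same fields (typical of semi-structured data).
import xml.etree.ElementTree as ET

xml_text = """
<customers>
  <customer id="1">
    <name>Asha</name>
    <email>asha@example.com</email>
  </customer>
  <customer id="2">
    <name>Ravi</name>
    <phone>+91-99999-00000</phone>   <!-- extra, optional field -->
  </customer>
</customers>
"""

root = ET.fromstring(xml_text)
for customer in root.findall("customer"):
    # Fields may or may not be present for a given record.
    name = customer.findtext("name")
    email = customer.findtext("email", default="(missing)")
    print(customer.get("id"), name, email)
```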
IBM describes the phenomenon of big data through the four V's:
Volume
Velocity
Variety
Veracity
Volume:- Because data are collected electronically, we are able to collect more of it. To be
useful, these data must be stored, and this storage has led to vast quantities of data. Many
companies now store in excess of 100 terabytes of data (a terabyte is 1,024 gigabytes).
Velocity:- Real-time capture and analysis of data present unique challenges both in how data
are stored and the speed with which those data can be analyzed for decision making.
Ex:- The New York Stock Exchange collects 1 terabyte of data in a single trading session,
and having current data and real-time rules for trades and predictive modeling are important
for managing stock portfolios.
Variety:- In addition to the sheer volume and speed with which companies now collect data,
more complicated types of data are now available and are proving to be of great value to
businesses.
Text data are collected by monitoring what is being said about a company’s products or
services on social media platforms such as Twitter.
Audio data are collected from service calls (on a service call, you will often hear “this
call may be monitored for quality control”).
Video data collected by in-store video cameras are used to analyze shopping behavior.
Analyzing information generated by these nontraditional sources is more complicated in
part because of the processing required to transform the data into a numerical form that
can be analyzed.
Veracity:- Veracity has to do with how much uncertainty there is in the data. For example, the
data could have many missing values, which makes reliable analysis a challenge.
Inconsistencies in units of measure and the lack of reliability of responses in terms of bias
also increase the complexity of the data. This has led to new technologies such as Hadoop and
MapReduce.
The constant stream of information from various sources is becoming more intense,
especially with advances in technology. This is where big data platforms come in: they store
and analyze the ever-increasing mass of information.
A big data platform is an integrated computing solution that combines numerous software
systems, tools, and hardware for big data management. It is a one-stop architecture that solves all
the data needs of a business regardless of the volume and size of the data at hand. Due to their
efficiency in data management, enterprises are increasingly adopting big data platforms to gather
tons of data and convert them into structured, actionable business insights.
Currently, the marketplace is flooded with numerous open-source and commercially available
big data platforms. They boast different features and capabilities for use in a big data
environment. In general, a good big data platform should have the following characteristics:
a) A big data platform should be able to accommodate new platforms and tools based on
business requirements, because business needs can change with new technologies or
changes in business processes.
b) It should support linear scale-out.
c) It should have the capability for rapid deployment.
d) It should support a variety of data formats.
e) The platform should provide data analysis and reporting tools.
f) It should provide real-time data analysis software.
g) It should have tools for searching through large data sets.
Some widely used big data platforms are described below.
a) Hadoop:
Hadoop is an open-source, Java-based programming framework and server software that is used
to store and analyze data with the help of hundreds or even thousands of commodity servers in a
clustered environment. Hadoop is designed to store and process large datasets extremely fast and
in a fault-tolerant way.
Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of commodity
computers. If any server goes down, HDFS knows how to replicate the data, so there is no loss of
data even in the event of hardware failure. A word-count sketch in Hadoop's map/reduce style follows.
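To make the map/reduce idea concrete, here is a minimal Python sketch of the classic word-count pattern. On a real cluster the mapper and reducer would typically run as separate Hadoop Streaming tasks reading from standard input; here the shuffle-and-sort step is simulated in a single process, and the sample lines are made up:

```python
# A sketch of the MapReduce word-count pattern that Hadoop popularized.
# On a real cluster these two functions would run as separate tasks and the
# framework would handle the shuffle/sort; here it is simulated in-process.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) pairs for every word in every line."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum counts per word (input must be sorted by key)."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big storage", "hadoop stores big data"]
    shuffled = sorted(mapper(sample), key=itemgetter(0))  # simulated shuffle & sort
    for word, count in reducer(shuffled):
        print(word, count)
```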
b) Cloudera:
Cloudera is one of the first commercial Hadoop-based big data analytics platforms offering a
complete big data solution.
Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera Data
Science & Engineering, and Cloudera Essentials.
All these products are based on Apache Hadoop and provide real-time processing and
analytics of massive data sets.
c) Amazon Web Services:
Amazon offers a Hadoop environment in the cloud as part of its Amazon Web Services (AWS)
package.
The AWS Hadoop solution is a hosted solution that runs on Amazon's Elastic Compute Cloud
(EC2) and Simple Storage Service (S3).
Enterprises can use AWS to run their big data processing and analytics in the cloud
environment.
Amazon EMR allows companies to set up and easily scale Apache Hadoop, Spark, HBase,
Presto, Hive, and other big data frameworks using its cloud hosting environment.
d) Hortonworks:
Hortonworks uses 100% open-source software without any proprietary components, and it was
the first to integrate support for Apache HCatalog.
Hortonworks is a big data company based in California that develops and supports applications
for Apache Hadoop. The Hortonworks Hadoop distribution is 100% open source and
enterprise-ready, with the following features:
Centralized management and configuration of clusters
Security and data governance built into the system
Centralized security administration across the system
e) MapR:
MapR is another big data platform, one which uses the Unix file system for handling data.
It does not use HDFS, and the system is easy to learn for anyone familiar with Unix.
The solution integrates Hadoop, Spark, and Apache Drill with real-time data processing
features.
Web Data and use cases of Web Data:
Web data is an incredibly broad term. It encompasses a wide range of information which is
collected from websites and apps about different users’ browsing habits, online behaviors and
preferences. It can also include information about the consumer themselves, such as their details,
search and purchase intent or online interests. Examples of web data include online product
reviews, social media posts, website traffic statistics, and search engine results.
Use Cases
E-commerce Price Monitoring
One of the main use cases of web data is e-commerce price monitoring. With the vast amount of
products and prices available online, businesses can leverage web data to track and monitor the
prices of their competitors’ products. By collecting data from various e-commerce websites,
businesses can gain insights into market trends, identify pricing strategies, and adjust their own
pricing accordingly. This use case helps businesses stay competitive and make informed pricing
decisions.
Sentiment Analysis and Brand Monitoring
Web data is also widely used for sentiment analysis and brand monitoring. By analyzing data
from social media platforms, review websites, and online forums, businesses can gain valuable
insights into customer opinions, feedback, and sentiments towards their brand or products. This
use case allows businesses to understand customer preferences, identify areas for improvement,
and manage their brand reputation effectively.
Market Research and Trend Analysis
Web data is a valuable resource for market research and trend analysis. By collecting data from
various sources such as news websites, blogs, and industry forums, businesses can gather
information about market trends, consumer behaviour, and emerging technologies. This use case
helps businesses make data-driven decisions, identify new market opportunities, and stay ahead
of their competitors.
These are just a few examples of the main use cases of web data. The versatility and abundance
of web data make it a valuable asset for businesses across various industries.
Main Attributes of Web Data
Web data refers to the vast amount of information available on the internet, encompassing
various attributes that can be associated with it. Some possible attributes of web data include the
source or website from which the data originates, the date and time of data retrieval, the format
in which the data is presented (such as HTML, XML, JSON), the structure of the data (such as
tables, lists, or graphs), the content or topic of the data (ranging from news articles and social
media posts to scientific research papers and e-commerce product listings), and the metadata
associated with the data (such as author, title, keywords, and tags). Additionally, web data can
have attributes related to its accessibility, quality, reliability, and licensing.
4. Technical challenges:
Quality of data:
Collecting and storing a large amount of data comes at a cost. Big companies, business
leaders, and IT leaders always want large data storage.
For better results and conclusions, big data focuses on storing quality data rather than
irrelevant data.
This raises further questions: how can it be ensured that the data is relevant, how much
data would be enough for decision making, and whether the stored data is accurate or not.
Fault tolerance:
Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard
and involves intricate algorithms.
Newer technologies such as cloud computing and big data are designed so that whenever a
failure occurs, the damage done stays within an acceptable threshold, and the whole task
does not have to begin again from scratch.
Scalability:
Big data projects can grow and evolve rapidly. The scalability issue of big data has
led towards cloud computing.
It raises challenges such as how to run and execute various jobs so that the goal of
each workload can be achieved cost-effectively.
It also requires dealing with system failures in an efficient manner. This again raises
the question of which kinds of storage devices are to be used.
Modern tools of Big Data:
1. Apache Hadoop:
The Apache Hadoop software library is a big data framework. It enables massive data sets to
be processed across clusters of computers in a distributed manner. It is one of the most powerful
big data technologies, with the ability to grow from a single server to thousands of machines.
Features
• Authentication is improved when utilizing an HTTP proxy server.
• Specification for the Hadoop Compatible File system effort. Extended attributes for POSIX-style
file systems are supported.
• It offers a robust ecosystem of big data technologies and tools that is well suited to meet the
analytical needs of developers.
• It brings flexibility in data processing and allows for faster data processing.
2. HPCC:
HPCC is a big data tool developed by LexisNexis Risk Solutions. It delivers a single platform,
a single architecture, and a single programming language for data processing.
Features
• It is a highly efficient big data tool that accomplishes big data tasks with far less code.
• It offers high redundancy and availability.
• It can be used for complex data processing on a Thor cluster. A graphical IDE simplifies
development, testing, and debugging. It automatically optimizes code for parallel processing.
• It provides enhanced scalability and performance. ECL code compiles into optimized C++, and it
can also be extended using C++ libraries.
3. Apache STORM:
Storm is a free, open-source big data computation system. It offers a distributed, real-time,
fault-tolerant processing system with real-time computation capabilities.
Features
• It is benchmarked as processing one million 100-byte messages per second per node.
• It uses parallel calculations that run across a cluster of machines.
• It will automatically restart in case a node dies; the worker is restarted on another node.
Storm guarantees that each unit of data will be processed at least once or exactly once.
• Once deployed, Storm is one of the easiest tools for big data analysis.
4. Qubole:
Qubole is an autonomous big data management platform. It is an open-source big data tool that is
self-managed and self-optimizing, and it allows the data team to focus on business outcomes.
Features
• Single Platform for every use case
• It is open-source big data software with engines optimized for the cloud.
• Comprehensive Security, Governance, and Compliance
• Provides actionable Alerts, Insights, and Recommendations to optimize reliability,
performance, and costs.
• Automatically enacts policies to avoid performing repetitive manual actions.
5. Apache Cassandra:
The Apache Cassandra database is widely used today to provide effective management of
large amounts of data.
Features
• Support for replicating across multiple data centers, providing lower latency for users.
• Data is automatically replicated to multiple nodes for fault tolerance.
• It is most suitable for applications that cannot afford to lose data, even when an entire data
center is down.
• Support contracts and services for Cassandra are available from third parties.
6. CouchDB:
CouchDB stores data in JSON documents that can be accessed via the web or queried using
JavaScript. It offers distributed scaling with fault-tolerant storage, and it replicates data between
instances using the Couch Replication Protocol.
Features
• CouchDB is a single-node database that works like any other database.
• It allows running a single logical database server on any number of servers.
• It makes use of the ubiquitous HTTP protocol and JSON data format, supports easy replication of
a database across multiple server instances, and provides an easy interface for document insertion,
update, retrieval, and deletion.
• The JSON-based document format is translatable across different languages.
7. Apache Flink:
Apache Flink is one of the best open-source data analytics tools for stream processing of big
data. It supports distributed, high-performing, always-available, and accurate data streaming
applications.
Features:
• Provides results that are accurate, even for out-of-order or late-arriving data.
• It is stateful and fault-tolerant and can recover from failures.
• It can perform at a large scale, running on thousands of nodes.
• Has good throughput and latency characteristics.
• It supports stream processing and windowing with event-time semantics, with flexible
windowing based on time, count, sessions, or data-driven windows.
• It supports a wide range of connectors to third-party systems for data sources and sinks.
8. Cloudera:
Cloudera is a fast, easy, and highly secure modern big data platform. It allows anyone to
get any data across any environment within a single, scalable platform.
Features:
• High-performance big data analytics software
• It offers provision for multi-cloud
• Deploy and manage Cloudera Enterprise across AWS, Microsoft Azure and Google Cloud
Platform. Spin up and terminate clusters, and only pay for what is needed when it is needed.
• Developing and training data models
• Reporting, exploring, and self-servicing business intelligence
• Delivering real-time insights for monitoring and detection
• Conducting accurate model scoring and serving
Analytics vs Reporting
Analytics: Analytics is the method of examining and analyzing summarized data to make
business decisions. Questioning the data, understanding it, investigating it, and presenting it to
the end users are all part of analytics. The purpose of analytics is to draw conclusions based on data.
Reporting: Reporting is an action that includes all the needed information and data, put together
in an organized way. Identifying business events, gathering the required information, organizing,
summarizing, and presenting existing data are all part of reporting. The purpose of reporting is to
organize the data into meaningful information.
Analytics and reporting can be used to reach a number of different goals. Both of these can be
very helpful to a business if they are used correctly.
Importance of analytics vs reporting:
A business needs to understand the differences between analytics and reporting. Better data
knowledge through analytics and reporting helps businesses in decision-making and action inside
the organization. It results in higher value and performance.
Analytics is not really possible without advanced reporting, but analytics is more than just
reporting. Both tools are made for sharing important information that will help business people
make better decisions.
Sampling:
Sampling is a process of selecting a group of observations from the population in order to study
the characteristics of the data and make conclusions about the population.
Example: Covaxin (a COVID-19 vaccine) was tested on thousands of males and females before
being given to all the people of the country.
Types of Sampling:
Whether the data set for sampling is randomized or not, sampling is classified into two major
groups:
Probability Sampling
Non-Probability Sampling
Probability Sampling (Random Sampling):
In this type, data is randomly selected so that every observation in the population gets an equal
chance of being selected for sampling. Simple random and stratified sampling are sketched in code
after the list below.
Probability sampling is of 4 types:
Simple Random Sampling
Cluster Sampling
Stratified Sampling
Systematic Sampling
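A small sketch of two of these probability sampling schemes on a made-up population (the strata, group sizes, and sample size below are arbitrary choices for illustration):

```python
# Simple random sampling and stratified sampling on a toy "population".
import random

random.seed(42)
population = [{"id": i, "group": "A" if i % 3 else "B"} for i in range(1, 101)]

# Simple random sampling: every unit has an equal chance of selection.
simple_sample = random.sample(population, k=10)

# Stratified sampling: sample within each stratum (here, 'group') separately,
# proportionally to the stratum's share of the population.
def stratified_sample(units, key, total_k):
    strata = {}
    for u in units:
        strata.setdefault(u[key], []).append(u)
    picked = []
    for members in strata.values():
        k = round(total_k * len(members) / len(units))
        picked.extend(random.sample(members, k=min(k, len(members))))
    return picked

strat_sample = stratified_sample(population, key="group", total_k=10)
print(len(simple_sample), len(strat_sample))
```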
Non-Probability Sampling:
In this type, data is not randomly selected; it mainly depends on how the statistician wants to
select the data, so the results may or may not be biased with respect to the population. Unlike
probability sampling, each observation in the population does not get an equal chance of being
selected for sampling.
Non-probability sampling is of 4 types:
Convenience Sampling
Judgmental/Purposive Sampling
Snowball/Referral Sampling
Quota Sampling.
Sampling Error:
Errors which occur during the sampling process are known as sampling errors.
Or
The difference between the observed value of a sample statistic and the actual value of the
corresponding population parameter.
Mathematical formula for sampling error (for the mean): Sampling Error = x̄ − μ, where x̄ is the
observed sample mean and μ is the true population mean.
Bootstrapping:
In bootstrapping, samples are drawn with replacement (i.e. one observation can be repeated in
more than one group) and the remaining data which are not used in samples are used to test the
model.
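A minimal bootstrap sketch (the data and the number of resamples are invented for illustration): it draws samples with replacement many times and uses the resulting distribution of the sample mean to form a percentile confidence interval; the observations left out of a given resample are the ones that can be held aside for testing, as described above.

```python
# Bootstrap estimate of the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)   # hypothetical observations

boot_means = []
for _ in range(1000):
    # Sample WITH replacement; points not drawn here are "out-of-bag" and
    # could be used to test a model fitted on the resample.
    resample = rng.choice(data, size=data.size, replace=True)
    boot_means.append(resample.mean())

low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={data.mean():.2f}, 95% bootstrap CI=({low:.2f}, {high:.2f})")
```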
Statistics is a branch of Mathematics that deals with the collection, analysis, interpretation and
the presentation of the numerical data. In other words, it is defined as the collection of
quantitative data. The main purpose of Statistics is to make an accurate conclusion using a
limited sample about a greater population.
Types of Statistical Inference
There are different types of statistical inferences that are extensively used for making
conclusions. They are:
One sample hypothesis testing
Confidence Interval
Pearson Correlation
Bi-variate regression
Multi-variate regression
Chi-square statistics and contingency table
ANOVA or T-test
3. Explain web data and the use cases of web data.
4. What are conventional systems, and what are the challenges of conventional systems?
UNIT-II
DATA ANALYSIS
Regression analysis:
Regression analysis is used to determine how points or variables might be related. It
helps determine the equation for a curve or line that might capture the relationship between
two variables.
Bi-variate Regression Analysis:
Bivariate linear regression depends on two variables, and "linear" indicates that we are fitting a
straight line through the data points. In linear regression the dependent variable is denoted by "y"
and the independent variable by "x".
The line is given by the equation
y = β0 + β1x + ε
In this model y is the dependent variable and x is the independent variable; β0 and β1 are the
parameters of the linear model, and ε is the error term. A small fitting sketch follows.
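A small fitting sketch, assuming NumPy and synthetic data generated around a known line: it estimates β0 and β1 by ordinary least squares.

```python
# Fit y = b0 + b1*x by ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=50)    # true b0=2, b1=3, plus error

X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
residuals = y - (b0 + b1 * x)                      # estimates of the error term
print(f"b0 ~ {b0:.2f}, b1 ~ {b1:.2f}, residual std ~ {residuals.std():.2f}")
```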
Multivariate Analysis:
Multivariate analysis is based on the observations and analysis of more than one
statistical outcome variable at a time. There are two types of multivariate techniques namely 1)
Dependence techniques and 2) interdependence techniques.
1) Dependence techniques:
Dependence methods are used when one or more of the variables are dependent on others. In
machine learning, dependence techniques are used to build predictive models. As a simple
example, the dependent variable "weight" might be predicted by independent variables such as
"height" and "age".
2) Interdependence techniques:
These methods are used to understand the structural makeup and underlying patterns within a
dataset. In this case no variables are dependent on others, so you are not looking for causal
relationships. Rather, interdependence methods seek to give meaning to a set of variables or to
group them together in meaningful ways.
Methods for Multivariate Analysis:
The following are the Multivariate Analysis techniques
Multiple Linear Regression:
Multiple linear regression is a dependence method that looks at the relationship between
one dependent variable and more than one independent variable. This is useful as it helps you
understand which factors are likely to influence a certain outcome, allowing you to estimate
future outcomes. For example, the growth of a crop depends on rainfall, temperature, fertilizers,
and the amount of sunlight.
Multiple logistic regression:
Logistic regression analysis is used to calculate the probability of a binary event occurring. A
binary outcome is one where there are only two possible outcomes: either the event occurs (1) or
it does not (0). Based on independent variables, logistic regression can predict how likely it is
that a certain scenario will arise. In the insurance sector, for example, an analyst may need to
predict how likely it is that each potential customer will make a claim; a small sketch of this idea follows.
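A small sketch of the insurance-claim idea, assuming scikit-learn (a library choice these notes do not prescribe) and an invented data-generating rule for the 0/1 outcome:

```python
# Binary logistic regression on synthetic "insurance claim" data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 500
age = rng.uniform(18, 70, n)
prior_claims = rng.poisson(0.5, n)
# Hypothetical rule generating the binary outcome "makes a claim this year".
logit = -4 + 0.04 * age + 0.8 * prior_claims
made_claim = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([age, prior_claims])
model = LogisticRegression().fit(X, made_claim)
# Predicted probability of a claim for a 40-year-old with one prior claim.
print(model.predict_proba([[40, 1]])[0, 1])
```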
Multivariate analysis of variance (MANOVA):
It is used to measure the effect of multiple independent variables on two or more
dependent variables. With this technique, it is important to note that the independent variables are
categorical, while the dependent variables are metric in nature. For example, consider an
engineering company on a mission to build a super-fast, eco-friendly rocket. In this example the
independent variables are:
Engine type (E1, E2, or E3)
Material used for the rocket exterior
Type of fuel used to power the rocket
Bayesian Modeling:
The Bayesian technique is an approach in statistics used in data analysis and parameter
estimation. This approach is based on the Bayes theorem.
Bayesian Statistics follows a unique principle wherein it helps determine the joint
probability distribution for observed and unobserved parameters using a statistical model. The
knowledge of statistics is essential to tackle analytical problems in this scenario.
Ever since the introduction of the Bayes theorem in the 1770s by Thomas Bayes, it has
remained an indispensable tool in statistics. Bayesian models are a classic alternative to
frequentist models, and recent innovations in statistics have helped reach milestones in a wide
range of industries, including medical research, web search, and natural language processing.
For example, Alzheimer's is a disease known to pose a progressive risk as a person ages.
With the help of the Bayes theorem, doctors can estimate the probability of a person having
Alzheimer's in the future. The same applies to cancer and other age-related illnesses that a
person becomes vulnerable to in the later years of life.
Bayesian networks
Bayesian networks are a type of probabilistic graphical model that uses Bayesian inference for
probability computations. Bayesian networks aim to model conditional dependence, and therefore
causation, by representing conditional dependence by edges in a directed graph. Through these
relationships, one can efficiently conduct inference on the random variables in the graph through
the use of factors.
Probability
Before going into exactly what a Bayesian network is, it is first useful to review probability
theory.
First, remember that the joint probability distribution of random variables A_0, A_1, …, A_n,
denoted as P(A_0, A_1, …, A_n), is equal to P(A_0 | A_1, …, A_n) * P(A_1 | A_2, …, A_n) *
… * P(A_n) by the chain rule of probability. We can consider this a factorized representation of
the joint distribution. Next, recall that two random variables A and B are conditionally independent
given another random variable, C, when they satisfy the following property: P(A,B|C) =
P(A|C) * P(B|C). In other words, as long as the value of C is known and fixed, A and B are
independent. Another way of stating this, which we will use later on, is that P(A|B,C) = P(A|C).
Using the relationships specified by our Bayesian network, we can obtain a compact, factorized
representation of the joint probability distribution.
A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional
dependency, and each node corresponds to a unique random variable. Formally, if an edge (A, B)
exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the
joint probability distribution, so we must know P(B|A) for all values of B and A in order to
conduct inference. In the classic Rain/Wet Grass example, since Rain has an edge going into Wet
Grass, P(WetGrass|Rain) is a factor whose probability values are specified in a conditional
probability table attached to the Wet Grass node.
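The sketch below uses that Rain → Wet Grass fragment with made-up conditional probability values (the notes do not give the actual table) to show how the factors combine into a joint distribution and how a query is answered by summing a variable out:

```python
# Tiny Bayesian network: Rain -> WetGrass, with assumed (illustrative) CPTs.
P_rain = {True: 0.2, False: 0.8}                      # assumed prior P(Rain)
P_wet_given_rain = {                                  # assumed CPT P(WetGrass | Rain)
    True:  {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}

def joint(rain, wet):
    """P(Rain=rain, WetGrass=wet) = P(Rain) * P(WetGrass | Rain)."""
    return P_rain[rain] * P_wet_given_rain[rain][wet]

# Marginal P(WetGrass=True): sum the joint over both values of Rain.
p_wet = sum(joint(rain, True) for rain in (True, False))
# Posterior P(Rain=True | WetGrass=True) by Bayes' theorem.
p_rain_given_wet = joint(True, True) / p_wet
print(p_wet, p_rain_given_wet)
```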
A Bayesian network (BN) is used to estimate the probability that a hypothesis is true based on
evidence.
With the help of this network, we can develop a comprehensive model that delineates the
relationships between the variables and use it to answer probabilistic queries about them. We can
also use it to update our knowledge of the state of a subset of variables when other variables (the
evidence) are observed. Computing the posterior distribution of the variables given the evidence is
called probabilistic inference. This posterior provides a useful statistic for detection applications.
When one wants to select values for a subset of variables, it can minimize an expected loss
function, for instance the probability of decision error. A BN is a mechanism for applying Bayes'
theorem to complex problems.
1. Inference
Popular inference methods are:
Variable elimination, which eliminates the non-observed, non-query variables one by one by
distributing the sum over the product.
Clique tree propagation, which caches the computation so that many variables can be queried at
one time and new evidence can be propagated quickly.
Recursive conditioning, which allows a tradeoff between space and time and is equivalent to
variable elimination when sufficient space is available.
2. Parameter Learning
To specify the BN and thus represent the joint probability distribution, it is necessary to
specify, for each node X, the probability distribution of X conditional upon X's parents. The
distribution of X can take many forms; discrete or Gaussian distributions simplify calculations.
Sometimes only constraints on the distribution are known; to determine a single distribution we
can then use the principle of maximum entropy, choosing the distribution with the greatest
entropy given the constraints.
3. Structure Learning
In the simplest case, a BN is specified by an expert and is then used to perform inference. In
other applications the task of defining the network is too complex for humans, and the parameters
of the local distributions and the network structure must be learned from data.
A challenge pursued within machine learning is automatically learning the graph structure of
a BN. The basic idea goes back to a recovery algorithm developed by Rebane and Pearl (1987).
The triplets allowed in a Directed Acyclic Graph (DAG) are:
Type 1: X → Y → Z
Type 2: X ← Y → Z
Type 3: X → Y ← Z
In Types 1 and 2, X and Z are independent given Y. Since Types 1 and 2 represent the same
dependencies, they are indistinguishable from each other. Type 3 can be uniquely identified: X and
Z are marginally independent, and all other pairs are dependent. So, while the skeletons of these
three triplets are identical, the direction of the arrows is partially identifiable. The same distinction
applies when X and Z have common parents, except that one must first condition on those parents.
Algorithms therefore first determine the skeleton of the underlying graph and then orient all arrows
whose directionality is dictated by the observed conditional independencies.
Support Vector Machine (SVM):
Let us consider two independent variables x1, x2, and one dependent variable which is either a
blue circle or a red circle.
From the figure above it is clear that there are multiple lines (our hyperplane here
is a line because we are considering only two input features x1, x2) that segregate our data
points or do a classification between red and blue circles. So how do we choose the best line, or
in general the best hyperplane, that segregates our data points?
One reasonable choice for the best hyperplane is the one that represents the largest
separation, or margin, between the two classes.
(Figure: multiple hyperplanes separating the data from the two classes.)
So we choose the hyperplane whose distance from it to the nearest data point on each side is
maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Now let us consider a scenario
like the one shown below, where the two classes cannot be separated cleanly.
(Figure: the hyperplane which is the most optimized one.)
In this type of data, what SVM does is find the maximum margin as it did with the previous data
sets, and in addition it adds a penalty each time a point crosses the margin. The margins in these
cases are called soft margins. When there is a soft margin, the SVM tries to maximize the margin
while keeping the total penalty for margin violations as small as possible.
Till now, we were talking about linearly separable data (the group of blue balls and red balls are
separable by a straight line/linear line). What do we do if the data are not linearly separable?
(Figure: original 1-D dataset for classification.)
Say our data are as shown in the figure above. SVM solves this by creating a new variable using
a kernel. For a point xi on the line, we create a new variable yi as a function of its distance from
the origin o; if we plot this, we get something like what is shown below.
In this case, the new variable y is created as a function of distance from the origin. A non-
linear function that creates a new variable like this is referred to as a kernel.
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be divided
into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points of
different classes. When the data can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a hyperplane (in higher
dimensions) can entirely divide the data points into their respective classes. A hyperplane
that maximizes the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The original input data is
transformed by these kernel functions into a higher-dimensional feature space, where the
data points can be linearly separated. A linear SVM is then fitted in this transformed space,
which corresponds to a non-linear decision boundary in the original space, as illustrated in the
sketch below.
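A short sketch contrasting the two cases, assuming scikit-learn and a synthetic data set whose classes are separated by a circle rather than a line, so the linear SVM should underperform the RBF-kernel SVM:

```python
# Linear vs. non-linear (RBF kernel) SVM on circularly separated classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.5).astype(int)   # circular decision boundary

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:   ", rbf_svm.score(X, y))
```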
Kernel functions used in machine learning, including in SVMs (Support Vector Machines), have
several important characteristics, including:
o Mercer's condition: A kernel function must satisfy Mercer's condition to be valid. This
condition ensures that the kernel (Gram) matrix is positive semi-definite, which means that its
eigenvalues are always greater than or equal to zero.
o Positive definiteness: A kernel function is positive definite if it is always greater than
zero except for when the inputs are equal to each other.
o Non-negativity: A kernel function is non-negative, meaning that it produces non-
negative values for all inputs.
o Symmetry: A kernel function is symmetric, meaning that it produces the same value
regardless of the order in which the inputs are given.
o Reproducing property: A kernel function satisfies the reproducing property if it can be
used to reconstruct the input data in the feature space.
o Smoothness: A kernel function is said to be smooth if it produces a smooth
transformation of the input data into the feature space.
o Complexity: The complexity of a kernel function is an important consideration, as more
complex kernel functions may lead to over fitting and reduced generalization
performance.
In Support Vector Machines (SVMs), there are several types of kernel functions that can be used
to map the input data into a higher-dimensional feature space. The choice of kernel function
depends on the specific problem and the characteristics of the data.
Linear Kernel
A linear kernel is a type of kernel function used in machine learning, including in SVMs
(Support Vector Machines). It is the simplest and most commonly used kernel function, and it
defines the dot product between the input vectors in the original feature space.
Polynomial Kernel
The polynomial kernel computes a polynomial (of a chosen degree) of the dot product between
the input vectors, which corresponds to measuring similarity in a feature space of polynomial
combinations of the original features.
Gaussian (RBF) Kernel
The Gaussian kernel, also known as the radial basis function (RBF) kernel, is a popular kernel
function used in machine learning, particularly in SVMs (Support Vector Machines). It is a
nonlinear kernel function that maps the input data into a higher-dimensional feature space using
a Gaussian function.
Laplace Kernel
The Laplacian kernel, also known as the Laplace kernel or the exponential kernel, is a type of
kernel function used in machine learning, including in SVMs (Support Vector Machines). It is a
non-parametric kernel that can be used to measure the similarity or distance between two input
feature vectors.
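For reference, the kernels named above can be written directly as small NumPy functions; the parameter values (degree, c, gamma) and the test vectors are arbitrary illustrative choices:

```python
# Common SVM kernel functions written explicitly.
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)                              # plain dot product

def polynomial_kernel(x, y, degree=3, c=1.0):
    return (np.dot(x, y) + c) ** degree              # polynomial of the dot product

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))     # Gaussian of squared distance

def laplacian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum(np.abs(x - y)))    # exponential of L1 distance

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(x, y), polynomial_kernel(x, y),
      rbf_kernel(x, y), laplacian_kernel(x, y))
```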
Time Series Analysis:
At its core, time series analysis focuses on studying and interpreting a sequence of data points
recorded or collected at consistent time intervals. Unlike cross-sectional data, which captures a
snapshot in time, time series data is fundamentally dynamic, evolving over chronological
sequences both short and extremely long. This type of analysis is pivotal in uncovering
underlying structures within the data, such as trends, cycles, and seasonal variations.
Technically, time series analysis seeks to model the inherent structures within the data,
accounting for phenomena like autocorrelation, seasonal patterns, and trends. The order of data
points is crucial; rearranging them could lose meaningful insights or distort interpretations.
Furthermore, time series analysis often requires a substantial dataset to maintain the statistical
significance of the findings. This enables analysts to filter out 'noise,' ensuring that observed
patterns are not mere outliers but statistically significant trends or cycles.
Types of Data
When embarking on time series analysis, the first step is often understanding the type of data
you're working with. This categorization primarily falls into three distinct types: Time Series
Data, Cross-Sectional Data, and Pooled Data. Each type has unique features that guide the
subsequent analysis and modeling.
Time Series Data: Comprises observations collected at different time intervals. It's
geared towards analyzing trends, cycles, and other temporal patterns.
Cross-Sectional Data: Involves data points collected at a single moment in time. Useful
for understanding relationships or comparisons between different entities or categories at
that specific point.
Pooled Data: A combination of Time Series and Cross-Sectional data. This hybrid
enriches the dataset, allowing for more nuanced and comprehensive analyses.
Time series analysis is critical for businesses to predict future outcomes, assess past
performances, or identify underlying patterns and trends in various metrics. Time series analysis
can offer valuable insights into stock prices, sales figures, customer behavior, and other time-
dependent variables. By leveraging these techniques, businesses can make informed decisions,
optimize operations, and enhance long-term strategies.
Time series analysis offers a multitude of benefits to businesses. The applications are also wide-
ranging, whether it is forecasting sales to manage inventory better, identifying seasonality in
consumer behavior to plan marketing campaigns, or analyzing financial markets for investment
strategies. Different techniques serve distinct purposes and offer varied granularity and accuracy,
making it vital for businesses to understand the methods that best suit their specific needs.
Commonly used techniques include the following (two of them are sketched in code after the list):
Moving Average: Useful for smoothing out long-term trends. It is ideal for removing
noise and identifying the general direction in which values are moving.
Exponential Smoothing: Suited for univariate data with a systematic trend or seasonal
component. Assigns higher weight to recent observations, allowing for more dynamic
adjustments.
Decomposition: This breaks down a time series into its core components—trend,
seasonality, and residuals—to enhance the understanding and forecast accuracy.
Wavelet Analysis: Effective for analyzing non-stationary time series data. It helps in
identifying patterns across various scales or resolutions.
Intervention Analysis: Assesses the impact of external events on a time series, such as
the effect of a policy change or a marketing campaign.
Box-Jenkins ARIMA models: Focuses on using past behavior and errors to model time
series data. Assumes data can be characterized by a linear function of its past values.
Holt-Winters Exponential Smoothing: Best for data with a distinct trend and
seasonality. Incorporates weighted averages and builds upon the equations for
exponential smoothing.
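A small sketch of two of these techniques, a 3-month moving average and simple exponential smoothing, applied with pandas to a made-up monthly sales series:

```python
# Moving average and exponential smoothing on a toy monthly series.
import pandas as pd

sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.period_range("2023-01", periods=12, freq="M"),
    name="sales",
)

moving_avg = sales.rolling(window=3).mean()    # smooths noise over 3-month windows
exp_smooth = sales.ewm(alpha=0.5).mean()       # weights recent months more heavily

print(pd.DataFrame({"sales": sales, "ma_3": moving_avg, "ewm": exp_smooth}).round(1))
```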
Time series analysis is a powerful tool for data analysts that offers a variety of advantages for
both businesses and researchers. Its strengths include:
1. Data Cleansing: Time series analysis techniques such as smoothing and seasonality
adjustments help remove noise and outliers, making the data more reliable and
interpretable.
2. Understanding Data: Models like ARIMA or exponential smoothing provide insight
into the data's underlying structure. Autocorrelations and stationarity measures can help
understand the data's true nature.
3. Forecasting: One of the primary uses of time series analysis is to predict future values
based on historical data. Forecasting is invaluable for business planning, stock market
analysis, and other applications.
4. Identifying Trends and Seasonality: Time series analysis can uncover underlying
patterns, trends, and seasonality in data that might not be apparent through simple
observation.
5. Visualizations: Through time series decomposition and other techniques, it's possible to
create meaningful visualizations that clearly show trends, cycles, and irregularities in the
data.
6. Efficiency: With time series analysis, less data can sometimes be more. Focusing on
critical metrics and periods can often derive valuable insights without getting bogged
down in overly complex models or datasets.
7. Risk Assessment: Volatility and other risk factors can be modeled over time, aiding
financial and operational decision-making processes.
Linear Systems Analysis:
Linear systems analysis can also be applied in the context of data analytics, particularly in the
analysis of linear relationships between variables and in modeling the behavior of systems that
exhibit linear responses to input.
Here's how linear systems analysis concepts can be relevant in data analytics:
1. Linear Regression : Linear regression models a dependent variable as a linear function of one
or more independent variables; it is the most direct application of linear systems ideas in data
analytics.
2. Time Series Analysis : Many time series data sets can be effectively modeled using linear
systems analysis techniques. For example, autoregressive (AR), moving average (MA), and
autoregressive integrated moving average (ARIMA) models are commonly used in time series
analysis to capture linear dependencies between observations over time.
3. Principal Component Analysis (PCA) : PCA is a dimensionality reduction technique that can
be viewed as a linear transformation of the data into a new coordinate system, such that the
greatest variance lies along the first coordinate (principal component), the second greatest
variance lies along the second coordinate, and so on. PCA is based on the analysis of the
covariance matrix of the data, which is a linear systems analysis concept.
4. Kalman Filtering : Kalman filters are used in data analytics for state estimation in systems
with linear dynamics and Gaussian noise. They are commonly applied in tracking applications,
sensor fusion, and signal processing tasks where there is a need to estimate the true state of a
system based on noisy measurements (a one-dimensional sketch appears after this list).
5. Linear Discriminant Analysis (LDA) : LDA is a technique used for dimensionality reduction
and classification. It seeks to find the linear combinations of features that best separate the
classes in the data. LDA is closely related to PCA but takes into account class labels when
finding the optimal feature space transformation.
6. Sparse Linear Models : Techniques like LASSO (Least Absolute Shrinkage and Selection
Operator) and ridge regression are used in data analytics for regression tasks where the number
of predictors is large compared to the number of observations. These techniques introduce
regularization to the linear regression model, encouraging sparsity in the coefficient estimates.
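As a one-dimensional illustration of the Kalman filtering idea mentioned above (every number here, including the noise variances, is invented):

```python
# 1-D Kalman filter estimating a constant value from noisy measurements,
# assuming linear dynamics and Gaussian noise.
import numpy as np

rng = np.random.default_rng(5)
true_value = 10.0
measurements = true_value + rng.normal(0, 2.0, size=50)   # noisy sensor readings

estimate, error_var = 0.0, 1000.0     # vague initial state and its variance
process_var, meas_var = 1e-4, 4.0     # assumed process and measurement noise

for z in measurements:
    # Predict: the state is (almost) constant, so only the uncertainty grows.
    error_var += process_var
    # Update: blend prediction and measurement via the Kalman gain.
    gain = error_var / (error_var + meas_var)
    estimate += gain * (z - estimate)
    error_var *= (1 - gain)

print(f"final estimate ~ {estimate:.2f} (true value {true_value})")
```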
Rule Induction:
Rule induction is a machine-learning technique that involves the discovery of patterns or rules in
data. It aims to extract explicit if-then rules that can accurately predict or classify instances based
on their features or attributes.
Data Preparation: The input data is prepared by organizing it into a structured format, such as a
table or a matrix, where each row represents an instance or observation, and each column
represents a feature or attribute.
Rule Generation: The rule generation process involves finding patterns or associations in the
data that can be expressed as if-then rules. Various algorithms and methods can be used for rule
generation, such as decision tree algorithms (e.g., CART), association rule mining algorithms
(e.g., Apriori), and logical reasoning approaches (e.g., inductive logic programming).
Rule Evaluation: Once the rules are generated, they need to be evaluated to determine their
quality and usefulness. Evaluation metrics can include accuracy, coverage, support, confidence,
lift, and other measures depending on the specific application and domain.
Rule Selection and Pruning: Depending on the complexity of the rule set and the specific
requirements, rule selection and pruning techniques can be applied to refine the rule set. This
process involves removing redundant, irrelevant, or overlapping rules to improve interpretability
and efficiency.
Rule Application: Once a set of high-quality rules is obtained, they can be applied to new,
unseen instances for prediction or classification. Each instance is evaluated against the rules, and
the applicable rule(s) with the highest confidence or support is used to make predictions or
decisions.
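One concrete route through these steps is to fit a small decision tree and read its if-then rules back out; the sketch below assumes scikit-learn and an invented toy data set:

```python
# Rule induction via a decision tree: fit, then print the learned if-then rules.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [temperature_celsius, is_raining]
X = [[30, 0], [25, 0], [18, 1], [10, 0], [22, 1], [28, 0], [12, 1], [8, 1]]
y = [1, 1, 0, 0, 0, 1, 0, 0]   # 1 = play outside, 0 = stay in

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as nested if-then conditions, which can then be
# evaluated, pruned, and applied to new instances as described above.
print(export_text(tree, feature_names=["temperature", "is_raining"]))
```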
Rule induction has been widely used in various domains, such as data mining, machine learning,
expert systems, and decision support systems. It provides interpretable and human-readable
models, making it useful for generating understandable insights and explanations from data.
While rule induction can be effective in capturing explicit patterns and associations in the data, it
may struggle with capturing complex or non-linear relationships. Additionally, rule induction
algorithms may face challenges when dealing with large and high-dimensional datasets, as the
search space of possible rules can become exponentially large.
Neural Networks:
Neural networks can help computers make intelligent decisions with limited human assistance.
This is because they can learn and model the relationships between input and output data that are
nonlinear and complex. For instance, they can do the following tasks.
Neural networks can comprehend unstructured data and make general observations without
explicit training. For instance, they can recognize that two differently worded input sentences
have a similar meaning: a neural network would know that both sentences mean the same thing.
It would also be able to broadly recognize that Baxter Road is a place, but Baxter Smith is a
person's name.
Neural networks have several use cases across many industries, such as the following:
Computer vision
Computer vision is the ability of computers to extract information and insights from images and
videos. With neural networks, computers can distinguish and recognize images similar to
humans. Computer vision has several applications, such as the following:
Visual recognition in self-driving cars so they can recognize road signs and other road
users
Facial recognition to identify faces and recognize attributes like open eyes, glasses, and
facial hair
Image labeling to identify brand logos, clothing, safety gear, and other image details
Speech recognition
Neural networks can analyze human speech despite varying speech patterns, pitch, tone,
language, and accent. Virtual assistants like Amazon Alexa and automatic transcription software
use speech recognition to do tasks such as accurately subtitling videos and meeting recordings
for wider content reach.
Natural language processing
Natural language processing (NLP) is the ability to process natural, human-created text. Neural
networks help computers gather insights and meaning from text data and documents. NLP has
several use cases.
Recommendation engines
Neural networks can track user activity to develop personalized recommendations. They can also
analyze all user behavior and discover new products or services that interest a specific user. For
example, Curalate, a Philadelphia-based startup, helps brands convert social media posts into
sales. Brands use Curalate’s intelligent product tagging (IPT) service to automate the collection
and curation of user-generated social content. IPT uses neural networks to automatically find and
recommend products relevant to the user’s social media activity. Consumers don't have to hunt
through online catalogs to find a specific product from a social media image. Instead, they can
use Curalate’s auto product tagging to purchase the product with ease.
The human brain is the inspiration behind neural network architecture. Human brain cells, called
neurons, form a complex, highly interconnected network and send electrical signals to each other
to help humans process information. Similarly, an artificial neural network is made of artificial
neurons that work together to solve a problem. Artificial neurons are software modules, called
nodes, and artificial neural networks are software programs or algorithms that, at their core, use
computing systems to solve mathematical calculations.
Input Layer
Information from the outside world enters the artificial neural network from the input layer.
Input nodes process the data, analyze or categorize it, and pass it on to the next layer.
Hidden Layer
Hidden layers take their input from the input layer or other hidden layers. Artificial neural
networks can have a large number of hidden layers. Each hidden layer analyzes the output from
the previous layer, processes it further, and passes it on to the next layer.
Output Layer
The output layer gives the final result of all the data processing by the artificial neural network. It
can have single or multiple nodes. For instance, if we have a binary (yes/no) classification
problem, the output layer will have one output node, which will give the result as 1 or 0.
However, if we have a multi-class classification problem, the output layer might consist of more
than one output node.
Deep neural networks, or deep learning networks, have several hidden layers with millions of
artificial neurons linked together. A number, called weight, represents the connections between
one node and another. The weight is a positive number if one node excites another, or negative if
one node suppresses the other. Nodes with higher weight values have more influence on the
other nodes.
Theoretically, deep neural networks can map any input type to any output type. However, they
also need much more training as compared to other machine learning methods. They need
millions of examples of training data rather than perhaps the hundreds or thousands that a
simpler network might need.
Artificial neural networks can be categorized by how the data flows from the input node to the
output node. Below are some examples:
Feedforward neural networks process data in one direction, from the input node to the output
node. Every node in one layer is connected to every node in the next layer. A feedforward
network uses a feedback process during training to improve its predictions over time.
Backpropagation algorithm
Artificial neural networks learn continuously by using corrective feedback loops to improve their
predictive analytics. In simple terms, you can think of the data flowing from the input node to the
output node through many different paths in the neural network. Only one path is the correct one
that maps the input node to the correct output node. To find this path, the neural network uses a
feedback loop, which works as follows (a small numeric sketch follows these steps):
1. Each node makes a guess about the next node in the path.
2. It checks if the guess was correct. Nodes assign higher weight values to paths that lead to
more correct guesses and lower weight values to node paths that lead to incorrect
guesses.
3. For the next data point, the nodes make a new prediction using the higher weight paths
and then repeat Step 1.
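A minimal numeric sketch of this idea, assuming NumPy: a tiny network trained with backpropagation (gradient descent on squared error) on the XOR problem, where weights on paths that reduce the error are strengthened. The layer sizes, learning rate, and iteration count are arbitrary choices:

```python
# A 2-4-1 network learning XOR with plain backpropagation.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 1.0

for _ in range(5000):
    # Forward pass through the layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the output error back toward W1.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # typically close to [0, 1, 1, 0]
```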
Convolutional neural networks
The hidden layers in convolutional neural networks perform specific mathematical functions,
like summarizing or filtering, called convolutions. They are very useful for image classification
because they can extract relevant features from images that are useful for image recognition and
classification. The new form is easier to process without losing features that are critical for
making a good prediction. Each hidden layer extracts and processes different image features, like
edges, color, and depth.
Whenever we train our own neural networks, we need to take care of something called
the generalization of the neural network. This essentially means how good our model is at
learning from the given data and applying the learnt information elsewhere.
When training a neural network, there’s going to be some data that the neural network trains
on, and there’s going to be some data reserved for checking the performance of the neural
network. If the neural network performs well on the data which it has not trained on, we can say
it has generalized well on the given data. Let’s understand this with an example.
Suppose we are training a neural network which should tell us if a given image has a dog or not.
Let’s assume we have several pictures of dogs, each dog belonging to a certain breed, and there
are 12 total breeds within those pictures. I’m going to keep all the images of 10 breeds of dogs
for training, and the remaining images of the 2 breeds will be kept aside for now.
(Figure: dogs training/testing data split.)
Now before going to the deep learning side of things, let’s look at this from a human perspective.
Let’s consider a human being who has never seen a dog in their entire life (just for the sake of an
example). Now we will show this human the 10 breeds of dogs and tell them that these are dogs.
After this, if we show them the other 2 breeds, will they be able to tell that they are also dogs?
Well hopefully they should, 10 breeds should be enough to understand and identify the unique
features of a dog. This concept of learning from some data and correctly applying the gained
knowledge on other data is called generalization.
Coming back to deep learning, our aim is to make the neural network learn as effectively from
the given data as possible. If we successfully make the neural network understand that the other
2 breeds are also dogs, then we have trained a very general neural network, and it will perform
really well in the real world.
What is Competitive Learning?
Competitive learning is a subset of machine learning that falls under the umbrella
of unsupervised learning algorithms. In competitive learning, a network of artificial neurons
competes to "fire" or become active in response to a specific input. The "winning" neuron, which
typically is the one that best matches the given input, is then updated while the others are left
unchanged. The significance of this learning method lies in its power to automatically cluster
similar data inputs, enabling us to find patterns and groupings in data where no prior knowledge
or labels are given.
Artificial neural networks often utilize competitive learning models to classify input
without the use of labeled data. The process begins with an input vector (often a data set). This
input is then presented to a network of artificial neurons, each of which has its own set of
weights, which act like filters. Each neuron computes a score based on its weight and the input
vector, typically through a dot product operation (a way of multiplying the input information
with the filter and adding the results together).
After the computation, the neuron that has the highest score (the "winner") is updated,
usually by shifting its weights closer to the input vector. This process is often referred to as the
"Winner-Takes-All" strategy. Over time, neurons become specialized as they get updated toward
input vectors they can best match. This leads to the formation of clusters of similar data, hence
enabling the discovery of inherent patterns within the input dataset.
To illustrate how one can use competitive learning, imagine an ecommerce business
wants to segment its customer base for targeted marketing, but they have no prior labels or
segmentation. By feeding customer data (purchase history, browsing pattern, demographics, etc.)
to a competitive learning model, they could automatically find distinct clusters (like high
spenders, frequent buyers, discount lovers) and tailor marketing strategies accordingly.
For this simple illustration, let's assume we have a dataset composed of 1-dimensional input
vectors ranging from 1 to 10 and a competitive learning network with two neurons.
Step 1: Initialization
We start by initializing the weights of the two neurons to random values. Let's assume:
Neuron 1 weight: 2
Neuron 2 weight: 8
Step 2: Presenting the input vector
Now, we present an input vector to the network. Let's say our input vector is '5'.
Step 3: Finding the winner
We calculate the distance between the input vector and the weights of the two neurons. The
neuron with the weight closest to the input vector "wins." This can be calculated using any
distance metric, for example the absolute difference.
Step 4: Updating the winner
We adjust the winning neuron's weight to bring it closer to the input vector. If our learning rate
(a tuning parameter in an optimization algorithm that determines the step size at each iteration)
is 0.5, the weight update is: new weight = old weight + 0.5 × (input − old weight).
Step 5: Iteration
We repeat the process with all the other input vectors in the dataset, updating the weights after
each presentation.
Step 6: Convergence
After several iterations (also known as epochs), the neurons' weights will start to
converge to the centers of their corresponding input clusters. In this case, with 1-dimensional
data ranging from 1 to 10, we could expect one neuron to converge around the lower range (1 to
5) and the other around the higher range (6 to 10).
This process exemplifies how competitive learning works. Over time, each neuron
specializes in a different cluster of the data, enabling the system to identify and represent the
inherent groupings in the dataset.
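The whole procedure can be captured in a few lines of code. Below is a minimal Python sketch of 1-dimensional winner-takes-all competitive learning using the same setup as the example above (two neurons initialised at 2 and 8, inputs 1 to 10, learning rate 0.5); the function and variable names are illustrative rather than taken from any library.

import random

def competitive_learning(data, weights, learning_rate=0.5, epochs=20):
    """Simple 1-D winner-takes-all competitive learning."""
    data, weights = list(data), list(weights)
    for _ in range(epochs):
        random.shuffle(data)                       # present inputs in random order
        for x in data:
            # Competition: the neuron whose weight is closest to x wins
            distances = [abs(x - w) for w in weights]
            winner = distances.index(min(distances))
            # Update: move the winning weight towards the input
            weights[winner] += learning_rate * (x - weights[winner])
    return weights

# Inputs 1..10, two neurons initialised at 2 and 8, as in the example
final_weights = competitive_learning(range(1, 11), [2.0, 8.0])
print(final_weights)

After training, one weight ends up in the lower half of the range and the other in the upper half, mirroring the convergence described in Step 6.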
What Is Principal Component Analysis?
Principal component analysis, or PCA, is a dimensionality reduction method that is often used to
reduce the dimensionality of large data sets, by transforming a large set of variables into a
smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data points much more easily and quickly when there are no extraneous variables to process.
So, to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.
STEP 1: STANDARDIZATION
The aim of this step is to standardize the range of the continuous initial variables so that each one
of them contributes equally to the analysis.
More specifically, the reason why it is critical to perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will
dominate over those with small ranges (for example, a variable that ranges between 0 and 100
will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So,
transforming the data to comparable scales can prevent this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation
for each value of each variable.
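In symbols, each value x of a variable with mean μ and standard deviation σ is replaced by its standardized value z = (x − μ) / σ.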
Once the standardization is done, all the variables will be transformed to the same scale.
STEP 2: COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input data set are varying from the
mean with respect to each other, or in other words, to see if there is any relationship between
them. Because sometimes, variables are highly correlated in such a way that they contain
redundant information. So, in order to identify these correlations, we compute the covariance
matrix.
The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that
has as entries the covariances associated with all possible pairs of the initial variables. For
example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3
data matrix of this form:
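Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)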
Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main
diagonal (Top left to bottom right) we actually have the variances of each initial variable. And
since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix
are symmetric with respect to the main diagonal, which means that the upper and the lower
triangular portions are equal.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data. Before getting to the explanation of these concepts, let’s first understand what we mean by principal components.
Principal components are new variables that are constructed as linear combinations or mixtures
of the initial variables. These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within the initial variables is
squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until the information captured by each successive component tapers off, as a scree plot would show.
Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.
An important thing to realize here is that the principal components are less interpretable and
don’t have any real meaning since they are constructed as linear combinations of the initial
variables.
STEP 4: FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, we choose whether to keep all of these components or discard those of lesser significance (those with low eigenvalues), and form a matrix from the remaining eigenvectors that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components
that we decide to keep. This makes it the first step towards dimensionality reduction, because if
we choose to keep only p eigenvectors (components) out of n, the final data set will have
only p dimensions.
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENT AXES
In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed using the
eigenvectors of the covariance matrix, to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Components Analysis). This
can be done by multiplying the transpose of the original data set by the transpose of the feature
vector.
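Putting the five steps together, the whole procedure can be sketched in a few lines of Python using NumPy. The function name pca, the random example data, and the choice of two components are illustrative assumptions rather than part of any particular library's API.

import numpy as np

def pca(data, n_components):
    """Minimal PCA via the covariance matrix, following the steps described above."""
    # Step 1: standardization (zero mean, unit standard deviation per variable)
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Step 2: covariance matrix of the standardized variables
    cov = np.cov(standardized, rowvar=False)
    # Step 3: eigenvalues and eigenvectors, sorted by descending eigenvalue
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # Step 4: the feature vector keeps only the leading eigenvectors (as columns)
    feature_vector = eigenvectors[:, :n_components]
    # Step 5: recast the data along the principal component axes
    # (equivalent, up to transposition, to multiplying the transposed data
    #  by the transposed feature vector, as described in the text)
    projected = standardized @ feature_vector
    return projected, eigenvalues

# Example: reduce a random 5-variable data set to 2 principal components
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
scores, eigenvalues = pca(data, n_components=2)
print(scores.shape)   # (100, 2)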
Fuzzy Logic introduction
The word fuzzy refers to things which are not clear or are vague. Any event, process, or
function that is changing continuously cannot always be defined as either true or false, which
means that we need to define such activities in a Fuzzy manner.
In other words, we can say that fuzzy logic is not logic that is fuzzy, but logic that is used to describe fuzziness. There are numerous everyday examples with the help of which we can understand the concept of fuzzy logic. Fuzzy Logic was introduced in 1965 by Lotfi A. Zadeh in his research paper “Fuzzy Sets”. He is considered the father of Fuzzy Logic.
Introduction:
Fuzzy-Logic theory has introduced a framework whereby human knowledge can be
formalized and used by machines in a wide variety of applications, ranging from cameras to
trains. The basic ideas discussed earlier were concerned with only this aspect of the use of Fuzzy Logic-based systems; that is, the application of human experience to machine-driven applications. While there are numerous instances where such techniques are relevant, there are also applications where it is challenging for a human user to articulate the
knowledge that they hold. Such applications include driving a car or recognizing images. Machine learning techniques provide an excellent platform in such circumstances: where sets of inputs and corresponding outputs are available, they build a model that provides the transformation from the input data to the outputs using the available data.
Procedure
The objective of this exercise is, as we have explained in the introduction, to generate, from a given set of input/output combinations, a rule set that determines the mapping between the inputs and the outputs. In this discussion, we will consider a two-input, single-output system. Extending this procedure to more complex systems should be a straightforward task for the reader.
Step 1 — Divide the input and output spaces into fuzzy regions.
We start by assigning some fuzzy sets to each input and output space. Wang and Mendel
specified an odd number of evenly spaced fuzzy regions, determined by 2N+1 where N is an
integer. As we will see later on, the value of N affects the performance of our models and can result in under- or overfitting at times. N is, therefore, one of the hyperparameters that we will use to tweak this system’s performance.
Step 2 — Generate fuzzy rules from the given data pairs.
For each sample, we then assign to each input and output space the region in which it has the maximum degree of membership, which is indicated by the highlighted elements in the above table, so that it is possible to obtain a rule:
sample 1 => If x1 is b1 and x2 is s1 then y is ce => Rule 1
The next illustration shows a second example, together with the degree of membership results that
it generates.
This sample will, therefore, produce the following rule:
sample 2=> If x1 is b1 and x2 is ce then y is b1 => Rule 2
Step 3 — Assign a degree to each rule.
Step 2 is very straightforward to implement, yet it suffers from one problem: it will generate conflicting rules, that is, rules that have the same antecedent clauses but different consequent clauses. Wang and Mendel solved this issue by assigning a degree to each rule, using a product strategy such that the degree is the product of all the degree-of-membership values from both the antecedent and consequent spaces forming the rule. We retain the rule having the most significant degree, while we discard the rules having the same antecedent but a smaller degree.
If we refer to the previous example, the degree of Rule 1 will equate to:
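In symbolic form, writing μ_b1(x1), μ_s1(x2) and μ_ce(y) for the degrees of membership of the sample in the regions appearing in Rule 1, the product strategy gives:

D(Rule 1) = μ_b1(x1) · μ_s1(x2) · μ_ce(y)

The numerical value is obtained by reading off these three membership degrees for the sample, so the degree is always a number between 0 and 1.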
We notice that this procedure reduces the number of rules radically in practice.
It is also possible to fuse human knowledge with the knowledge obtained from data by introducing a human element into the rule degree. This has high applicability in practice, as human supervision can directly assess the reliability of the data, and hence of the rules generated from it. In cases where human intervention is not desirable, this factor is set to 1 for all rules. Rule 1 can hence be defined as follows:
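In symbols, using the same membership notation as above and writing m1 for the degree assigned to Rule 1 by the human expert (an illustrative symbol, equal to 1 when no supervision is used):

Rule 1: If x1 is b1 and x2 is s1 then y is ce, with degree D(Rule 1) = μ_b1(x1) · μ_s1(x2) · μ_ce(y) · m1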
We now define the centre of a fuzzy region as the point that has the smallest absolute value among all points at which the membership function for this region is equal to 1.
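To tie Steps 1 to 3 together, here is a minimal Python sketch of the rule-generation procedure under some simplifying assumptions: triangular, evenly spaced membership functions on a [0, 1] range, and a toy data set in which the output roughly follows the average of the two inputs. The helper names (make_regions, memberships, generate_rules) are illustrative and not part of any library.

import numpy as np

def make_regions(lo, hi, n):
    """Return the centres of 2n+1 evenly spaced fuzzy regions on [lo, hi]."""
    return np.linspace(lo, hi, 2 * n + 1)

def memberships(value, centres):
    """Triangular degree of membership of value in every region."""
    step = centres[1] - centres[0]
    return np.clip(1.0 - np.abs(value - centres) / step, 0.0, 1.0)

def generate_rules(samples, x1_regions, x2_regions, y_regions):
    """Generate rules from data pairs, resolving conflicts by rule degree."""
    rules = {}  # antecedent (i, j) -> (consequent k, degree)
    for x1, x2, y in samples:
        m1, m2, my = (memberships(x1, x1_regions),
                      memberships(x2, x2_regions),
                      memberships(y, y_regions))
        i, j, k = int(m1.argmax()), int(m2.argmax()), int(my.argmax())
        degree = m1[i] * m2[j] * my[k]          # product strategy (Step 3)
        if (i, j) not in rules or degree > rules[(i, j)][1]:
            rules[(i, j)] = (k, degree)         # keep the strongest conflicting rule
    return rules

# Toy data: y roughly follows the average of x1 and x2, all on [0, 1]
rng = np.random.default_rng(0)
x = rng.random((200, 2))
samples = np.column_stack([x, x.mean(axis=1)])
regions = make_regions(0.0, 1.0, n=2)           # N = 2 gives 5 regions per variable
rule_base = generate_rules(samples, regions, regions, regions)
print(len(rule_base), "rules retained after conflict resolution")

Increasing N refines the partition, but, as noted in Step 1, too large a value can overfit the available samples.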
Traditional analytics collects data from heterogeneous data sources and pulls all of the data together into a separate analytics environment for analysis, which can be an analytical server or a personal computer with more computing capability.
The heavy processing occurs in this analytic environment.
In such environments, shipping of data becomes a must, which might result in issues related to the security and confidentiality of the data.
In-Database Analytics:
Data from heterogeneous sources are collected, transformed, and loaded into the data warehouse for final analysis by decision makers.
The processing stays in the database where the data has been consolidated.
The data is presented in aggregated form for querying.
Queries from users are submitted to OLAP (online analytical processing) engines for
execution.
Such in-database architectures are tested for their query throughput rather than
transaction throughput as in traditional database environments.
More metadata is required for directing the queries, which helps reduce the time taken to answer them and hence increases the query throughput.
Moreover, the consolidated data is free from anomalies, since it is pre-processed before being loaded into the warehouse, and so it may be used directly for analysis.
Massive Parallel Processing (MPP) is the “shared-nothing” approach to parallel computing.
It is a type of computing wherein processing is carried out by many CPUs working in parallel to execute a single program. One of the most significant differences between Symmetric Multi-Processing (SMP) and Massive Parallel Processing is that with MPP, each of the many CPUs has its own memory, which helps prevent the hold-up a user may experience with SMP when all of the CPUs attempt to access the memory simultaneously.
The salient features of MPP systems are:
5. On-demand self-service
7. Resource pooling
8. Rapid elasticity
9. Measured service
1. Public Cloud:
The services and infrastructure are provided off-site over the internet.
2. Private Cloud:
The services and infrastructure are maintained on a private network and are dedicated to a single organization.
Grid Computing:
* The various processors and local storage areas do not have high-speed connections.
Hadoop:
Hadoop is an open-source framework that supports the distributed storage and processing of very large data sets across clusters of commodity hardware.
Map Reduce:
Hadoop MapReduce is a software framework for easily writing applications which process
vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of
nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
Typically, both the input and the output of the job are stored in a file-system.
The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
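The split, map, shuffle and reduce flow described above can be illustrated with the classic word-count example. The sketch below simulates the phases in plain Python rather than using the actual Hadoop API; the function names map_phase, shuffle and reduce_phase are illustrative.

from collections import defaultdict

def map_phase(split):
    """Map task: emit (word, 1) pairs for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Shuffle/sort: group the intermediate values by key, as the framework does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce task: combine the values for each key into a final count."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two independent input splits, processed separately, then shuffled and reduced
splits = ["big data needs parallel processing", "map reduce processes big data"]
mapped = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(mapped)))   # e.g. {'big': 2, 'data': 2, ...}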
HDFS:
HDFS (the Hadoop Distributed File System) is Hadoop's storage layer; it splits files into large blocks, distributes them across the nodes of a cluster, and replicates each block so that processing can continue when a node fails.
Types Of Analytics:
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
4. Prescriptive Analytics