
Study Material

on

Data Analytics (KCS 051)

B. Tech. Semester -V

DR. A.P.J. ABDUL KALAM TECHNICAL UNIVERSITY,


UTTAR PRADESH, LUCKNOW
Unit-1
Introduction to Data Analytics
Different Sources of Data for Data Analysis
Data collection is the process of acquiring, extracting, and storing voluminous amounts of data, which may be structured or unstructured, such as text, video, audio, XML files, records, or image files, for use in the later stages of data analysis.
In big data analysis, data collection is the initial step, performed before starting to analyze the patterns or useful information in the data. The data to be analyzed must be collected from different valid sources.

Data Growth over the years

The data collected at first is known as raw data. Raw data is not directly useful; cleaning out the impurities and using the data for further analysis turns it into information, and the insight obtained from that information is known as "knowledge". Knowledge can take many forms, such as business knowledge about sales of enterprise products, knowledge about disease treatment, and so on. The main goal of data collection is to collect information-rich data.

Data collection starts with questions such as what type of data is to be collected and what the source of collection is. Most collected data falls into two types: "qualitative data", which is non-numerical data such as words and sentences and mostly describes the behaviour and actions of a group, and "quantitative data", which is numerical and can be measured and analysed using scientific tools and sampling techniques.

The actual data is then further divided mainly into two types known as:

 Primary data

 Secondary data
1. Primary Data:

Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly through techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it only becomes a burden during data processing.

Few methods of collecting primary data:

a. Interview method:

The data collected during this process is through interviewing the target audience by a person called
interviewer and the person who answers the interview is known as the interviewee. Some basic
business or product related questions are asked and noted down in the form of notes, audio, or video
and this data is stored for processing. These can be both structured and unstructured like personal
interviews or formal interviews through telephone, face to face, email, etc.

b. Survey method:

The survey method is a research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video. Surveys can be conducted both online and offline, for example through website forms and email, and the answers are then stored for data analysis. Examples are online surveys and surveys through social media polls.
c. Observation method:

The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using a data collection tool and stores the observed data as text, audio, video, or another raw format. In this method, data is collected directly, by posing a few questions to the participants: for example, observing a group of customers and their behaviour towards certain products. The data obtained is then sent for processing.

d. Experimental method:

The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.

 CRD – Completely Randomized Design is a simple experimental design used in data analytics that is based on randomization and replication. It is mostly used for comparing treatments in experiments (a small ANOVA example follows this list).
 RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agricultural sector.
 LSD – Latin Square Design is an experimental design similar to CRD and RBD but arranged in rows and columns. It is an N×N arrangement with an equal number of rows and columns, in which each letter occurs exactly once in each row and column. Differences can therefore be found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin-square-style arrangement.
 FD – Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over the combinations of these factor levels.
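As a hedged illustration of how data from a completely randomized design is commonly analyzed, the short Python sketch below runs a one-way ANOVA with SciPy on three hypothetical treatment groups; the numbers are invented for illustration only.

# A minimal sketch: one-way ANOVA for a completely randomized design (CRD).
# The three treatment groups below are hypothetical measurements.
from scipy import stats

treatment_a = [20.1, 21.4, 19.8, 22.0, 20.7]
treatment_b = [23.5, 24.1, 22.8, 23.9, 24.4]
treatment_c = [19.0, 18.6, 19.9, 18.8, 19.3]

# Null hypothesis: all treatment means are equal; a small p-value
# suggests at least one treatment differs from the others.
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")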

2. Secondary data:

Secondary data is data that has already been collected and is reused for some valid purpose. This type of data has previously been recorded or derived from primary data, and it comes from two types of sources: internal sources and external sources.

Internal source:

This type of data can easily be found within the organization, for example market records, sales records, transactions, customer data, accounting resources, etc. Obtaining data from internal sources costs less time and money.

External source:

Data that cannot be found within the organization and has to be obtained through external third-party resources is external source data. The cost and time required are higher because such sources contain huge amounts of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.

Other sources:

 Sensor data: With the advancement of IoT devices, the sensors of these devices collect data that can be used for sensor data analytics to track the performance and usage of products.
 Satellite data: Satellites collect terabytes of images and data on a daily basis through their surveillance cameras, which can be mined for useful information.
 Web traffic: Thanks to fast and cheap internet access, many formats of data uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data about the keywords and queries that are searched most often.
Classification of data (structured, semi-structured,
unstructured)

Big Data involves huge volume, high velocity, and an extensible variety of data. It comes in three types: structured data, semi-structured data, and unstructured data.

1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in a SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most processed kind in application development and the simplest to manage. Example: relational data.
Examples of Structured Data
An 'Employee' table in a database is an example of structured data.

2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (which can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save that effort and space. Example: XML data.
Examples Of Semi-structured Data
Personal data stored in an XML file-
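As a hedged illustration (the person record below is entirely made up), a small XML fragment and the Python standard library show how tags give semi-structured data its partial organization:

# A minimal sketch: parsing a small, hypothetical XML record with the standard library.
import xml.etree.ElementTree as ET

xml_data = """
<person>
    <name>Asha</name>
    <age>27</age>
    <city>Lucknow</city>
</person>
"""

root = ET.fromstring(xml_data.strip())
for child in root:
    # Each tag/value pair behaves like a loosely defined field.
    print(child.tag, "=", child.text)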
3. Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. There are alternative platforms for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Examples: Word documents, PDFs, text, media logs.

Examples of Unstructured Data

Differences between Structured, Semi-structured and Unstructured data:


Characteristics of Data
The seven characteristics that define data quality are:

1. Accuracy and Precision


2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness

Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot have any
erroneous elements and must convey the correct message without being misleading. This accuracy
and precision have a component that relates to its intended use. Without understanding how the data
will be consumed, ensuring accuracy and precision could be off-target or more costly than
necessary. For example, accuracy in healthcare might be more important than in another industry
(which is to say, inaccurate data in healthcare could have more serious consequences) and,
therefore, justifiably worth higher levels of investment.

Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to a
set of options and open answers are not permitted. Any answers other than these would not be
considered valid or legitimate based on the survey’s requirement. This is the case for most data and
must be carefully considered when determining its quality. The people in each department in an
organization understand what data is valid or not to them, so the requirements must be leveraged
when evaluating data quality.

Reliability and Consistency: Many systems in today’s environments use and/or collect the same
source data. Regardless of what source collected the data or where it resides, it cannot contradict a
value residing in a different source or collected by a different system. There must be a stable and
steady mechanism that collects and stores the data without contradiction or unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.

Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data. Gaps in data collection lead to a partial view of the overall picture to be displayed. Without a complete picture of how operations are running, uninformed actions will occur. It's important to understand the complete set of requirements that constitute a comprehensive set of data to determine whether or not the requirements are being fulfilled.

Availability and Accessibility: This characteristic can be tricky at times due to legal and regulatory
constraints. Regardless of the challenge, though, individuals need the right level of access to the
data in order to perform their jobs. This presumes that the data exists and is available for access to
be granted.

Granularity and Uniqueness: The level of detail at which data is collected is important, because
confusion and inaccurate decisions can otherwise occur. Aggregated, summarized and manipulated
collections of data could offer a different meaning than the data implied at a lower level. An
appropriate level of granularity must be defined to provide sufficient uniqueness and distinctive
properties to become visible. This is a requirement for operations to function effectively.

Introduction to Big Data platform

Big Data is a collection of data that is huge in volume and yet grows exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, big data is still data, but of enormous size.

Examples Of Big Data


1. Stock Exchange: The New York Stock Exchange generates about one terabyte of new trade data per day.
2. Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, posting of comments, etc.
3. Jet Engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Characteristics Of Big Data

Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability

(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a
very crucial role in determining value out of data. Also, whether a particular data can actually be
considered as a Big Data or not, is dependent upon the volume of data. Hence, 'Volume' is one
characteristic which needs to be considered while dealing with Big Data.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most of
the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs,
audio, etc. are also being considered in the analysis applications. This variety of unstructured data
poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
What Is Data Analytics?
Data analytics is the science of analyzing raw data in order to draw conclusions from that information. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.

Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass of
information. This information can then be used to optimize processes to increase the overall
efficiency of a business or system.

The process involved in data analysis involves several different steps:

1. The first step is to determine the data requirements or how the data is grouped. Data may be
separated by age, demographic, income, or gender. Data values may be numerical or be
divided by category.

2. The second step in data analytics is the process of collecting it. This can be done through a
variety of sources such as computers, online sources, cameras, environmental sources, or
through personnel.

3. Once the data is collected, it must be organized so it can be analyzed. Organization may take
place on a spreadsheet or other form of software that can take statistical data.

4. The data is then cleaned up before analysis. This means it is scrubbed and checked to ensure
there is no duplication or error, and that it is not incomplete. This step helps correct any
errors before it goes on to a data analyst to be analyzed.
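As a hedged sketch of steps 3 and 4, the pandas snippet below organizes a small, entirely hypothetical data set and then cleans it by dropping duplicates and missing values:

# A minimal sketch of organizing (step 3) and cleaning (step 4) data with pandas.
# The column names and records are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "age": [25, 31, 31, None, 42],
    "income": [32000, 45000, 45000, 28000, None],
    "gender": ["F", "M", "M", "F", "M"],
})

clean = (
    raw.drop_duplicates()   # remove exact duplicate rows
       .dropna()            # drop rows with missing values
       .reset_index(drop=True)
)
print(clean)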

Why Data Analytics Matters

1) Data analytics is important because it helps businesses optimize their performances.


Implementing it into the business model means companies can help reduce costs by
identifying more efficient ways of doing business and by storing large amounts of data.
2) A company can also use data analytics to make better business decisions and help analyze
customer trends and satisfaction, which can lead to new—and better—products and
services.
3) Data analytics help a business optimize its performance.
The Evolution of Analytic Scalability
It goes without saying that the world of big data requires new levels of scalability. As the amount of
data organizations process continues to increase, the same old methods for handling data just won’t
work anymore. Organizations that don’t update their technologies to provide a higher level of
scalability will quite simply choke on big data. Luckily, there are multiple technologies available
that address different aspects of the process of taming big data and making use of it in analytic
processes. Some of these advances are quite new, and organizations need to keep up with the times.
Measurement of Data Size

As the decades have passed, data has moved far beyond the scale that people can handle manually. The amount of data has grown at least as fast as the computing power of the machines that process it. It may no longer be necessary to personally break a sweat and get a headache computing things by hand, but it is still very easy to cause computer and storage systems to start steaming as they struggle to process the data fed to them.
Traditional Analytic Architecture

Modern Database Architecture

MASSIVELY PARALLEL PROCESSING SYSTEMS


Massively parallel processing (MPP) database systems have been around for decades. While
individual vendor architectures may vary, MPP is the most mature, proven, and widely deployed
mechanism for storing and analyzing large amounts of data.
An MPP database spreads data out into independent pieces managed by independent storage and central processing unit (CPU) resources. Conceptually, it is like having pieces of data loaded onto multiple network-connected personal computers around a house. It removes the constraint of having one central server with only a single set of CPU and disk resources to manage the data. The data in an MPP system gets split across a variety of disks managed by a variety of CPUs spread across a number of servers.
CLOUD COMPUTING
1. Enterprises incur no infrastructure or capital costs, only operational costs. Those operational costs will be incurred on a pay-per-use basis with no contractual obligations.
2. Capacity can be scaled up or down dynamically, and immediately. This differentiates clouds from traditional hosting service providers, where there may have been limits placed on scaling.
3. The underlying hardware can be anywhere geographically. The architectural specifics are abstracted from the user. In addition, the hardware will run in multi-tenancy mode, where multiple users from multiple organizations can be accessing the exact same infrastructure simultaneously.
GRID COMPUTING
A grid configuration can help with both cost and performance. It falls into the classification of "high-performance computing." Instead of having a single high-end server (or maybe a few of them), a large number of lower-cost machines are put in place. As opposed to having one server managing its CPU and resources across jobs, jobs are parceled out individually to the different machines to be processed in parallel. Each machine may only be able to handle a fraction of the work of the original server and can potentially handle only one job at a time. In aggregate, however, the grid can handle quite a bit. Grids can therefore be a cost-effective mechanism to improve overall throughput and capacity. Grid computing also helps organizations balance workloads, prioritize jobs, and offer high availability for analytic processing.
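Only as an analogy for this idea (not a real grid framework), the Python sketch below parcels independent, hypothetical jobs out to a pool of worker processes on one machine; in an actual grid the workers would be separate low-cost machines.

# An analogy-only sketch: independent jobs handed out to parallel workers.
from multiprocessing import Pool

def run_job(job_id):
    # Hypothetical unit of work; a real job might score a model or summarize a file.
    return job_id, sum(i * i for i in range(100_000))

if __name__ == "__main__":
    with Pool(processes=4) as pool:            # four workers stand in for grid nodes
        results = pool.map(run_job, range(8))  # eight independent jobs
    print(results)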

MAPREDUCE
MapReduce is a parallel programming framework. It's neither a database nor a direct competitor to databases. This has not stopped some people from claiming it's going to replace databases and everything else under the sun. The reality is that MapReduce is complementary to existing technologies. There are a lot of tasks that can be done in a MapReduce environment that can also be done in a relational database. What it comes down to is identifying which environment is better for the problem at hand. Being able to do something with a tool or technology isn't the same as it being the best way to do something. By focusing on what MapReduce is best for, instead of what theoretically can be done with it, it is possible to maximize the benefits received.
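As a hedged, framework-free sketch of the idea (plain Python, not Hadoop's actual API), the classic word-count example shows the map, shuffle, and reduce steps:

# A minimal, framework-free sketch of the MapReduce pattern: word count.
from collections import defaultdict

documents = ["big data needs new tools", "map reduce splits big jobs"]

# Map: emit (word, 1) pairs from every input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)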

Analytic process and tools

Data Analysis Process consists of the following phases that are iterative in nature −

 Data Requirements Specification


 Data Collection
 Data Processing
 Data Cleaning
 Data Analysis
 Communication

Best Analytic Processes and Big Data Tools

Big data is the storage and analysis of large data sets. These are complex data sets which can be
both structured or unstructured. They are so large that it is not possible to work on them with
traditional analytical tools. These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring more efficiency in their
work environment.
There are many big data tools and processes being utilised by companies these days. These are used
in the processes of discovering insights and supporting decision making. The top big data tools used
these days are open source data tools, data visualization tools, sentiment tools, data extraction tools
and databases. Some of the best used big data tools are mentioned below –

1. R-Programming

R is a free open source software programming language and a software environment for statistical
computing and graphics. It is used by data miners for developing statistical software and data
analysis. It has become a highly popular tool for big data in recent years.

2. Datawrapper

It is an online data visualization tool for making interactive charts. You upload your data file in CSV, PDF, or Excel format, or paste it directly into the field, and Datawrapper then generates a visualization in the form of a bar chart, line chart, map, etc. It can be embedded into any other website as well. It is easy to use and produces visually effective charts.

3. Tableau Public

Tableau is another popular big data tool. It is simple and very intuitive to use. It communicates the
insights of the data through data visualisation. Through Tableau, an analyst can check a hypothesis
and explore the data before starting to work on it extensively.

4. Content Grabber
Content Grabber is a data extraction tool. It is suitable for people with advanced programming
skills. It is a web crawling software. Businesses can use it to extract content and save it in a
structured format. It offers editing and debugging facility among many others for analysis later.

Analysis vs reporting
"Analytics" means analysis of raw data. Typical analytics requests usually imply a one-off data investigation, whereas "reporting" means presenting data to inform decisions. Typical reporting requests usually imply repeatable access to the information, which could be monthly, weekly, daily, or even real-time.

Some of the steps involved within a data analytics exploration:

 create data hypothesis


 gather and manipulate data
 present results to the business
 re-iterate

Some of the steps involved in building a report:


 Understand business requirement

 Connect and gather the data


 Translate the technical data
 Understand the data backgrounds by different dimensions
 Find a way to display data for 100 categories and its 5 sub-categories (500+ combinations!)
 Re-work the data
 Business stakeholder gets confused
 Scope gets changed
 Repeat the steps
 More re-work
 Initial visualisation on excel
 Addressing stakeholders understanding
 Start the reporting dashboard build
 Configure the features and parameters
 More re-work
 Test the user experience
 Conform with the company style guide
 Test the reporting automation and deployment
 Liaise with technology or production team
 Set up a process for regular refresh and failure
 Document reporting process

Modern Data Analytics Tools


1. R Programming: R is the leading analytics tool in the industry and is widely used for statistics and data modeling. It can easily manipulate data and present it in different ways. It has exceeded SAS in many ways, such as capacity of data, performance, and outcome. R compiles and runs on a wide variety of platforms, viz. UNIX, Windows, and macOS. It has 11,556 packages and allows you to browse the packages by category. R also provides tools to install all packages automatically as per user requirements, and it works well with big data.
2. Tableau Public: Tableau Public is a free software that connects any data source be it corporate
Data Warehouse, Microsoft Excel or web-based data, and creates data visualizations, maps,
dashboards etc. with real-time updates presenting on web. They can also be shared through social
media or with the client. It allows the access to download the file in different formats.

3. Python: Python is an object-oriented scripting language which is easy to read, write, and maintain, and is a free open source tool. It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
Python is easy to learn as it is quite similar to JavaScript, Ruby, and PHP. Also, Python has very good machine learning libraries, viz. scikit-learn, Theano, TensorFlow, and Keras. Another important feature of Python is that it can be integrated with almost any platform, such as a SQL server, a MongoDB database, or JSON. Python can also handle text data very well.

4. SAS: SAS is a programming environment and language for data manipulation and a leader in analytics. Its development began in 1966, and it was further developed by the SAS Institute in the 1980s and 1990s. SAS is easily accessible and manageable and can analyze data from any source. SAS introduced a large set of products in 2011 for customer intelligence, along with numerous SAS modules for web, social media, and marketing analytics that are widely used for profiling customers and prospects. It can also predict their behavior and manage and optimize communications.

5. Apache Spark: Apache Spark is a fast, large-scale data processing engine that executes applications in Hadoop clusters up to 100 times faster in memory and 10 times faster on disk. Spark is built with data science in mind and its concepts make data science effortless. Spark is also popular for data pipelines and the development of machine learning models.
Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques like classification, regression, collaborative filtering, clustering, etc.
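As a hedged sketch only (it assumes a local PySpark installation and uses a tiny made-up data set), MLlib's DataFrame-based API for fitting a simple regression model looks roughly like this:

# A rough sketch of Spark MLlib usage; assumes PySpark is installed locally
# and the three-row data set is invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.0, 5.1), (2.0, 1.0, 4.9), (3.0, 4.0, 11.2)],
    ["x1", "x2", "y"],
)

features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
print(model.coefficients, model.intercept)

spark.stop()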

6. Excel: Excel is a basic, popular, and widely used analytical tool in almost all industries. Whether you are an expert in SAS, R, or Tableau, you will still need to use Excel. Excel becomes important when analytics is required on a client's internal data. It handles complex tasks by summarizing the data with pivot-table previews that help in filtering the data as per client requirements. Excel also has an advanced business analytics option that helps with modelling, with prebuilt features such as automatic relationship detection, creation of DAX measures, and time grouping.
7. RapidMiner: RapidMiner is a powerful integrated data science platform, developed by the company of the same name, that performs predictive analysis and other advanced analytics like data mining, text analytics, machine learning, and visual analytics without any programming. RapidMiner can incorporate almost any data source type, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBASE, etc. The tool is very powerful and can generate analytics based on real-life data transformation settings.
Applications of data analytics

1. Security
Data analytics applications or, more specifically, predictive analysis has also helped in dropping
crime rates in certain areas. In a few major cities like Los Angeles and Chicago, historical and
geographical data has been used to isolate specific areas where crime rates could surge. On that
basis, while arrests could not be made on a whim, police patrols could be increased. Thus, using
applications of data analytics, crime rates dropped in these areas.

2. Transportation

Data analytics can be used to revolutionize transportation. It can be used especially in areas where
you need to transport a large number of people to a specific area and require seamless
transportation. This data analytical technique was applied in the London Olympics a few years ago.

For this event, around 18 million journeys had to be made. So, the train operators and TFL were
able to use data from similar events, predict the number of people who would travel, and then
ensure that the transportation was kept smooth.

3. Risk detection

One of the first data analytics applications may have been in the discovery of fraud. Many
organizations were struggling under debt, and they wanted a solution to this problem. They already
had enough customer data in their hands, and so, they applied data analytics. They used ‘divide and
conquer’ policy with the data, analyzing recent expenditure, profiles, and any other important
information to understand any probability of a customer defaulting. Eventually, it led to lower risks
and fraud.

4. Risk Management

Risk management is an essential aspect in the world of insurance. While a person is being insured,
there is a lot of data analytics that goes on during the process. The risk involved while insuring the
person is based on several data like actuarial data and claims data, and the analysis of them helps
insurance companies to realize the risk.
5. Delivery
Several top logistic companies like DHL and FedEx are using data analysis to examine collected
data and improve their overall efficiency. Using data analytics applications, the companies were
able to find the best shipping routes, delivery time, as well as the most cost-efficient transport
means. Using GPS and accumulating data from the GPS gives them a huge advantage in data
analytics.

6. Fast internet allocation

While it might seem that allocating fast internet in every area makes a city ‘Smart’, in reality, it is
more important to engage in smart allocation. This smart allocation would mean understanding how
bandwidth is being used in specific areas and for the right cause.

7. Reasonable Expenditure

When one is building Smart cities, it becomes difficult to plan it out in the right way. Remodelling
of the landmark or making any change would incur large amounts of expenditure, which might
eventually turn out to be a waste.

8. Interaction with customers

In insurance, there should be a healthy relationship between the claims handlers and customers.
Hence, to improve their services, many insurance companies often use customer surveys to collect
data. Since insurance companies target a diverse group of people, each demographic has their own
preference when it comes to communication.

9. Planning of cities

One of the untapped disciplines where data analysis can really grow is city planning. While many city planners might be hesitant about using data analysis in their favour, avoiding it only results in poorly designed cities riddled with congestion. Using data analysis would help improve accessibility and minimize overloading in the city.

10. Healthcare

While medicine has come a long way since ancient times and is ever-improving, it remains a costly
affair. Many hospitals are struggling with the cost pressures that modern healthcare has come with,
which includes the use of sophisticated machinery, medicines, etc.

But now, with the help of data analytics applications, healthcare facilities can track the treatment of
patients and patient flow as well as how equipment are being used in hospitals.
Need of Data Analytics Life Cycle

Data analytics is important because it helps businesses optimize their performance. A company can also use data analytics to make better business decisions and to analyze customer trends and satisfaction, which can lead to new and better products and services.

Key Roles for a Data analytics project :

1. Business User :
 The business user is the one who understands the main area of the project and is also
basically benefited from the results.
 This user gives advice and consult the team working on the project about the value of
the results obtained and how the operations on the outputs are done.
 The business manager, line manager, or deep subject matter expert in the project
mains fulfills this role.

2. Project Sponsor :
 The Project Sponsor is the one responsible for initiating the project. The Project Sponsor provides the actual requirements for the project and presents the core business issue.
 He or she generally provides the funding and measures the degree of value delivered by the final output of the team working on the project.
 This person sets the main priorities and frames the desired outputs.

3. Project Manager :

 This person ensures that key milestone and purpose of the project is met on time and of
the expected quality.

4. Business Intelligence Analyst :

 The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
 This person generally creates dashboards and reports and knows about the data feeds and sources.

5. Database Administrator (DBA) :

 The DBA facilitates and arranges the database environment to support the analytics needs of the team working on the project.
6. Data Engineer :

 The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
 The data engineer works jointly with the data scientist to help shape data into the correct form for analysis.

7. Data Scientist :

 The data scientist provides subject matter expertise for analytical techniques, data modelling, and applying the correct analytical techniques to given business issues.
 He or she ensures the overall analytical objectives are met.
 Data scientists outline and apply analytical methods appropriate to the data available for the concerned project.

Data Analytics Lifecycle :


The data analytics lifecycle is designed for big data problems and data science projects. The cycle is iterative, to reflect real projects. To address the distinct requirements of performing analysis on big data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –
 The data science team learns about and investigates the problem.
 It develops context and understanding.
 It comes to know about the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation –


 Steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires the presence of an analytic sandbox, into which the team extracts, loads, and transforms data.
 Data preparation tasks are likely to be performed multiple times and not in a predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning –

 The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
 In this phase, the data science team develops data sets for training, testing, and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
 Several tools commonly used for this phase are MATLAB and STATISTICA.

Phase 4: Model Building –


 The team develops datasets for testing, training, and production purposes.
 The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
 Free or open-source tools: R and PL/R, Octave, WEKA.
 Commercial tools: MATLAB, STATISTICA.

Phase 5: Communication Results –

 After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats and assumptions.
 The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.

Phase 6: Operationalize –
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools: Octave, WEKA, SQL, MADlib.
Unit-II
Data Analysis
Data Analysis: Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for business decision-making. The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
Regression Modeling: Regression is a method to mathematically formulate the relationship between variables, which in due course can be used to estimate, interpolate, and extrapolate. Suppose we want to estimate the weight of individuals, which is influenced by height, diet, workout, etc. Here, weight is the predicted variable, while height, diet, and workout are predictor variables.
The predicted variable is a dependent variable in the sense that it depends on the predictors. Predictors are also called independent variables. Regression reveals to what extent the predicted variable is affected by the predictors; in other words, what amount of variation in the predictors will result in variation of the predicted variable. The predicted variable is mathematically represented as Y. The predictor variables are represented as X1, X2, X3, etc. This mathematical relationship is often called the regression model.

Regression models are widely used in analytics, being in general among the easiest analytics techniques to understand and interpret. Regression techniques allow the identification and estimation of possible relationships between a pattern or variable of interest and the factors that influence that pattern. For example, a company may be interested in understanding the effectiveness of its marketing strategies. It may deploy a variety of marketing activities in a given time period, perhaps TV advertising, print advertising, social media campaigns, radio advertising, and so on. A regression model can be used to understand and quantify which of its marketing activities actually drive sales, and to what extent. The advantage of regression over simple correlations is that it allows you to control for the simultaneous impact of multiple other factors that influence your variable of interest, or the "target" variable. That is, in this example, things like pricing changes or competitive activities also influence sales of the brand of interest, and the regression model allows you to account for the impacts of these factors when you estimate the true impact of, say, each type of marketing activity on sales.
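A hedged sketch of such a model with scikit-learn follows; the marketing-spend columns and sales figures are invented purely for illustration.

# A minimal sketch: multiple linear regression with scikit-learn.
# Columns: TV, print, and social media spend per period; y is sales. All values invented.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [230, 38, 69],
    [ 44, 39, 45],
    [ 17, 46, 69],
    [152, 41, 58],
    [180, 11, 58],
    [  9, 49, 75],
])
y = np.array([22.1, 10.4, 9.3, 18.5, 17.9, 7.2])

model = LinearRegression().fit(X, y)
print("coefficient per channel:", model.coef_)
print("intercept:", model.intercept_)
print("predicted sales for a new spend mix:", model.predict([[120, 30, 50]]))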

Multivariate analysis: It is a set of statistical techniques used for analysis of data that contain
more than one variable. Multivariate analysis is typically used for:

 Quality control and quality assurance


 Process optimisation and process control
 Research and development
 Consumer and market research

How multivariate methods are used

 Obtain a summary or an overview of a table. This analysis is often called Principal Components Analysis or Factor Analysis. In the overview, it is possible to identify the dominant patterns in the data, such as groups, outliers, trends, and so on. The patterns are displayed as two plots.
 Analyse groups in the table: how these groups differ and to which group individual table rows belong. This type of analysis is called Classification and Discriminant Analysis.
 Find relationships between columns in data tables, for instance relationships between process operating conditions and product quality. The objective is to use one set of variables (columns) to predict another, for the purpose of optimization, and to find out which columns are important in the relationship. The corresponding analysis is called Multiple Regression Analysis or Partial Least Squares (PLS), depending on the size of the data table.

Bayesian modeling: Bayesian analysis is a statistical paradigm that answers research questions
about unknown parameters using probability statements. For example,

 What is the probability that the average male height is between 70 and 80 inches or that the
average female height is between 60 and 70 inches?

 What is the probability that people in a particular state vote Republican or vote
Democratic?

 What is the probability that a person accused of a crime is guilty?

 What is the probability that treatment A is more cost effective than treatment B for a
specific health care provider?

 What is the probability that a patient's blood pressure decreases if he or she is prescribed
drug A?

 What is the probability that the odds ratio is between 0.3 and 0.5?

 What is the probability that three out of five quiz questions will be answered correctly by
students?

 What is the probability that children with ADHD underperform relative to other children
on a standardized test?
 What is the probability that there is a positive effect of schooling on wage?

 What is the probability that excess returns on an asset are positive?

Such probabilistic statements are natural to Bayesian analysis because of the underlying assumption
that all parameters are random quantities. In Bayesian analysis, a parameter is summarized by an
entire distribution of values instead of one fixed value as in classical frequentist analysis.
Estimating this distribution, a posterior distribution of a parameter of interest, is at the heart of
Bayesian analysis.

Posterior Distribution: A posterior distribution combines a prior distribution about a parameter with a likelihood model providing information about the parameter based on observed data. Depending on the chosen prior distribution and likelihood model, the posterior distribution is either available analytically or approximated by, for example, one of the Markov chain Monte Carlo (MCMC) methods.

Bayesian inference: It uses the posterior distribution to form various summaries for the model
parameters, including point estimates such as posterior means, medians, percentiles, and interval
estimates known as credible intervals. Moreover, all statistical tests about model parameters can be
expressed as probability statements based on the estimated posterior distribution.
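As a hedged worked example (the prior and the data are invented), the conjugate Beta-Binomial model gives the posterior in closed form, from which a posterior mean and a credible interval can be read off directly:

# A minimal sketch of Bayesian updating with a conjugate Beta-Binomial model.
# With prior Beta(a, b) and k successes in n trials, the posterior is Beta(a + k, b + n - k).
# The numbers are purely illustrative.
from scipy import stats

a, b = 2, 2     # prior belief about an unknown success probability
k, n = 7, 10    # observed data: 7 successes out of 10 trials

posterior = stats.beta(a + k, b + n - k)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))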
Bayesian Network: These are a type of probabilistic graphical model that uses Bayesian inference
for probability computations. Bayesian networks aim to model conditional dependence, and
therefore causation, by representing conditional dependence by edges in a directed graph. Through
these relationships, one can efficiently conduct inference on the random variables in the graph
through the use of factors.

Using the relationships specified by our Bayesian network, we can obtain a compact, factorized
representation of the joint probability distribution by taking advantage of conditional independence.

A Bayesian network is a directed acyclic graph in which each edge corresponds to a conditional dependency and each node corresponds to a unique random variable. Formally, if an edge (A, B) exists in the graph connecting random variables A and B, it means that P(B|A) is a factor in the joint probability distribution, so we must know P(B|A) for all values of A and B in order to conduct inference. In the classic Cloudy/Sprinkler/Rain/WetGrass example, since Rain has an edge going into WetGrass, P(WetGrass|Rain) will be a factor, whose probability values are specified next to the WetGrass node in a conditional probability table.
Bayesian networks satisfy the local Markov property, which states that a node is conditionally independent of its non-descendants given its parents. In that example, this means that P(Sprinkler|Cloudy, Rain) = P(Sprinkler|Cloudy), since Sprinkler is conditionally independent of its non-descendant, Rain, given Cloudy. This property allows us to simplify the joint distribution, obtained in the previous section using the chain rule, to a smaller form. After simplification, the joint distribution of a Bayesian network is equal to the product of P(node|parents(node)) over all nodes.

Inference

Inference over a Bayesian network can come in two forms.

The first is simply evaluating the joint probability of a particular assignment of values for each
variable (or a subset) in the network. For this, we already have a factorized form of the joint
distribution, so we simply evaluate that product using the provided conditional probabilities. If we
only care about a subset of variables, we will need to marginalize out the ones we are not interested
in. In many cases, this may result in underflow, so it is common to take the logarithm of that
product, which is equivalent to adding up the individual logarithms of each term in the product.

The second, more interesting inference task, is to find P(x|e), or, to find the probability of some
assignment of a subset of the variables (x) given assignments of other variables (our evidence, e). In
the above example, an example of this could be to find P(Sprinkler, WetGrass | Cloudy), where
{Sprinkler, WetGrass} is our x, and {Cloudy} is our e. In order to calculate this, we use the fact that
P(x|e) = P(x, e) / P(e) = αP(x, e), where α is a normalization constant that we will calculate at the
end such that P(x|e) + P(¬x | e) = 1. In order to calculate P(x, e), we must marginalize the joint
probability distribution over the variables that do not appear in x or e, which we will denote as Y.
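A hedged, dependency-free sketch of both inference tasks for the classic Cloudy/Sprinkler/Rain/WetGrass network is shown below; the conditional probability values are the usual textbook illustrations, not data from this course.

# A minimal sketch: joint-probability factorization and inference by enumeration
# for the Cloudy/Sprinkler/Rain/WetGrass network. CPT values are illustrative.
from itertools import product

P_C = 0.5                                          # P(Cloudy = True)
P_S_given_C = {True: 0.1, False: 0.5}              # P(Sprinkler = True | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}              # P(Rain = True | Cloudy)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.01}

def joint(c, s, r, w):
    """P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)."""
    p = P_C if c else 1 - P_C
    p *= P_S_given_C[c] if s else 1 - P_S_given_C[c]
    p *= P_R_given_C[c] if r else 1 - P_R_given_C[c]
    p *= P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return p

# P(WetGrass = True | Cloudy = True): marginalize Sprinkler and Rain, then normalize.
num = sum(joint(True, s, r, True) for s, r in product([True, False], repeat=2))
den = sum(joint(True, s, r, w) for s, r, w in product([True, False], repeat=3))
print("P(WetGrass | Cloudy) =", num / den)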

Support Vector Machines (SVM): It is a machine learning algorithm which

 Solves Classification Problem

 Uses a flexible representation of the class boundaries

 Implements automatic complexity control to reduce overfitting

 Has a single global minimum which can be found in polynomial time

A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges. However, it is mostly used for classification problems. In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyper-plane that best differentiates the two classes.

Working of SVM:
1. Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
Now, identify the right hyper-plane to classify star and circle.

“Select the hyper-plane which segregates the two classes better”. In this scenario, hyper-plane “B”
has excellently performed this job.

 Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and
C) and all are segregating the classes well. Now, How can we identify the right hyper-plane?
Here, maximizing the distances between nearest data point (either class) and hyper-plane will help
us to decide the right hyper-plane. This distance is called as Margin. Let’s look at the below
snapshot:

Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we choose C as the right hyper-plane. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3):Hint: Use the rules as discussed in previous section to
identify the right hyper-plane

Some of you may have selected the hyper-plane B as it has higher margin compared to A. But, here
is the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.

SVM Kernel: In the SVM classifier, it is easy to have a linear hyper-plane between two linearly separable classes. But another burning question arises: do we need to add features manually to obtain a hyper-plane for more complex data? No, the SVM algorithm has a technique called the kernel trick. An SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable problem. It is mostly useful in non-linear separation problems. Simply put, it does some extremely complex data transformations and then finds the way to separate the data based on the labels or outputs you have defined.
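A hedged scikit-learn sketch comparing a linear kernel with an RBF kernel on a synthetic, non-linearly separable data set:

# A minimal sketch: SVM classification with linear and RBF kernels (scikit-learn).
# The two-moons data set is synthetic and deliberately not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(kernel, "kernel, test accuracy:", clf.score(X_test, y_test))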

Analysis of time series:

Time series: It is a collection of observations made sequentially in time. In this, successive


observations are NOT independent and the order of observation is crucial.

Goals of time series analysis:

1. Descriptive: Identify patterns in correlated data trends and seasonal variation

2. Explanation: understanding and modeling the data

3. Forecasting: prediction of short-term trends from previous patterns

4. Intervention analysis: how does a single event change the time series?

5. Quality control: deviations of a specified size indicate a problem

Linear System Analysis:

A time series model is a tool used to predict future values of a series by analyzing the relationship
between the values observed in the series and the time of their occurrence. Time series models can
be developed using a variety of time series statistical techniques. If there has been any trend and/or
seasonal variation present in the data in the past then time series models can detect this variation,
use this information in order to fit the historical data as closely as possible, and in doing so improve
the precision of future forecasts. There are many traditional techniques used in time series analysis.
Some of these include:

■Exponential Smoothing

■Linear Time Series Regression and Curvefit

■Autoregression

■ARIMA (Autoregressive Integrated Moving Average)

■Intervention Analysis

■Seasonal Decomposition
ARIMA stands for AutoRegressive Integrated Moving Average, and the assumption of these models
is that the variation accounted for in the series variable can be divided into three components:
■Autoregressive (AR)

■Integrated (I) or Difference

■Moving Average (MA)

An ARIMA model can have any component, or combination of components, at both the
nonseasonal and seasonal levels. There are many different types of ARIMA models and the general
form of an ARIMA model is ARIMA(p,d,q)(P,D,Q), where:

■p refers to the order of the nonseasonal autoregressive process incorporated into the ARIMA
model (and P the order of the seasonal autoregressive process)

■d refers to the order of nonseasonal integration or differencing (and D the order of the seasonal
integration or differencing)

■q refers to the order of the nonseasonal moving average process incorporated in the model (and Q
the order of the seasonal moving average process).
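As a hedged sketch (the series is synthetic and the (p, d, q) order is chosen arbitrarily rather than identified from the data), statsmodels can fit a nonseasonal ARIMA model and produce short-term forecasts:

# A minimal sketch: fitting a nonseasonal ARIMA(p, d, q) model with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
trend = np.linspace(10, 20, 120)                      # synthetic upward trend
series = pd.Series(trend + rng.normal(scale=1.0, size=120))

model = ARIMA(series, order=(1, 1, 1)).fit()          # order chosen only for illustration
print(model.forecast(steps=6))                        # forecast the next six points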

Non linear dynamics:

Detecting trends and patterns in financial data is of great interest to the business world to support the decision-making process. A new generation of methodologies, including neural networks, knowledge-based systems, and genetic algorithms, has attracted attention for the analysis of trends and patterns. In particular, neural networks are being used extensively for financial forecasting with stock markets, foreign exchange trading, commodity futures trading, and bond yields. The application of neural networks in time series forecasting is based on the ability of neural networks to approximate nonlinear functions. In fact, neural networks offer a novel technique that does not require a pre-specification during the modelling process, because they independently learn the relationships inherent in the variables. The term neural network applies to a loosely related family of models, characterized by a large parameter space and a flexible structure, descended from studies of brain function.

Rule Induction:

Rule induction is a data mining process of deducing if-then rules from a data set. These symbolic decision rules explain an inherent relationship between the attributes and class labels in the data set. Many real-life experiences are based on intuitive rule induction. For example, we can proclaim a rule that states "if it is 8 a.m. on a weekday, then highway traffic will be heavy" and "if it is 8 p.m. on a Sunday, then the traffic will be light." These rules are not necessarily right all the time; 8 a.m. weekday traffic may be light during a holiday season. But, in general, these rules hold true and are deduced from real-life experience based on our everyday observations. Rule induction provides a powerful classification approach.
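One common way to induce readable if-then rules is to fit a shallow decision tree and print its branches; the sketch below does this with scikit-learn on a toy, invented version of the traffic example.

# A minimal sketch: inducing if-then rules from data via a shallow decision tree.
# Features: [hour_of_day, is_weekday]; label: 1 = heavy traffic, 0 = light. Data invented.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[8, 1], [9, 1], [20, 0], [21, 0], [8, 0], [18, 1], [23, 1], [7, 1]]
y = [1, 1, 0, 0, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["hour_of_day", "is_weekday"]))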

Neural Network: Learning and Generalization

Learning:

The principal reason why neural networks have attracted such interest is the existence of learning algorithms for neural networks: algorithms that use data to estimate the optimal weights in a network to perform some task. There are three basic approaches to learning in neural networks.
Supervised learning: It uses a training set that consists of a set of pattern pairs: an input pattern and the corresponding desired (or target) output pattern. The desired output may be regarded as the network's "teacher" for that input. The basic approach in supervised learning is for the network to compute the output that its current weights produce for a given input, and to compare this network output with the desired output. The aim of the learning algorithm is to adjust the weights so as to minimize the difference between the network output and the desired output.

Reinforcement learning: It uses much less supervision. If a network aims to perform some task, then the reinforcement signal is a simple "yes" or "no" at the end of the task to indicate whether the task has been performed satisfactorily.

Unsupervised learning: It only uses input data; there is no training signal, unlike in the previous two approaches. The aim of unsupervised learning is to make sense of some data set, for example by clustering similar patterns together or by compressing the data.

Generalization of Neural Network: One of the major advantages of neural nets is their ability to
generalize. This means that a trained net could classify data from the same class as the learning data
that it has never seen before. In real world applications developers normally have only a small part
of all possible patterns for the generation of a neural net. To reach the best generalization, the
dataset should be split into three parts:
 The training set is used to train a neural net. The error of this dataset is minimized during
training.
 The validation set is used to determine the performance of a neural network on patterns that
are not trained during learning.
 A test set is used for finally checking the overall performance of the neural net.

The learning should be stopped in the minimum of the validation set error. At this point the net
generalizes best. When learning is not stopped, overtraining occurs and the performance of the net
on the whole data decreases, despite the fact that the error on the training data still gets smaller.
After finishing the learning phase, the net should be finally checked with the third data set, the test
set.
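A hedged sketch of this split and of early stopping, using scikit-learn's MLPClassifier (which carves its own validation set out of the training data when early_stopping=True) on a synthetic data set:

# A minimal sketch: train/validation/test splitting with early stopping
# for a small neural network; the classification data set is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(32,),
                    early_stopping=True,       # hold out part of the training data...
                    validation_fraction=0.2,   # ...and stop when its score stops improving
                    max_iter=500,
                    random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))   # final check on unseen data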

Competitive learning: This is a form of unsupervised learning in artificial neural networks in which nodes compete for the right to respond to a subset of the input data. Models and algorithms based on the principle of competitive learning include vector quantization and self-organizing maps (Kohonen maps).

In a competitive learning model, there are hierarchical sets of units in the network with inhibitory
and excitatory connections. The excitatory connections are between individual layers and the
inhibitory connections are between units in layered clusters. Units in a cluster are either active or
inactive.
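A minimal winner-take-all sketch of competitive learning is shown below, using plain NumPy; the number of units, the learning rate and the epoch count are illustrative assumptions.

import numpy as np

def competitive_learning(X, n_units=3, lr=0.1, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise one weight vector per competing unit from randomly chosen input patterns.
    W = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    for _ in range(epochs):
        for x in rng.permutation(X):
            winner = np.argmin(np.linalg.norm(W - x, axis=1))   # the closest unit wins
            W[winner] += lr * (x - W[winner])                   # only the winner is updated
    return W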

Principal Component Analysis: Principal components analysis (PCA) is a statistical technique that identifies underlying linear patterns in a data set so that it can be expressed in terms of another data set of significantly lower dimension without much loss of information.

The final data set should explain most of the variance of the original data set by reducing the
number of variables. The final variables will be named as principal components.

The steps to perform principal components analysis are the following:

1. Subtract mean: The first step in the principal component analysis is to subtract the mean for
each variable of the data set.
2. Calculate the covariance matrix: The covariance of two random variables measures how the two variables vary together about their means. The sign of the covariance provides us with information about the relation between them:
• If the covariance is positive, then the two variables increase and decrease together.
• If the covariance is negative, then when one variable increases, the other decreases,
and vice versa.
These values determine the linear dependencies between the variables, which will be used to reduce
the data set's dimension.

3. Calculate eigenvectors and eigenvalues: Eigenvectors are defined as those vectors whose directions remain unchanged after the linear transformation has been applied. However, their length may not remain the same after the transformation, i.e., the result of this transformation is the vector multiplied by a scalar. This scalar is called the eigenvalue, and each eigenvector has one associated with it. The number of eigenvectors or components that we can calculate for each data set is equal to the data set's dimension. Since they are calculated from the covariance matrix described before, eigenvectors represent the directions in which the data have a higher variance. On the other hand, their respective eigenvalues determine the amount of variance that the data set has in that direction.
4. Select principal components: Among the eigenvectors calculated previously, we must select those onto which we will project the data. The selected eigenvectors will be called principal components. To establish a criterion for selecting the eigenvectors, we must first define the relative variance of each eigenvector and the total variance of the data set. The relative variance of an eigenvector measures how much information can be attributed to it. The total variance of a data set is the sum of the variances of all the variables. These two concepts are determined by the eigenvalues.
5. Reduce the data dimension: Once we have selected the principal components, the data must be projected onto them. Although this projection explains most of the variance of the original data, the information about the variance along the discarded components is lost. In general, this process is irreversible, which means that we cannot recover the original data from the projection.

In summary, PCA centres the data, computes the covariance matrix and its eigenvectors and eigenvalues, selects the eigenvectors with the largest eigenvalues as principal components, and projects the data onto them.
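The five steps above can be sketched in a few lines of Python; this is a minimal illustration assuming NumPy only, with the rows of X as observations and the columns as variables.

import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                      # 1. subtract the mean of each variable
    cov = np.cov(Xc, rowvar=False)               # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1]            #    sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # 4. select the principal components
    explained = eigvals[:n_components] / eigvals.sum()   # relative variance of each component
    scores = Xc @ components                     # 5. project (reduce) the data
    return scores, components, explained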
Fuzzy Logic: It is an approach to reasoning that resembles the way humans make decisions, going beyond the crisp digital values yes or no. It uses fuzzy sets, and a fuzzy logic computer process can work with natural language. Fuzzy systems are applied in rule-based automatic controllers, establish non-linear mappings, and are considered a consumer-oriented design method. Such a system works on the principle that, based on the degrees of the input state, a particular output is assigned. The word fuzzy refers to things that are not clear, i.e., imprecise. A fuzzy system comprises four components, namely the fuzzifier, rules, inference engine and defuzzifier. In fuzzy logic, true statements become a matter of degree.
Definition

It is defined as a control logic that uses degrees of input and output to approximate human reasoning, integrated with a rule-based implementation. The technique is used to handle imprecise information or facts which involve some degree of uncertainty.

Extracting Fuzzy Models from the Data

In the context of intelligent data analysis it is of great interest how such fuzzy models can
automatically be derived from example data. Since, besides prediction, understandability is of
prime concern, the resulting fuzzy model should offer insights into the underlying system. To
achieve this, different approaches exist that construct grid-based rule sets defining a global
granulation of the input space, as well as fuzzy graph based structures.

Extracting Grid-Based Fuzzy Models from Data

Grid-based rule sets model each input variable through a usually small set of linguistic values. The resulting rule base uses all or a subset of all possible combinations of these linguistic values for each variable, resulting in a global granulation of the feature space into "tiles". Extracting grid-based fuzzy models from data is straightforward when the input granulation is fixed, that is, the antecedents of all rules are predefined. Then only a matching consequent for each rule needs to be found.

Extracting Fuzzy Graphs from Data

In high-dimensional feature spaces a global granulation results in a large number of rules. For these tasks a fuzzy graph based approach is more suitable. A possible disadvantage of the individual membership functions is the potential loss of interpretation. Projecting all membership functions onto one variable will usually not lead to meaningful linguistic values. In many data analysis applications, however, such a meaningful granulation of all attributes is either not available or hard to determine automatically.

The primary function of a fuzzy graph is to serve as a representation of an imprecisely defined
dependency. Fuzzy graphs do not have a natural linguistic interpretation of the granulation of their
input space. The main advantage is the low dimensionality of the individual rules. The algorithm
only introduces restriction on few of the available input variables, thus making the extracted rules
easier to interpret. The final set of fuzzy points forms a fuzzy graph, where each fuzzy point is
associated with one output region and is used to compute a membership value for a certain input
pattern. The maximum degree of membership of all fuzzy points for one region determines the
overall degree of membership. Fuzzy inference then produces a soft value for the output and using
the well-known center-of-gravity method a final crisp output value can be obtained, if so desired.

Fuzzy Decision Trees

An extension to decision trees based on fuzzy logic can be derived. Different branches of the tree are then distinguished by fuzzy queries. The introduction of fuzzy set theory in Zadeh (1965) offered a general methodology that allows notions of vagueness and imprecision to be considered. Moreover, Zadeh's work allowed the possibility for previously defined techniques to be considered within a fuzzy environment. It was over ten years later that the area of decision trees benefited from this fuzzy environment opportunity.

Decision trees based on fuzzy set theory combine the advantage of the good comprehensibility of decision trees with the ability of fuzzy representations to deal with inexact and uncertain information.

In fuzzy set theory (Zadeh, 1965), the grade of membership of a value x to a set S is defined through a membership function μ(x) that can take a value in the range [0, 1]. The accompanying numerical attribute domain can be described by a finite series of MFs that each offers a grade of membership to describe x, which collectively form its concomitant fuzzy number. In this article, MFs are used to formulate linguistic variables for the considered attributes. These linguistic variables are made up of sets of linguistic terms which are defined by the MFs.

Construction of Fuzzy Decision Tree:

The small data set considered, consists of five objects, described by three condition attributes T1,
T2 and T3, and classified by a single decision attribute C, see Table 1.

If these values are considered imprecise (fuzzy), there is the option to transform the data values into fuzzy values. Here, an attribute is transformed into a linguistic variable, each described by two linguistic terms.
In Figure 2, the decision attribute C is shown to be described by the linguistic terms CL and CH (possibly denoting the terms low and high). These linguistic terms are themselves defined by the MFs μCL(·) and μCH(·). The hypothetical MFs shown have the respective defining values μCL(·): [−∞, −∞, 9, 25, 32] and μCH(·): [9, 25, 32, ∞, ∞]. To demonstrate their utilisation, for the object u2, with a value C = 17, its fuzzification creates the two values μCL(17) = 0.750 and μCH(17) = 0.250, and the larger value determines the linguistic term with which the object is most associated.
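The exact shape of the MFs in the source figure is not reproduced here, but the quoted values can be recovered under the assumption that each MF is piecewise linear with membership grades 0, 0.5, 1, 0.5, 0 at its five defining points (infinite end points giving flat shoulders). The following sketch, under that assumption, returns 0.750 and 0.250 for C = 17.

import math

def piecewise_mf(points, grades=(0.0, 0.5, 1.0, 0.5, 0.0)):
    # Piecewise-linear membership function through (points[i], grades[i]).
    def mu(x):
        if x <= points[0]:
            return grades[0]
        if x >= points[-1]:
            return grades[-1]
        for i in range(4):
            x0, x1, y0, y1 = points[i], points[i + 1], grades[i], grades[i + 1]
            if x0 <= x <= x1 and x1 > x0:
                if not math.isfinite(x1 - x0):           # open-ended shoulder segment
                    return y1 if x0 == -math.inf else y0
                return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        return 0.0
    return mu

INF = math.inf
mu_CL = piecewise_mf([-INF, -INF, 9, 25, 32])
mu_CH = piecewise_mf([9, 25, 32, INF, INF])
print(round(mu_CL(17), 3), round(mu_CH(17), 3))          # 0.75 0.25, matching the text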

A similar series of membership functions can be constructed for the three condition attributes, T1,
T2 and T3, Figure 3.

In Figure 3, the linguistic variable version of each condition attribute is described by two linguistic terms (possibly termed low and high), themselves defined by MFs. These series of MFs make it possible to fuzzify the example data set, see Table 2.
In Table 2, each object is described by a series of fuzzy values, two fuzzy values for each
attribute. Also shown in Table 2, in bold, are the larger of the values in each pair of fuzzy values,
with the respective linguistic term this larger value is associated with. Beyond the fuzzification of
the data set, attention turns to the construction of the concomitant fuzzy decision tree for this data.
Prior to this construction process, a threshold value of p = 0.75 for the minimum required truth level
was used throughout.

The construction process starts with the condition attribute that is the root node. For this, it is
necessary to calculate the classification ambiguity G(E) of each condition attribute. The evaluation
of a G(E) value is shown for the first attribute T1 (i.e. g(n(C| T1))), where it is broken down to the
fuzzy labels L and H, for L;

The subsethood values in this case are, for T1: S(T1L, CL) = 0.574 and S(T1L, CH) = 0.426, and S(T1H, CL) = 0.452 and S(T1H, CH) = 0.548. For T1L and T1H, the larger subsethood value (in bold) defines the possible classification for that path. In both cases these values are less than the threshold truth value 0.75 employed, so neither of these paths can be terminated at a leaf node; instead further augmentation of them is considered.

With three condition attributes included in the example data set, the possible augmentation to T1L is with either T2 or T3. Concentrating on T2, where with G(T1L) = 0.514, the ambiguity with partition evaluated for T2 (G(T1L and T2 | C)) has to be less than this value.

Starting with the weight values in the case of T1L and T2L, the ambiguity G(T1L and T2 | C) is evaluated; the concomitant value for T3 is G(T1L and T3 | C) = 0.487. The lower of these, G(T1L and T2 | C), is lower than the concomitant G(T1L) = 0.514, so less ambiguity would be found if the T2 attribute were augmented to the path T1 = L. For the subsequent subsethood values in each suggested classification path, the largest subsethood value is above the truth level threshold; therefore they are both leaf nodes leading from the T1 = L path. The construction process continues in a similar vein for the path T1 = H, with the resultant fuzzy decision tree in this case presented in Figure 4.
The fuzzy decision tree shows five rules (leaf nodes), R1, R2, …, R5, have been constructed.
There are a maximum of four levels to the tree shown, indicating a maximum of three condition
attributes are used in the rules constructed. In each non-root node shown the subsethood levels to
the decision attribute terms C = L and C = H are shown. On the occasions when the larger of the
subsethood values is above the defined threshold value of 0.75 then they are shown in bold and
accompany the node becoming a leaf node.

Stochastic Search by Simulated Annealing :

Stochastic search is the method of choice for solving many hard combinatorial problems. Stochastic search algorithms are designed for problems with inherent random noise or for deterministic problems solved by injected randomness. In structural optimization, these are problems with uncertainties in the design variables or those where adding random perturbations to deterministic design variables is the method used to perform the search. The search favors designs with better performance. An important feature of stochastic search algorithms is that they can carry out a broad search of the design space and thus avoid local optima. Also, stochastic search algorithms do not require gradients to guide the search, making them a good fit for discrete problems. However, there is no guarantee that an optimum solution will be found, and the algorithm must be run multiple times to make sure the attained solutions are robust. To handle constraints, penalties can also be applied on designs that violate constraints.
For constraints that are difficult to be formulated explicitly, a true/false check is straightforward to
implement. Randomly perturbed designs are checked against constraints, and only those passing the
check will enter the stage of performance evaluation. Stochastic search can be applied on one
design or on a population of them (Leng, 2015), using for example SA or GA, respectively. Arora
(2004) depicts the logic of SA and GA for convenience of application. A monograph devoted to
stochastic search and optimization (Spall, 2003) provides further details on a broad scope, including
mathematical theory, algorithm design, and applications in simulation and control.

The simulated annealing approach is in the realm of problem solving methods that make use of
paradigms found in nature. Annealing denotes the process of cooling a molten substance and is, for
example, used to harden steel. One major effect of "careful" annealing is the condensing of matter
into a crystalline solid. The hardening of steel is achieved by first raising the temperature close to the transition to its liquid phase, then cooling the steel slowly to allow the molecules to arrange in an ordered lattice pattern. Hence, annealing can be regarded as an adaptation process optimizing the stability of the final crystalline solid. Whether a state of minimum free energy is reached depends very much on the actual speed with which the temperature is decreased.
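A generic simulated annealing sketch is given below; the cost function and neighbour generator are user-supplied, and the cooling schedule and parameters are illustrative assumptions rather than values from the text.

import math
import random

def simulated_annealing(cost, neighbour, x0, t_start=1.0, t_end=1e-3, alpha=0.95, iters=100):
    x, best = x0, x0
    t = t_start
    while t > t_end:
        for _ in range(iters):
            cand = neighbour(x)
            delta = cost(cand) - cost(x)
            # Always accept better moves; accept worse ones with probability exp(-delta / t).
            if delta < 0 or random.random() < math.exp(-delta / t):
                x = cand
                if cost(x) < cost(best):
                    best = x
        t *= alpha                      # slowly "cool" the temperature
    return best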
Unit-III

Mining Data Streams


Introduction to streams concepts:

Stream

•Refers to a sequence of data elements or symbols made available over time

•Data stream transmits from a source and receives at the processing end in a network

•A continuous stream of data flows between the source and receiver ends, and which is processed in
real time

• Also refers to communication of bytes or characters over sockets in a computer network

• A program uses stream as an underlying data type in inter-process communication channels.

Data Streams: In many data mining situations, we do not know the entire data set in advance

• Stream Management is important when the input rate is controlled externally:

• Google queries

• Twitter or Facebook status updates


• We can think of the data as infinite and non-stationary (the distribution changes over time)

Imagine a factory with 500 sensors capturing 10 KB of information every second; in one hour nearly 18 GB of information is captured, and 432 GB daily. This massive information needs to be analysed in real time (or in the shortest time possible) to detect irregularities or deviations in the system and react quickly. Stream Mining enables the analysis of massive quantities of data in real time using bounded resources.

Data Stream Mining is the process of extracting knowledge from continuous, rapid data records which come to the system in a stream. A Data Stream is an ordered sequence of instances in time. Data Stream Mining fulfils the following characteristics:

• Continuous Stream of Data. A high amount of data arrives in an infinite stream; we do not know the entire dataset.
• Concept Drifting. The data change or evolve over time.
• Volatility of data. The system does not store the data received (limited resources). When data is analysed it is discarded or summarised.

Stream Data Model

In analogy to a database-management system, we can view a stream processor as a kind of data-


management system, the high-level organization of which is suggested in Fig. 4.1. Any number of
streams can enter the system. Each stream can provide elements at its own schedule; they need not
have the same data rates or data types, and the time between elements of one stream need not be
uniform. The fact that the rate of arrival of stream elements is not under the control of the system
distinguishes stream processing from the processing of data that goes on within a database-
management system. The latter system controls the rate at which data is read from the disk, and
therefore never has to worry about data getting lost as it attempts to execute queries.

Streams may be archived in a large archival store, but we assume it is not possible to answer queries
from the archival store. It could be examined only under special circumstances using time-
consuming retrieval processes. There is also a working store, into which summaries or parts of
streams may be placed, and which can be used for answering queries. The working store might be
disk, or it might be main memory, depending on how fast we need to process queries.

But either way, it is of sufficiently limited capacity that it cannot store all the data from all the
streams
Input elements enter at a rapid rate, at one or more input ports (i.e., streams). We call elements of
the stream tuples. The system cannot store the entire stream accessibly. The question arises:

Q: How do you make critical calculations about the stream using a limited amount of (secondary)
memory?

Streaming Data Architecture

A streaming data architecture is an information technology framework that puts the focus on
processing data in motion and treats extract-transform-load (ETL) batch processing as just one more
event in a continuous stream of events. This type of architecture has three basic components -- an
aggregator that gathers event streams and batch files from a variety of data sources, a broker that
makes data available for consumption and an analytics engine that analyzes the data, correlates
values and blends streams together.

The system that receives and sends data streams and executes the application and real-time analytics
logic is called the stream processor. Because a streaming data architecture supports the concept of
event sourcing, it reduces the need for developers to create and maintain shared databases. Instead,
all changes to an application’s state are stored as a sequence of event-driven processing (ESP)
triggers that can be reconstructed or queried when necessary. Upon receiving an event, the stream
processor reacts in real- or near real-time and triggers an action, such as remembering the event for
future reference.

The growing popularity of streaming data architectures reflects a shift in the development of
services and products from a monolithic architecture to a decentralized one built with
microservices. This type of architecture is usually more flexible and scalable than a classic
database-centric application architecture because it co-locates data processing with storage to lower
application response times (latency) and improve throughput. Another advantage of using a streaming data architecture is that it takes the time at which an event occurs into account, which makes it easier for an application's state and processing to be partitioned and distributed across many instances.

Streaming data architectures enable developers to develop applications that use both bound and
unbound data in new ways. For example, Alibaba’s search infrastructure team uses a streaming data
architecture powered by Apache Flink to update product detail and inventory information in real-
time. Netflix also uses Flink to support its recommendation engines and ING, the global bank based
in The Netherlands, uses the architecture to prevent identity theft and provide better fraud
protection. Other platforms that can accommodate both stream and batch processing include Apache
Spark, Apache Storm, Google Cloud Dataflow and AWS Kinesis.

Stream Computing

A high-performance computer system that analyzes multiple data streams from many sources live. The word stream in stream computing is used to mean pulling in streams of data, processing the data and streaming it back out as a single flow. Stream computing uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy when dealing with data handling and analysis.

In June 2007, IBM announced its stream computing system, called System S. This system runs on
800 microprocessors and the System S software enables software applications to split up tasks and
then reassemble the data into an answer. ATI Technologies also announced a stream computing
technology that describes its technology that enables the graphics processors (GPUs) to work in
conjunction with high-performance, low-latency CPUs to solve complex computational problems.
ATI’s stream computing technology is derived from a class of applications that run on the GPU
instead of a CPU.
Stream computing enables continuous analysis of massive volumes of streaming data with sub-millisecond response times. The term is also used for the usage of the single-instruction multiple-data (SIMD) computing paradigm to solve certain computational problems.

Sampling data from a Stream

Stream sampling is the process of collecting a representative sample of the elements of a data
stream. The sample is usually much smaller than the entire stream, but can be designed to retain
many important characteristics of the stream, and can be used to estimate many important
aggregates on the stream.

Since we cannot store the entire stream, one obvious approach is to store a sample; but as the stream grows, the sample also gets bigger.

Two different problems:

(1) Sample a fixed proportion of elements in the stream (say 1 in 10)

(2) Maintain a random sample of fixed size over a potentially infinite stream

At any "time" k we would like a random sample of s elements.

What is the property of the sample we want to maintain? For all time steps k, each of the k elements seen so far has an equal probability of being sampled. A reservoir-sampling sketch that maintains exactly this property is given below.
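One standard way to maintain this property is reservoir sampling; a minimal sketch (plain Python, fixed sample size s) follows.

import random

def reservoir_sample(stream, s):
    # After k elements, every element seen so far is in the sample with probability s/k.
    sample = []
    for k, element in enumerate(stream, start=1):
        if k <= s:
            sample.append(element)            # fill the reservoir first
        else:
            j = random.randint(1, k)          # keep the new element with probability s/k
            if j <= s:
                sample[j - 1] = element       # replace a uniformly chosen resident
    return sample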

Problem 1: Sampling fixed proportion


Scenario: Search engine query stream

Stream of tuples: (user, query, time)


Answer questions such as: How often did a user run the same query in a single day?

Have space to store 1/10th of query stream


Naïve solution:

Generate a random integer in [0..9] for each query

Store the query if the integer is 0, otherwise discard

Simple question: What fraction of queries by an average search engine user are duplicates?

Suppose each user issues x queries once and d queries twice (total of x+2d queries)

Correct answer: d/(x+d)

Proposed solution: We keep 10% of the queries

Sample will contain x/10 of the singleton queries and 2d/10 of the duplicate queries at least once

But only d/100 pairs of duplicates

d/100 = 1/10 ∙ 1/10 ∙ d

Of d “duplicates” 18d/100 appear exactly once

18d/100 = ((1/10 ∙ 9/10)+(9/10 ∙ 1/10)) ∙ d

Solution:

• Pick 1/10th of users and take all their searches in the sample

• Use a hash function that hashes the user name or user id uniformly into 10 buckets
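A minimal sketch of this user-based sampling follows; the use of hashlib and MD5 here is an illustrative choice of hash function, not one prescribed by the text.

import hashlib

BUCKETS = 10                                   # sample 1/10th of the users

def in_sample(user_id, buckets_kept=1):
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return h % BUCKETS < buckets_kept          # True for roughly 1 in 10 users

sample = []
def process(stream_tuple):                     # stream_tuple = (user, query, time)
    user, query, time = stream_tuple
    if in_sample(user):
        sample.append(stream_tuple)            # keep all queries of sampled users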

Filtering Streams

Another common process on streams is selection, or filtering. We want to accept those tuples in the
stream that meet a criterion. Accepted tuples are passed to another process as a stream, while other
tuples are dropped. If the selection criterion is a property of the tuple that can be calculated (e.g., the
first component is less than 10), then the selection is easy to do. The problem becomes harder when
the criterion involves lookup for membership in a set. It is especially hard when that set is too large to store in main memory.

The Bloom Filter


A Bloom filter consists of:

1. An array of n bits, initially all 0’s.

2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps "key" values to n buckets, corresponding to the n bits of the bit-array.

3. A set S of m key values.

The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while
rejecting most of the stream elements whose keys are not in S.

To initialize the bit array, begin with all bits 0. Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.

To test a key K that arrives in the stream, check that all of

h1(K), h2(K), . . . , hk(K)

are 1's in the bit-array. If all are 1's, then let the stream element through. If one or more of these bits are 0, then K could not be in S, so reject the stream element.
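A minimal Bloom filter sketch is given below; deriving the k hash functions by salting a single hash, and the particular values of n and k, are illustrative assumptions.

import hashlib

class BloomFilter:
    def __init__(self, n_bits=1000, k=3):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits               # the array of n bits, initially all 0

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.n

    def add(self, key):                        # hash each key of S and set the bits
        for pos in self._hashes(key):
            self.bits[pos] = 1

    def might_contain(self, key):              # all k bits 1 -> let the element through
        return all(self.bits[pos] for pos in self._hashes(key))

bf = BloomFilter()
bf.add("alice@example.com")                    # an illustrative member of S
print(bf.might_contain("alice@example.com"))   # True
print(bf.might_contain("mallory@example.com")) # almost certainly False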

Counting Distinct Element of a Stream

Suppose stream elements are chosen from some universal set. We would like to know how many
different elements have appeared in the stream, counting either from the beginning of the stream or
from some known time in the past.

A similar problem is a Web site like Google that does not require login to issue a search query, and may be able to identify users only by the IP address from which they send the query. There are about 4 billion IP addresses; sequences of four 8-bit bytes will serve as the universal set in this case.

The obvious way to solve the problem is to keep in main memory a list of all the elements seen so
far in the stream. Keep them in an efficient search structure such as a hash table or search tree, so
one can quickly add new elements and check whether or not the element that just arrived on the
stream was already seen. As long as the number of distinct elements is not too great, this structure

can fit in main memory and there is little problem obtaining an exact answer to the question how
many distinct elements appear in the stream.

However, if the number of distinct elements is too great, or if there are too many streams that need
to be processed at once (e.g., Yahoo! wants to count the number of unique users viewing each of its
pages in a month), then we cannot store the needed data in main memory. There are several options.
We could use more machines, each machine handling only one or several of the streams. We could
store most of the data structure in secondary memory and batch stream elements so whenever we
brought a disk block to main memory there would be many tests and updates to be performed on the
data in that block.

The Flajolet-Martin Algorithm

It is possible to estimate the number of distinct elements by hashing the elements of the universal
set to a bit-string that is sufficiently long. The length of the bit-string must be sufficient that there
are more possible results of the hash function than there are elements of the universal set. For
example, 64 bits is sufficient to hash URL’s. We shall pick many different hash functions and hash
each element of the stream using these hash functions. The important property of a hash function is
that when applied to the same element, it always produces the same result.
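The standard Flajolet-Martin estimate is 2^R, where R is the largest number of trailing 0s in any hash value seen; combining the estimates from several hash functions, as suggested above, reduces the variance. A minimal single-hash sketch (SHA-256 as an illustrative hash) follows.

import hashlib

def trailing_zeros(x):
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream, salt=""):
    # Hash each element, track the maximum tail length R, and estimate 2^R distinct elements.
    max_r = 0
    for element in stream:
        h = int(hashlib.sha256((salt + str(element)).encode()).hexdigest(), 16)
        max_r = max(max_r, trailing_zeros(h))
    return 2 ** max_r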

Estimating Moments

The problem, called computing “moments,” involves the distribution of frequencies of different
elements in the stream.

Definition of Moments

Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the ith element for any i. Let mi be the number of occurrences of the ith element for any i. Then the kth-order moment (or just kth moment) of the stream is the sum over all i of (mi)^k.

The 1st moment is the sum of the mi's, which must be the length of the stream. Thus, first moments are especially easy to compute; just count the length of the stream seen so far.

The second moment is the sum of the squares of the mi's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.

To see the distinction, suppose we have a stream of length 100, in which eleven different elements appear. The most even distribution of these eleven elements would have one appearing 10 times and the other ten appearing 9 times each. In this case, the surprise number is 10^2 + 10 × 9^2 = 910. At the other extreme, one of the eleven elements could appear 90 times and the other ten appear 1 time each. Then, the surprise number would be 90^2 + 10 × 1^2 = 8110.

The Alon-Matias-Szegedy Algorithm for Second Moments


For now, let us assume that a stream has a particular length n. We shall show how to deal with growing streams in the next section. Suppose we do not have enough space to count all the mi's for all the elements of the stream. We can still estimate the second moment of the stream using a limited amount of space; the more space we use, the more accurate the estimate will be. We compute some number of variables. For each variable X, we store:

1. A particular element of the universal set, which we refer to as X.element, and

2. An integer X.value, which is the value of the variable. To determine the value of a variable X, we choose a position in the stream between 1 and n, uniformly and at random. Set X.element to be the element found there, and initialize X.value to 1. As we read the stream, add 1 to X.value each time we encounter another occurrence of X.element.

Example 4.7: Suppose the stream is a, b, c, b, d, a, c, d, a, b, d, c, a, a, b. The length of the stream is n = 15. Since a appears 5 times, b appears 4 times, and c and d appear three times each, the second moment for the stream is 5^2 + 4^2 + 3^2 + 3^2 = 59. Suppose we keep three variables, X1, X2, and X3. Also, assume that at "random" we pick the 3rd, 8th, and 13th positions to define these three variables.

When we reach position 3, we find element c, so we set X1.element = c and X1.value = 1.

Position 4 holds b, so we do not change X1. Likewise, nothing happens at positions 5 or 6. At position 7, we see c again, so we set X1.value = 2.

At position 8 we find d, and so set X2.element = d and X2.value = 1. Positions 9 and 10 hold a and b, so they do not affect X1 or X2. Position 11 holds d so we set X2.value = 2, and position 12 holds c so we set X1.value = 3. At position 13, we find element a, and so set X3.element = a and X3.value = 1. Then, at position 14 we see another a and so set X3.value = 2. Position 15, with element b, does not affect any of the variables, so we are done, with final values X1.value = 3 and X2.value = X3.value = 2. We can derive an estimate of the second moment from any variable X. This estimate is n(2X.value − 1).

Consider the three variables from the above example. From X1 we derive the estimate n(2X1.value − 1) = 15 × (2 × 3 − 1) = 75. The other two variables, X2 and X3, each have value 2 at the end, so their estimates are 15 × (2 × 2 − 1) = 45. Recall that the true value of the second moment for this stream is 59. On the other hand, the average of the three estimates is 55, a fairly close approximation.
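A minimal sketch of the AMS estimator described in this example follows (plain Python; the stream must have a known length n, and the number of variables is illustrative).

import random

def ams_second_moment(stream, num_vars=3):
    n = len(stream)
    positions = set(random.sample(range(n), num_vars))   # random positions defining the variables
    variables = []                                        # each variable is [element, value]
    for i, item in enumerate(stream):
        for var in variables:
            if item == var[0]:
                var[1] += 1                               # another occurrence of X.element
        if i in positions:
            variables.append([item, 1])                   # set X.element and X.value = 1
    estimates = [n * (2 * value - 1) for _, value in variables]
    return sum(estimates) / len(estimates)

stream = list("abcbdacdabdcaab")                          # the stream of Example 4.7
print(ams_second_moment(stream))                          # the true second moment is 59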

Counting Ones in a Window

Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N . As in
previous sections, we focus on the situation where we cannot afford to store the entire window.
After showing an approximate algorithm for the binary case, we discuss how this idea can be
extended to summing numbers.

The Cost of Exact Counts

To begin, suppose we want to be able to count exactly the number of 1's in the last k bits for any k ≤ N. Then we claim it is necessary to store all N bits of the window, as any representation that used fewer than N bits could not work. In proof, suppose we have a representation that uses fewer than N bits to represent the N bits in the window. Since there are 2^N sequences of N bits, but fewer than 2^N representations, there must be two different bit strings w and x that have the same representation. Since w ≠ x, they must differ in at least one bit. Let the last k − 1 bits of w and x agree, but let them differ on the kth bit from the right end.

Example : If w = 0101 and x = 1010, then k = 1, since scanning from the right, they first disagree at
position 1. If w = 1001 and x = 0101, then k = 3, because they first disagree at the third position
from the right.

Suppose the data representing the contents of the window is whatever sequence of bits represents
both w and x. Ask the query “how many 1’s are in the last k bits?” The query-answering algorithm
will produce the same answer, whether the window contains w or x, because the algorithm can only
see their representation. But the correct answers are surely different for these two bit-strings. Thus,
we have proved that we must use at least N bits to answer queries about the last k bits for any k.

In fact, we need N bits, even if the only query we can ask is “how many 1’s are in the entire window
of length N ?” The argument is similar to that used above. Suppose we use fewer than N bits to
represent the window, and therefore we can find w, x, and k as above. It might be that w and x have

the same number of 1’s, as they did in both cases of Example 4.10. However, if we follow the
current window by any N − k bits, we will have a situation where the true window contents
resulting from w and x are identical except for the leftmost bit, and therefore, their counts of 1’s are
unequal. However, since the representations of w and x are the same, the representation of the
window must still be the same if we feed the same bit sequence to these representations.

The Datar-Gionis-Indyk-Motwani Algorithm

We shall present the simplest case of an algorithm called DGIM. This version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and allows us to estimate the number of 1's in the window with an error of no more than 50%. Later, we shall discuss an improvement of the method that limits the error to any fraction ε > 0, and still uses only O(log^2 N) bits (although with a constant factor that grows as ε shrinks).

To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.

Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by log₂ N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is.

We divide the window into buckets, consisting of:

1. The timestamp of its right (most recent) end.

2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number
of 1’s as the size of the bucket.

To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end. To represent the number of 1's we only need log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to represent a bucket.

There are six rules that must be followed when representing a stream by buckets.

• The right end of a bucket is always a position with a 1.

• Every position with a 1 is in some bucket.

• No position is in more than one bucket.

• There are one or two buckets of any given size, up to some maximum size.

• All sizes must be a power of 2.

• Buckets cannot decrease in size as we move to the left (back in time).
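A compact DGIM sketch that follows these rules is given below; it stores buckets as (right-end timestamp, size) pairs and, for simplicity, uses absolute timestamps rather than timestamps modulo N.

class DGIM:
    def __init__(self, N):
        self.N = N
        self.t = 0                     # most recent timestamp
        self.buckets = []              # newest first: (right-end timestamp, size)

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            size = 1
            while True:                # whenever three buckets share a size, merge the two oldest
                same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                if len(same) < 3:
                    break
                i, j = same[-1], same[-2]                          # the two oldest buckets of this size
                self.buckets[j] = (self.buckets[j][0], size * 2)   # keep the more recent right end
                del self.buckets[i]
                size *= 2

    def count_ones(self, k):
        # Estimate the number of 1's among the last k bits (k <= N).
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.t - k:
                total += size
                oldest = size          # ends as the size of the oldest bucket in range
        return total - oldest // 2     # count only half of that oldest bucket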

Decaying Windows

We have assumed that a sliding window held a certain tail of the stream, either the most recent N elements for fixed N, or all the elements that arrived after some time in the past. Sometimes we do not want to make a sharp distinction between recent elements and those in the distant past, but want to weight the recent elements more heavily. In this section, we consider "exponentially decaying windows," and an application where they are quite useful: finding the most common "recent" elements.

The Problem of Most-Common Elements

Suppose we have a stream whose elements are the movie tickets purchased all over the world, with
the name of the movie as part of the element. We want to keep a summary of the stream that is the
most popular movies “currently.” While the notion of “currently” is imprecise, intuitively, we want
to discount the popularity of a movie like Star Wars–Episode 4, which sold many tickets, but most
of these were sold decades ago. On the other hand, a movie that sold n tickets in each of the last 10
weeks is probably more popular than a movie that sold 2n tickets last week but nothing in previous
weeks. One solution would be to imagine a bit stream for each movie. The ith bit has value 1 if the
ith ticket is for that movie, and 0 otherwise. Pick a window size N , which is the number of most
recent tickets that would be considered in evaluating popularity. Then, use the method of Section
4.6 to estimate the number of tickets for each movie, and rank movies by their estimated counts.
This technique might work for movies, because there are only thousands of movies, but it would fail
if we were instead recording the popularity of items sold at Amazon, or the rate at which different
Twitter-users tweet, because there are too many Amazon products and too many tweeters.

Definition of the Decaying Window

An alternative approach is to redefine the question so that we are not asking for a count of 1's in a window. Rather, let us compute a smooth aggregation of all the 1's ever seen in the stream, with decaying weights, so the further back in the stream, the less weight is given. Formally, let a stream currently consist of the elements a1, a2, . . . , at, where a1 is the first element to arrive and at is the current element. Let c be a small constant, such as 10^−6 or 10^−9. Define the exponentially decaying window for this stream to be the sum over i = 1 to t of ai(1 − c)^(t−i).

The effect of this definition is to spread out the weights of the stream elements as far back in time as the stream goes. In contrast, a fixed window with the same sum of the weights, 1/c, would put equal weight 1 on each of the most recent 1/c elements to arrive and weight 0 on all previous elements.

It is much easier to adjust the sum in an exponentially decaying window than in a sliding window of fixed length. In the sliding window, we have to worry about the element that falls out of the window each time a new element arrives. That forces us to keep the exact elements along with the sum, or to use an approximation scheme such as DGIM. However, when a new element a(t+1) arrives at the stream input, all we need to do is:

1. Multiply the current sum by 1 − c.

2. Add a(t+1).

The reason this method works is that each of the previous elements has now moved one position further from the current element, so its weight is multiplied by 1 − c. Further, the weight on the current element is (1 − c)^0 = 1, so adding a(t+1) is the correct way to include the new element's contribution.

Finding the Most Popular Elements

Let us return to the problem of finding the most popular movies in a stream of ticket sales. 6 We
shall use an exponentially decaying window with a constant c, which you might think of as 10 −9 .
That is, we approximate a sliding window holding the last one billion ticket sales. For each movie,
we imagine a separate stream with a 1 each time a ticket for that movie appears in the stream, and a

0 each time a ticket for some other movie arrives. The decaying sum of the 1’s measures the current
popularity of the movie. We imagine that the number of possible movies in the stream is huge, so
we do not want to record values for the unpopular movies. Therefore, we establish a threshold, say
1/2, so that if the popularity score for a movie goes below this number, its score is dropped from the
counting. For reasons that will become obvious, the threshold must be less than 1, although it can be
any number less than 1. When a new ticket arrives on the stream, do the following:
1. For each movie whose score we are currently maintaining, multiply its score by (1 − c).

2. Suppose the new ticket is for movie M . If there is currently a score for M , add 1 to that score. If
there is no score for M , create one and initialize it to 1.

3. If any score is below the threshold 1/2, drop that score.

It may not be obvious that the number of movies whose scores are maintained at any time is
limited. However, note that the sum of all scores is 1/c. There cannot be more than 2/c movies with
score of 1/2 or more, or else the sum of the scores would exceed 1/c. Thus, 2/c is a limit on the
number of movies being counted at any time. Of course in practice, the ticket sales would be
concentrated on only a small number of movies at any time, so the number of actively counted
movies would be much less than 2/c.
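A minimal sketch of this procedure follows; the decay constant c and the 1/2 threshold are taken from the text, while the ticket stream itself is illustrative.

def update_scores(scores, movie, c=1e-9, threshold=0.5):
    for m in scores:
        scores[m] *= (1 - c)                        # 1. decay every maintained score
    scores[movie] = scores.get(movie, 0.0) + 1.0    # 2. add 1 for the new ticket's movie
    for m in [m for m, s in scores.items() if s < threshold]:
        del scores[m]                               # 3. drop scores below the threshold
    return scores

scores = {}
for ticket in ["StarWars", "Dune", "Dune", "StarWars", "Dune"]:   # illustrative stream
    update_scores(scores, ticket, c=0.1)            # a large c so the decay is visible
print(scores)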

Real-time analytics

• Refers to finding meaningful patterns in data at the actual time of receiving

• Real-Time Analytics Platform (RTAP) analyses the data, correlates, and predicts the
outcomes in the real time.

RTAP

• Manages and processes data and helps timely decision-making

• Helps to develop dynamic analysis applications

• Leads to evolution of business intelligence

• Apache Spark Streaming—a Big Data platform for data stream analytics in real time.

• Cisco Connected Streaming Analytics (CSA)—a platform that delivers insights from high-
velocity streams of live data from multiple sources and enables immediate action.

RTAP Applications:

• Fraud detection systems for online transactions

• Log analysis for understanding usage pattern

• Click analysis for online recommendations

• Social Media Analytics

• Push notifications to the customers for location-based advertisements for retail

• Action for emergency services such as fires and accidents in an industry


• Any abnormal measurements require immediate reaction in healthcare monitoring

Real-Time Sentiment Analysis:

Positive/Negative Sentiments

Sentiment analysis features

1.NEGATION

2. POSITIVE SMILEY

3. NEGATIVE SMILEY

4. DONT—YOU, OH, SO, AS FAR AS,

5. LAUGH

Stock Market Predictions

Data science is being used to provide a unique understanding of the stock market and financial data.
Securities, commodities, and stocks follow some basic principles for trading. We can either sell,
buy, or hold. The goal is to make the largest profit possible.

Trading platforms became very popular in the last two decades, but each platform offers different
options, tools, fees, etc. Despite this growing trend, Canadians still haven’t been able to access zero
trading commission platforms. Gary Stevens from Hosting Canada conducted a 12-month research
on how some of the most popular stock trading platforms work, and compared what each of them
offers to its users. You need to understand how they work in order to pick what’s best for you – and
Gary’s thorough guide is able to help you with that.

There are a lot of phrases used in data science that a person would have to be a scientist to know. At
its most basic level, data science is math that is sprinkled with an understanding of programming
and statistics.

There are certain concepts in data science that are used when analyzing the market. In this context,
we are using the term “analyze” to determine whether it is worth it to invest in a stock. There are
some basic data science concepts that are good to be familiar with.

Algorithms are used extensively in data science. Basically, an algorithm is a group of rules needed
to perform a task. You have likely heard about algorithms being used when buying and selling
stocks. Algorithmic trading is where algorithms set rules for things like when to buy a stock or
when to sell a stock.
For example, an algorithm could be set to purchase a stock once it drops by eight percent over the
course of the day or to sell the stock if it loses 10 percent of its value compared to when it was first
purchased. Algorithms are designed to function without human intervention. You may have heard of
them referred to as bots. Like robots, they make calculated decisions devoid of emotions.

We are not talking about preparing to run a 50 meter race. In machine learning and data science,
training is where data is used to train a machine on how to respond. We can create a learning model.
This machine learning model makes it possible for a computer to make accurate predictions based
on the information it learned from the past. If you want to teach a machine to predict the future of
stock prices, it would need a model of the stock prices of the previous year to use as a base to
predict what will happen.

We have the data for stock prices for the last year. The training set would be the data from January
to October. Then, we will use November and December as our testing set. Our machine should have
learned by evaluating how the stocks worked from January through October. Now, we will ask it to
predict what should have happened in November and December of that year. The predictions the
machine makes will get compared to the real prices. The amount of variation between what the model predicts and the real data is what we are trying to minimize as we adjust our training model.
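A rough sketch of this split-and-compare procedure is shown below. It assumes pandas is available, that prices is a daily price series for one year indexed by date, and that model_fit and model_predict are placeholders for whatever learning model is used; the cut-off date is illustrative.

import pandas as pd

def split_and_evaluate(prices, model_fit, model_predict, cutoff="2023-11-01"):
    train = prices[prices.index < cutoff]           # January to October: training set
    test = prices[prices.index >= cutoff]           # November and December: testing set
    model = model_fit(train)                        # learn from the past
    predictions = model_predict(model, test.index)  # predict the held-out months
    error = (predictions - test).abs().mean()       # the variation we try to minimise
    return error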

The Role of Modeling to Predict Stock Prices

Data science relies heavily on modeling. This is an approach that uses math to examine past
behaviors with the goal of forecasting future outcomes. In the stock market, a time series model is
used. A time series is data, which in this case refers to the value of a stock, that is indexed over a
period of time. This period of time could be divided hourly, daily, monthly, or even by the minute. A
time series model is created by using machine learning and/or deep learning models to accumulate
the price data. The data needs to be analyzed and then fitted to match the model. This is what makes
it possible to predict future stock prices over a set timetable.

A second type of modeling that is used in machine learning and in data science is referred to as a
classification model. These models are given data points and then they strive to classify or predict
what is represented by those data points.

When discussing the stock market or stocks in general, a machine learning model can be given
financial data like the P/E ratio, total debt, volume, etc. and then determine if a stock is a sound
investment. Depending on the financials we give, a model can determine if now is the time to sell,
hold, or buy a stock.
A model could fit the data with so much complexity that it captures noise and obscures the true relationship between the features and the target variable. This is referred to as overfitting. Underfitting is where a model doesn't sufficiently match the data, so the results are predictions that are too simple.

Overfitting is a problem if the model finds it difficult to identify stock market trends, so it can’t
adapt to future events. Underfitting is where a model predicts the simple average price based on the
stock’s entire history. Both overfitting and underfitting lead to poor forecasts and predictions.

We have barely scratched the surface when discussing the link between machine learning concepts
and stock market investments. However, it is important to understand the basic concepts we have
discussed today as they serve as a basis for comprehending how machine learning is used to predict
what the stock market can do. There are more concepts that can be learned by those who want to get
to the nitty-gritty of data science and how it relates to the stock market.
Unit-IV

Frequent Itemsets and Clustering


Mining frequent itemsets: Today’s digital world is constantly generating data from traffic
sensors, health sensors, customer transactions, and various other Internet of Things (IoT) devices.
Continuous never-ending streams of Big Data are creating new sets of challenges from the
perspective of data mining. Mining only static data in snapshots of time is no longer useful.
Streaming data, being dynamic or volatile in nature, has changing patterns over time, and this is
more technically known as concept drift. Algorithms developed for mining streaming data have to
be able to detect and work with concept drifts, hence the need for new streaming data mining
approaches.

Frequent itemset mining, a precursor to association rule mining, typically requires significant
processing power since this process involves multiple passes through a database, and this can be a
challenge in large streaming datasets, even though there has been a great deal of progress in finding frequent itemsets and association rules in static or permanent databases.

Market basket Modeling

Market Basket Analysis is a technique which identifies the strength of association between pairs of
products purchased together and identify patterns of co-occurrence. A co-occurrence is when two or
more things take place together.

Market Basket Analysis creates If-Then scenario rules, for example, if item A is purchased then
item B is likely to be purchased. The rules are probabilistic in nature or, in other words, they are
derived from the frequencies of co-occurrence in the observations. Frequency is the proportion of
baskets that contain the items of interest. The rules can be used in pricing strategies, product
placement, and various types of cross-selling strategies.

How Market Basket Analysis Works

In order to make it easier to understand, think of Market Basket Analysis in terms of shopping at a
supermarket. Market Basket Analysis takes data at transaction level, which lists all items bought by
a customer in a single purchase. The technique determines relationships of what products were
purchased with which other product(s). These relationships are then used to build profiles
containing If-Then rules of the items purchased.

The rules could be written as:


If {A} Then {B}

The If part of the rule (the {A} above) is known as the antecedent and the THEN part of the rule is
known as the consequent (the {B} above). The antecedent is the condition and the consequent is the
result. The association rule has three measures that express the degree of confidence in the rule,
Support, Confidence, and Lift.

For example, you are in a supermarket to buy milk. Based on the analysis, are you more likely to
buy apples or cheese in the same transaction than somebody who did not buy milk?

Practical Applications of Market Basket Analysis

When one hears Market Basket Analysis, one thinks of shopping carts and supermarket shoppers. It
is important to realize that there are many other areas in which Market Basket Analysis can be
applied. An example of Market Basket Analysis for a majority of Internet users is a list of
potentially interesting products for Amazon. Amazon informs the customer that people who bought
the item being purchased by them, also reviewed or bought another list of items. A list of
applications of Market Basket Analysis in various industries is listed below:

• Retail. In Retail, Market Basket Analysis can help determine what items are purchased
together, purchased sequentially, and purchased by season. This can assist retailers to
determine product placement and promotion optimization (for instance, combining product
incentives). Does it make sense to sell soda and chips or soda and crackers?
• Telecommunications. In Telecommunications, where high churn rates continue to be a
growing concern, Market Basket Analysis can be used to determine what services are being
utilized and what packages customers are purchasing. They can use that knowledge to direct
marketing efforts at customers who are more likely to follow the same path.

For instance, Telecommunications these days is also offering TV and Internet. Creating
bundles for purchases can be determined from an analysis of what customers purchase,
thereby giving the company an idea of how to price the bundles. This analysis might also
lead to determining the capacity requirements.

• Banks. In Financial (banking for instance), Market Basket Analysis can be used to analyze
credit card purchases of customers to build profiles for fraud detection purposes and cross-
selling opportunities.
• Insurance. In Insurance, Market Basket Analysis can be used to build profiles to detect
medical insurance claim fraud. By building profiles of claims, you are able to then use the
profiles to determine if more than one claim belongs to a particular claimant within a specified
period of time.
• Medical. In Healthcare or Medical, Market Basket Analysis can be used for comorbid
conditions and symptom analysis, with which a profile of illness can be better identified. It
can also be used to reveal biologically relevant associations between different genes or
between environmental effects and gene expression.

Apriori Algorithm:

Apriori is an algorithm used for Association Rule Mining. It searches for a series of frequent sets of items in the datasets. It builds on associations and correlations between the itemsets. It is the algorithm behind the "You may also like" suggestions that you commonly see on recommendation platforms.

What is Associate Rule Mining?

ARM( Associate Rule Mining) is one of the important techniques in data science. In ARM, the
frequency of patterns and associations in the dataset is identified among the item sets then used to
predict the next relevant item in the set. This ARM technique is mostly used in business decisions
according to customer purchases.

Example: In Walmart, if Ashok buys Milk and Bread, the chances of him buying Butter are
predicted by the Associate Rule Mining technique.

Some definitions need to be remembered

Before we start, go through some terms which are explained below.

• SUPPORT_COUNT — the number of transactions in which the itemset appears.

• MINIMUM_SUPPORT_COUNT — the minimum required frequency of an itemset in the dataset.

• CANDIDATE_SET — C(k), the candidate k-itemsets together with the support count of each in the dataset.

• ITEM_SET — L(k), obtained by comparing the support count of each itemset in the candidate set with the minimum support count and filtering out the infrequent itemsets.

• SUPPORT — the percentage of transactions in the database that follow the rule:
Support(A->B) = Support_count(A ∪ B), usually expressed as a fraction of the total number of transactions.

• CONFIDENCE — the percentage of customers who bought A that also bought B:
Confidence(A->B) = [Support_count(A ∪ B) / Support_count(A)] * 100

Apriori algorithm is given by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a
dataset for boolean association rule. Name of the algorithm is Apriori because it uses prior
knowledge of frequent itemset properties. We apply an iterative approach or level-wise search
where k-frequent itemsets are used to find k+1 itemsets.

To improve the efficiency of level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.

Apriori Property –
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure: Apriori assumes that all subsets of a frequent itemset must be frequent (the Apriori property), and if an itemset is infrequent, all its supersets will be infrequent.

In the worked example that follows, the minimum support count is 2 and the minimum confidence is 60%.

Step-1: K=1
(I) Create a table containing support count of each item present in dataset – Called C1(candidate
set)
(II) Compare each candidate set item's support count with the minimum support count (here min_support = 2; if the support_count of a candidate set item is less than min_support then remove that item). This gives us itemset L1.

Step-2: K=2

• Generate candidate set C2 using L1 (this is called join step). Condition of joining Lk-1 and

Lk-1 is that it should have (K-2) elements in common.

• Check whether all subsets of an itemset are frequent or not, and if not frequent remove that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}; they are frequent. Check for each itemset.)
• Now find support count of these itemsets by searching in dataset.

II) compare candidate (C2) support count with minimum support count(here min_support=2 if
support_count of candidate set item is less than min_support then remove those items) this gives us
itemset L2.
Step-3:

• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.

• Check whether all subsets of these itemsets are frequent or not and, if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly check every itemset.)

• Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate (C3) support count with the minimum support count (here min_support = 2; if the support_count of a candidate set item is less than min_support, remove that item). This gives us itemset L3.

Step-4:

• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first 2 elements (items) should match.

• Check whether all subsets of these itemsets are frequent or not. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.

• We stop here because no further frequent itemsets are found.

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.

Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A->B)=Support_count(A∪B)/Support_count(A)

So here, by taking one of the frequent itemsets as an example, we will show the rule generation.
Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%

So if minimum confidence is 50%, then first 3 rules can be considered as strong association rules.
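In practice these steps do not have to be coded by hand: the arules package in R provides an implementation of Apriori. The sketch below is hedged; the nine transactions are only an assumed reconstruction consistent with the support counts used above, and the thresholds are passed as fractions rather than percentages:

# Mining association rules with the arules implementation of Apriori
library(arules)

baskets <- list(                      # assumed transactions, illustration only
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
trans <- as(baskets, "transactions")

# Minimum support count of 2 out of 9 transactions, minimum confidence 60%
rules <- apriori(trans, parameter = list(support = 2/9, confidence = 0.6))
inspect(rules)                        # lists each rule with its support and confidence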

Limitations of Apriori Algorithm

The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, a low minimum support, or large itemsets; in other words, it is not an efficient approach for very large datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which in turn have to be tested and accumulated. Furthermore, to detect a frequent pattern of size 100, i.e. {v1, v2, ..., v100}, it has to generate on the order of 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. The algorithm therefore checks many candidate itemsets and scans the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.

Handling large data sets in main memory

1. Allocate More Memory

Some machine learning tools or libraries may be limited by a default memory configuration.

Check if you can re-configure your tool or library to allocate more memory.
A good example is Weka, where you can increase the memory as a parameter when starting the
application.

2. Work with a Smaller Sample


Are you sure you need to work with all of the data?

Take a random sample of your data, such as the first 1,000 or 100,000 rows. Use this smaller sample
to work through your problem before fitting a final model on all of your data (using progressive
data loading techniques).
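For instance, a hedged R sketch of both ideas (the file name and the 1% sampling fraction are assumptions for illustration):

# Option 1: load only the first 100,000 rows of a (hypothetical) large CSV
subset_rows <- read.csv("big_file.csv", nrows = 100000)

# Option 2: keep a reproducible 1% random sample of a data frame already in memory
set.seed(42)
sample_rows <- subset_rows[sample(nrow(subset_rows), size = 0.01 * nrow(subset_rows)), ]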

3. Use a Computer with More Memory


Do you have to work on your computer?

Perhaps you can get access to a much larger computer with an order of magnitude more memory.

For example, a good option is to rent compute time on a cloud service like Amazon Web Services
that offers machines with tens of gigabytes of RAM for less than a US dollar per hour.

4. Change the Data Format


Is your data stored in raw ASCII text, like a CSV file?

Perhaps you can speed up data loading and use less memory by using another data format. A good
example is a binary format like GRIB, NetCDF, or HDF.

There are many command line tools that you can use to transform one data format into another that
do not require the entire dataset to be loaded into memory.

Using another format may allow you to store the data in a more compact form that saves memory,
such as 2-byte integers, or 4-byte floats.

5. Stream Data or Use Progressive Loading


Does all of the data need to be in memory at the same time?

Perhaps you can use code or a library to stream or progressively load data as-needed into memory
for training.

This may require algorithms that can learn iteratively using optimization techniques such as
stochastic gradient descent, instead of algorithms that require all data in memory to perform matrix
operations such as some implementations of linear and logistic regression.
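A hedged base-R sketch of progressive loading (the file name, chunk size, and per-chunk processing are assumptions) reads a large CSV through an open connection in fixed-size chunks instead of loading it all at once:

# Progressive (chunked) loading of a large CSV through a file connection
con <- file("big_file.csv", open = "r")
col_names <- as.character(unlist(read.csv(con, nrows = 1, header = FALSE)))

chunk_size <- 10000
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
    error = function(e) NULL          # read.csv errors once the file is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... update running totals or an incremental model on 'chunk' here ...
}
close(con)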
6. Use a Relational Database
Relational databases provide a standard way of storing and accessing very large datasets.

Internally, the data is stored on disk can be progressively loaded in batches and can be queried using
a standard query language (SQL).

Free open source database tools like MySQL or Postgres can be used and most (all?) programming
languages and many machine learning tools can connect directly to relational databases. You can
also use a lightweight approach, such as SQLite.
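For example, a lightweight SQLite database can be queried from R in batches through the DBI and RSQLite packages; the database file and table names below are assumptions for illustration:

# Query a large table in batches from SQLite instead of loading it all at once
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "analytics.db")    # hypothetical database file
res <- dbSendQuery(con, "SELECT * FROM transactions")  # hypothetical table

while (!dbHasCompleted(res)) {
  batch <- dbFetch(res, n = 10000)    # fetch the next 10,000 rows
  # ... process 'batch' here ...
}

dbClearResult(res)
dbDisconnect(con)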

7. Use a Big Data Platform


In some cases, you may need to resort to a big data platform.

That is, a platform designed for handling very large datasets, that allows you to use data transforms
and machine learning algorithms on top of it.

Two good examples are Hadoop with the Mahout machine learning library and Spark with the MLlib library.

Limited pass algorithm:

• The algorithms seen so far (A-Priori, PCY, Multistage, Multihash) compute the exact collection of frequent itemsets of size k in k passes.

• In many applications it is not essential to discover every frequent itemset; it is sufficient to discover most of them.

• Next: algorithms that find all or most frequent itemsets using at most 2 passes over the data – Sampling, SON, and Toivonen's Algorithm.

The Frequent Items Problem

• The Frequent Items Problem (also known as Heavy Hitters): given a stream of N items, find those that occur most frequently.

• E.g. find all items occurring more than 1% of the time.

• The problem is formally "hard" in small space, so we allow approximation.

• Find all items with count ≥ φN and none with count < (φ − ε)N, where the error parameter ε satisfies 0 < ε < 1 (e.g. ε = 1/1000). A related problem is to estimate each item's frequency with error ±εN.

Why Frequent Items?

• A natural question on streaming data – track bandwidth hogs, popular destinations, etc.

• The subject of much streaming research – scores of papers on the subject.

• A core streaming problem – many streaming problems are connected to frequent items (itemset mining, entropy estimation, compressed sensing).

• Many practical applications – search log mining, network data analysis, DBMS optimization.

Frequent Itemset Problem

• Given a stream of items, the problem is simply to find those items which occur most
frequently.

• Formalized as finding all items whose frequency exceeds a specified fraction of the total
number of items.

• Variations arise when the items are given weights, and further when these weights can also
be negative.

• The problem is important both in itself and as a subroutine in more advanced computations.

• For example, it can help in routing decisions, in-network caching, etc. (if items represent packets on the Internet).

• Can help in finding popular terms if items represent queries made to an Internet search
engine.

• Mining frequent itemsets inherently builds on this problem as a basic building block.
• Algorithms for the problem have been applied by large corporations: AT&T and Google.

Solutions
Two main classes of algorithms :
• Counter-based Algorithms

• Sketch Algorithms

Other Solutions :
• Quantiles : based on various notions of randomly sampling items from the input, and
of summarizing the distribution of items.
• Less effective and have attracted less interest.

Counter based Algorithms

• Track a subset of items from the input and monitor their counts.

• Decide for each new arrival whether to store or not.

• Decide what count to associate with it.

• Cannot handle negative weights.

Frequent Algorithm
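The details of this algorithm are not reproduced in the text, so the following is only a hedged sketch of the classic counter-based Frequent (Misra-Gries) algorithm, which keeps at most k-1 counters and returns candidates for the items occurring more than N/k times in a stream of N items; the stream contents are hypothetical:

# Misra-Gries "Frequent" algorithm: at most k-1 counters for a stream
frequent_items <- function(stream, k) {
  counters <- integer(0)                  # named vector of counts
  for (item in stream) {
    if (item %in% names(counters)) {
      counters[item] <- counters[item] + 1
    } else if (length(counters) < k - 1) {
      counters[item] <- 1                 # start a new counter for this item
    } else {
      counters <- counters - 1            # decrement every counter
      counters <- counters[counters > 0]  # drop counters that reach zero
    }
  }
  counters   # candidate items occurring more than N/k times
}

stream <- c("a", "b", "a", "c", "a", "b", "a", "d", "a")  # hypothetical stream
frequent_items(stream, k = 3)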
Clustering Techniques

Clustering is a type of unsupervised learning method in machine learning. In an unsupervised learning method, the inferences are drawn from data sets which do not contain a labelled output variable. It is an exploratory data analysis technique that allows us to analyze multivariate data sets.
Clustering is a task of dividing the data sets into a certain number of clusters in such a manner that
the data points belonging to a cluster have similar characteristics. Clusters are nothing but the
grouping of data points such that the distance between the data points within the clusters is minimal.

In other words, the clusters are regions where the density of similar data points is high. It is
generally used for the analysis of the data set, to find insightful data among huge data sets and draw
inferences from it. Generally, the clusters are seen in a spherical shape, but it is not necessary as the
clusters can be of any shape.

The type of algorithm we use decides how the clusters will be created. The inferences that need to be drawn from the data sets also depend upon the user, as there is no single criterion for good clustering.

What are the types of Clustering Methods?


Clustering itself can be categorized into two types viz. Hard Clustering and Soft Clustering. In hard
clustering, one data point can belong to one cluster only. But in soft clustering, the output provided
is a probability likelihood of a data point belonging to each of the pre-defined numbers of clusters.
Density-Based Clustering

In this method, the clusters are created based upon the density of the data points which are
represented in the data space. The regions that become dense due to the huge number of data points
residing in that region are considered as clusters.

The data points in the sparse region (the region where the data points are very few) are considered as noise or outliers. The clusters created in these methods can be of arbitrary shape. Following are examples of density-based clustering algorithms:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN groups data points together based on a distance metric and a criterion for a minimum number of data points. It takes two parameters – eps and minimum points (minPts). Eps indicates how close data points should be to each other to be considered neighbors, and the minimum points criterion must be met for a region to be considered dense.
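A hedged sketch using the dbscan package (the eps and minPts values are arbitrary choices for the built-in iris measurements, not tuned recommendations):

# Density-based clustering with DBSCAN (dbscan package)
library(dbscan)

x  <- as.matrix(iris[, 1:4])         # numeric features only
db <- dbscan(x, eps = 0.5, minPts = 5)

table(db$cluster)                    # cluster 0 holds the noise/outlier points
pairs(x, col = db$cluster + 1L)      # quick visual check of the clusters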

OPTICS (Ordering Points to Identify Clustering Structure)

It is similar in process to DBSCAN, but it addresses one of the drawbacks of the former algorithm, i.e. its difficulty in forming clusters from data of varying density. It considers two additional parameters, core distance and reachability distance. Core distance indicates whether the data point being considered is a core point or not by setting a minimum value for it.

Reachability distance is the maximum of the core distance and the value of the distance metric used for calculating the distance between two data points. One thing to note about reachability distance is that its value is not defined if the other data point is not a core point.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)

HDBSCAN is a density-based clustering method that extends the DBSCAN methodology by converting it to a hierarchical clustering algorithm.

Hierarchical Clustering

Hierarchical Clustering groups (Agglomerative, also called the Bottom-Up approach) or divides (Divisive, also called the Top-Down approach) the clusters based on distance metrics. In Agglomerative clustering, each data point initially acts as its own cluster, and the clusters are then merged one by one.
Divisive is the opposite of Agglomerative: it starts with all the points in one cluster and divides them to create more clusters. These algorithms create a distance matrix of all the existing clusters and perform the linkage between the clusters depending on the linkage criterion. The clustering of the data points is represented using a dendrogram. There are different types of linkages: –

o Single Linkage: – In single linkage the distance between the two clusters is the shortest
distance between points in those two clusters.

o Complete Linkage: – In complete linkage, the distance between the two clusters is the farthest
distance between points in those two clusters.

o Average Linkage: – In average linkage the distance between the two clusters is the average
distance of every point in the cluster with every point in another cluster.
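A hedged base-R sketch of agglomerative clustering on the numeric iris columns (complete linkage and a cut into 3 clusters are arbitrary choices for illustration):

# Agglomerative hierarchical clustering with base R
x  <- iris[, 1:4]                      # numeric features only
d  <- dist(x)                          # Euclidean distance matrix
hc <- hclust(d, method = "complete")   # "single" and "average" linkage also work

plot(hc)                               # dendrogram of the merges
groups <- cutree(hc, k = 3)            # cut the tree into 3 clusters
table(groups)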

Fuzzy Clustering

In fuzzy clustering, the assignment of the data points in any of the clusters is not decisive. Here, one
data point can belong to more than one cluster. It provides the outcome as the probability of the data
point belonging to each of the clusters. One of the algorithms used in fuzzy clustering is Fuzzy c-
means clustering.

This algorithm is similar in process to the K-Means clustering and it differs in the parameters that
are involved in the computation like fuzzifier and membership values.

Partitioning Clustering

This method is one of the most popular choices for analysts to create clusters. In partitioning
clustering, the clusters are partitioned based upon the characteristics of the data points. We need to
specify the number of clusters to be created for this clustering method. These clustering algorithms
follow an iterative process to reassign the data points between clusters based upon the distance. The
algorithms that fall into this category are as follows: –

o K-Means Clustering: – K-Means clustering is one of the most widely used algorithms. It
partitions the data points into k clusters based upon the distance metric used for the clustering. The
value of ‘k’ is to be defined by the user. The distance is calculated between the data points and the
centroids of the clusters.

The data point which is closest to the centroid of the cluster gets assigned to that cluster. After an
iteration, it computes the centroids of those clusters again and the process continues until a pre-
defined number of iterations are completed or when the centroids of the clusters do not change after
an iteration.

It is a very computationally expensive algorithm, as it computes the distance of every data point to the centroids of all the clusters at each iteration. This makes it difficult to apply to huge data sets.
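A hedged base-R sketch of K-means on the numeric iris columns (k = 3, the seed, and nstart are arbitrary choices for illustration):

# K-means clustering with base R
set.seed(123)                               # k-means starts from random centroids
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

km$centers                                  # final cluster centroids
km$cluster                                  # cluster assignment of each data point
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster)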

PAM (Partitioning Around Medoids)

This algorithm is also called as k-medoid algorithm. It is also similar in process to the K-means
clustering algorithm with the difference being in the assignment of the center of the cluster. In PAM,
the medoid of the cluster has to be an input data point while this is not true for K-means clustering
as the average of all the data points in a cluster may not belong to an input data point.
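A hedged sketch using the cluster package, which provides pam() (k = 3 is again an arbitrary choice):

# k-medoid clustering with PAM (cluster package)
library(cluster)

pm <- pam(iris[, 1:4], k = 3)
pm$medoids           # the medoids are actual rows of the input data
table(pm$clustering) # cluster sizes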

o CLARA (Clustering Large Applications): – CLARA is an extension to the PAM algorithm


where the computation time has been reduced to make it perform better for large data sets. To
accomplish this, it selects a certain portion of data arbitrarily among the whole data set as a
representative of the actual data. It applies the PAM algorithm to multiple samples of the data and
chooses the best clusters from a number of iterations.

Grid-Based Clustering

In grid-based clustering, the data set is represented in a grid structure which comprises grids (also called cells). The overall approach in the algorithms of this method differs from the rest of the algorithms.

They are more concerned with the value space surrounding the data points rather than the data points themselves. One of the greatest advantages of these algorithms is their reduced computational complexity, which makes them appropriate for dealing with humongous data sets.

After partitioning the data sets into cells, it computes the density of the cells which helps in
identifying the clusters. A few algorithms based on grid-based clustering are as follows: –

o STING (Statistical Information Grid Approach): – In STING, the data set is divided
recursively in a hierarchical manner. Each cell is further sub-divided into a different number of
cells. It captures the statistical measures of the cells which helps in answering the queries in a small
amount of time.

o WaveCluster: – In this algorithm, the data space is represented in form of wavelets. The data
space composes an n-dimensional signal which helps in identifying the clusters. The parts of the
signal with a lower frequency and high amplitude indicate that the data points are concentrated.
These regions are identified as clusters by the algorithm. The parts of the signal where the frequency is high represent the boundaries of the clusters.

o CLIQUE (Clustering in Quest): – CLIQUE is a combination of density-based and grid-based clustering algorithms. It partitions the data space and identifies the sub-spaces using the Apriori principle. It identifies the clusters by calculating the densities of the cells.

o PROCLUS:- The PROCLUS algorithm uses a top-down approach which creates clusters that
are partitions of the data sets, where each data point is assigned to only one cluster which is highly
suitable for customer segmentation and trend analysis where a partition of points is required.

Frequent pattern based Clustering Method

Typical examples of frequent pattern-based cluster analysis include the clustering of text documents that contain thousands of distinct keywords, and the analysis of microarray data that contain tens of thousands of measured values or "features." Discovering clusters in subspaces, or subspace clustering and related clustering paradigms, is a research field where we find many influences from frequent pattern mining. In fact, since the first algorithms for subspace clustering were based on frequent pattern mining algorithms, it is fair to say that frequent pattern mining was at the cradle of subspace clustering; yet it quickly developed into an independent research field.

What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.

• Motivation: finding inherent regularities in data.

– What products were often purchased together? Beer and diapers?!

– What are the subsequent purchases after buying a PC?

– What kinds of DNA are sensitive to this new drug?

– Can we automatically classify web documents?

• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?

• Freq. pattern: an intrinsic and important property of datasets.

• Foundation for many essential data mining tasks:

– Association, correlation, and causality analysis

– Sequential and structural (e.g., sub-graph) patterns

– Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

– Classification: discriminative, frequent pattern analysis

– Cluster analysis: frequent pattern-based clustering

– Data warehousing: iceberg cube and cube-gradient

– Semantic data compression: fascicles

– Broad applications

Hierarchical Clustering in Non-Euclidean spaces:

Main problem: we use distance measures such as those mentioned at the beginning, so we cannot base distances on the locations of points. The problem arises when we need to represent a cluster, because we cannot summarize it with a centroid point.

Example: suppose we use edit distance between strings; there is no string that represents their "average."

Solution: we pick one of the points in the cluster itself to represent the cluster. This point should be selected to be as close as possible to all the points in the cluster, so that it represents some kind of "center." We call this representative point the clustroid.

Selecting the clustroid.

There are a few ways of selecting the clustroid point. Select as the clustroid the point that minimizes:

1. The sum of the distances to the other points in the cluster.

2. The maximum distance to another point in the cluster.

3. The sum of the squares of the distances to the other points in the cluster.
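A hedged base-R sketch that applies criterion 1 to a small, hypothetical cluster of strings under edit distance (adist() computes edit distances in base R):

# Pick the clustroid of a cluster of strings under edit (Levenshtein) distance
cluster_points <- c("abcd", "abce", "abed", "aacd")  # hypothetical cluster members
d <- adist(cluster_points)                           # pairwise edit-distance matrix

row_sums  <- rowSums(d)                  # criterion 1: sum of distances to the others
clustroid <- cluster_points[which.min(row_sums)]
clustroid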

Stopping criterion:

• Use stopping criteria that do not directly rely on centroids, except the radius, which is also valid for non-Euclidean spaces.

• So all the criteria may be used for non-Euclidean spaces as well.

Clustering for streams and parallelism.


Research on parallel data stream clustering algorithms is often based on grid and density methods. Such an algorithm adopts a density threshold function to deal with noise points and inspects and removes them periodically. It can also find clusters of arbitrary shape in large-scale data flows in real time.

In recent years, the management and processing of so-called data streams has become a topic of
active research in several fields of computer science such as, e.g., distributed systems, database
systems, and data mining. A data stream can roughly be thought of as a transient, continuously
increasing sequence of time-stamped data. In this paper, we consider the problem of clustering
parallel streams of real-valued data, that is to say, continuously evolving time series. In other words,
we are interested in grouping data streams the evolution over time of which is similar in a specific
sense. In order to maintain an up-to-date clustering structure, it is necessary to analyze the incoming
data in an online manner, tolerating not more than a constant time delay.
Unit-V

Frame Works and Visualization

MapReduce
MapReduce is a framework using which we can write applications to process huge amounts of data,
in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes
a set of data and converts it into another set of data, where individual elements are broken down
into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as an input
and combines those data tuples into a smaller set of tuples. As the sequence of the name
MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.
The Algorithm

• Generally, the MapReduce paradigm is based on sending the computation to where the data resides!

• MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce
stage.
• Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
• Reduce stage − This stage is the combination of the Shuffle stage and the Reduce
stage. The Reducer’s job is to process the data that comes from the mapper. After
processing, it produces a new set of output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
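As a purely conceptual illustration of the map and reduce steps (plain R, not Hadoop code), the hedged sketch below counts words across a few input lines by first mapping each line to (word, 1) pairs and then reducing the pairs by key:

# Conceptual map/reduce word count in plain R (not Hadoop itself)
lines <- c("big data is big", "data analytics on big data")   # hypothetical input

# Map: each line becomes a set of (word, 1) pairs
mapped <- unlist(lapply(lines, function(line) {
  words <- strsplit(line, " ")[[1]]
  setNames(rep(1L, length(words)), words)
}))

# Shuffle + Reduce: group the pairs by key (word) and sum the counts
reduced <- tapply(mapped, names(mapped), sum)
reduced   # named vector of total counts per word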

Hadoop
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

Hadoop is an open source, Java based framework used for storing and processing big data. The data
is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables
concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella,
Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its
nodes. The framework is managed by Apache Software Foundation and is licensed under the
Apache License 2.0.

For years, while the processing power of application servers has been increasing manifold,
databases have lagged behind due to their limited capacity and speed. However, today, as many
applications are generating big data to be processed, Hadoop plays a significant role in providing a
much-needed makeover to the database world.
The 4 Modules of Hadoop

Hadoop is made up of "modules", each of which carries out a particular task essential for a
computer system designed for big data analytics.
1. Distributed File-System
The most important two are the Distributed File System, which allows data to be stored in an easily
accessible format, across a large number of linked storage devices, and the MapReduce - which
provides the basic tools for poking around in the data.
(A "file system" is the method used by a computer to store data, so it can be found and used.
Normally this is determined by the computer's operating system, however a Hadoop system uses its
own file system which sits "above" the file system of the host computer - meaning it can be
accessed using any computer running any supported OS).
2. MapReduce
MapReduce is named after the two basic operations this module carries out - reading data from the
database, putting it into a format suitable for analysis (map), and performing mathematical
operations i.e counting the number of males aged 30+ in a customer database (reduce).
3. Hadoop Common
The other module is Hadoop Common, which provides the tools (in Java) needed for the user's
computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.
4. YARN
The final module is YARN, which manages resources of the systems storing the data and running
the analysis.
Various other procedures, libraries or features have come to be considered part of the Hadoop "framework" over recent years, but Hadoop Distributed File System, Hadoop MapReduce, Hadoop Common and Hadoop YARN are the principal four.
How Hadoop Came About
Development of Hadoop began when forward-thinking software engineers realised that it was
quickly becoming useful for anybody to be able to store and analyze datasets far larger than can
practically be stored and accessed on one physical storage device (such as a hard disk).
This is partly because as physical storage devices become bigger it takes longer for the component
that reads the data from the disk (which in a hard disk, would be the "head") to move to a specified
segment. Instead, many smaller devices working in parallel are more efficient than one large one.
It was released in 2005 by the Apache Software Foundation, a non-profit organization which
produces open source software which powers much of the Internet behind the scenes. And if you're
wondering where the odd name came from, it was the name given to a toy elephant belonging to the
son of one of the original creators!
The Usage of Hadoop
The flexible nature of a Hadoop system means companies can add to or modify their data system as
their needs change, using cheap and readily-available parts from any IT vendor.
Today, it is the most widely used system for providing data storage and processing across
"commodity" hardware - relatively inexpensive, off-the-shelf systems linked together, as opposed to
expensive, bespoke systems custom-made for the job in hand. In fact it is claimed that more than
half of the companies in the Fortune 500 make use of it.
Just about all of the big online names use it, and as anyone is free to alter it for their own purposes,
modifications made to the software by expert engineers at, for example, Amazon and Google, are
fed back to the development community, where they are often used to improve the "official"
product. This form of collaborative development between volunteer and commercial users is a key
feature of open source software.
In its "raw" state - using the basic modules supplied here http://hadoop.apache.org/ by Apache, it
can be very complex, even for IT professionals - which is why various commercial versions have
been developed such as Cloudera which simplify the task of installing and running a Hadoop
system, as well as offering training and support services.
So that, in a (fairly large) nutshell, is Hadoop. Thanks to the flexible nature of the system,
companies can expand and adjust their data analysis operations as their business expands. And the
support and enthusiasm of the open source community behind it has led to great strides towards
making big data analysis more accessible for everyone.
Pig:

Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Distributed File System.

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a tool/platform which is used to analyze larger
sets of data representing them as data flows. Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own functions for
reading, writing, and processing data.
To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language. All
these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component known
as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into MapReduce
jobs.

Why Do We Need Apache Pig?


Programmers who are not very good at Java normally used to struggle when working with Hadoop, especially while performing MapReduce tasks. Apache Pig is a boon for all such programmers.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type
complex codes in Java.
Apache Pig uses a multi-query approach, thereby reducing the length of code. For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig. Ultimately, Apache Pig reduces the development time by almost 16 times.

• Pig Latin is SQL-like language and it is easy to learn Apache Pig when you are familiar
with SQL.
• Apache Pig provides many built-in operators to support data operations like joins, filters,
ordering, etc. In addition, it also provides nested data types like tuples, bags, and maps that
are missing from MapReduce.

Features of Pig
Apache Pig comes with the following features −
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script if
you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.
• Extensibility − Using the existing operators, users can develop their own functions to read,
process, and write data.
• UDF’s − Pig provides the facility to create User-defined Functions in other programming
languages such as Java and invoke or embed them in Pig Scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well
as unstructured. It stores the results in HDFS.
Apache Pig Vs MapReduce

Apache Pig Vs SQL


Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.

Architecture of Hive
The following component diagram depicts the architecture of Hive:

Hbase
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an
open-source project and is horizontally scalable.

HBase is a data model that is similar to Google’s big table designed to provide quick random access
to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop File
System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the
data in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides
read and write access.

HBase and HDFS


MapR

MapR was a business software company headquartered in Santa Clara, California. MapR software
provides access to a variety of data sources from a single computer cluster, including big data
workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model
database management system, and event stream processing, combining analytics in real-time with
operational applications. Its technology runs on both commodity hardware and public cloud
computing services.

Sharding

Sharding is the process of breaking up large tables into smaller chunks called shards that are spread across multiple servers. A shard is essentially a horizontal data partition that contains a subset of the total data set, and hence is responsible for serving a portion of the overall workload.
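As a hedged conceptual illustration (not any particular database's sharding scheme), the R sketch below assigns the rows of a hypothetical customer table to shards by taking the key modulo the number of shards:

# Conceptual hash-based shard assignment for a table of customer rows
customers <- data.frame(customer_id = 1:10,
                        name = paste0("cust_", 1:10))   # hypothetical table

n_shards <- 4
# Simple hash: key modulo the number of shards (real systems use stronger hash functions)
customers$shard <- customers$customer_id %% n_shards

split(customers, customers$shard)   # each element is the subset stored on one shard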

Why Shard a Database?

Business applications that rely on a monolithic RDBMS hit bottlenecks as they grow. With limited
CPU, storage capacity, and memory, query throughput and response times are bound to suffer.
When it comes to adding resources to support database operations, vertical scaling (aka scaling up)
has its own set of limits and eventually reaches a point of diminishing returns.

On the other hand, horizontally partitioning a table means more compute capacity to serve incoming
queries, and therefore you end up with faster query response times and index builds. By
continuously balancing the load and data set over additional nodes, sharding also enables usage of
additional capacity. Moreover, a network of smaller, cheaper servers may be more cost effective in
the long term than maintaining one big server.
Besides resolving scaling challenges, sharding can potentially alleviate the impact of unplanned
outages. During downtime, all the data in an unsharded database is inaccessible, which can be
disruptive or downright disastrous. When done right, sharding can ensure high availability: even if
one or two nodes hosting a few shards are down, the rest of the database is still available for
read/write operations as long as the other nodes (hosting the remaining shards) run in different
failure domains. Overall, sharding can increase total cluster storage capacity, speed up processing,
and offer higher availability at a lower cost than vertical scaling.
NOSQL

A NoSQL database provides a mechanism for storing and retrieving data that is modeled in ways other than those used in relational databases and RDBMSs. Such databases have existed since the late 1960s, but were not called NoSQL until a surge in popularity during the early 2000s, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.

NoSQL databases are increasingly used in big data and real-time web applications. Hadoop enables
certain types of NoSQL distributed databases (such as HBase), which allow data to be spread across
thousands of servers with little reduction in performance. Modern non-relational and cloud
databases now make up 70-percent of data sources for analytics. And it's becoming common for
companies to gain a deeper understanding of their customers by querying NoSQL data and
combining it with data, including unstructured data, residing in Salesforce and Web transaction data
in Hadoop.
NoSQL is an approach to database design that can accommodate a wide variety of data models,
including key-value, document, columnar and graph formats. NoSQL, which stands for “not only
SQL,” is an alternative to traditional relational databases in which data is placed in tables and data
schema is carefully designed before the database is built. NoSQL databases are especially useful for
working with large sets of distributed data.

The NoSQL term can be applied to some databases that predated the relational database
management system (RDBMS), but it more commonly refers to the databases built in the early
2000s for the purpose of large-scale database clustering in cloud and web applications. In these
applications, requirements for performance and scalability outweighed the need for the immediate,
rigid data consistency that the RDBMS provided to transactional enterprise applications.
NoSQL helps deal with the volume, variety, and velocity requirements of big data:
• Volume: Maintaining the ACID properties (Atomicity, Consistency, Isolation, Durability) is expensive and not always necessary. Sometimes, we can deal with minor inconsistencies in our results. We thus want to be able to partition our data across multiple sites.
• Variety: One single fixed data model makes it harder to incorporate varying data.
Sometimes, when we pull from external sources, we don’t know the schema! Furthermore,
changing a schema in a relational database can be expensive.
• Velocity: Storing everything durable to a disk all the time can be prohibitively expensive.
Sometimes it’s okay if we have a low probability of losing data. Memory is much cheaper
now, and much faster than always going to disk.
There is no single accepted definition of NoSQL, but here are its main characteristics:
• It has quite a flexible schema, unlike the relational model. Different rows may have
different attributes or structure. The database often has no understanding of the schema. It is
up to the applications to maintain consistency in the schema including any denormalization.
• It also is often better at handling really big data tasks. This is because NoSQL databases
follow the BASE (Basically Available, Soft state, Eventual consistency) approach instead of
ACID.
• In NoSQL, consistency is only guaranteed after some period of time when writes stop. This
means it is possible that queries will not see the latest data. This is commonly implemented
by storing data in memory and then lazily sending it to other machines.
• Finally, there is this notion known as the CAP theorem — pick 2 out of 3 things:
Consistency, Availability, and Partition tolerance. ACID databases are usually CP systems,
while BASE databases are usually AP. This distinction is blurry and often systems can be
reconfigured to change these tradeoffs.
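As a hedged illustration of the flexible, document-style schema mentioned above (using the jsonlite package; all field names are made up), two documents in the same collection can carry different attributes:

# Two documents with different attributes, as a document store would allow
library(jsonlite)

doc1 <- list(id = 1, name = "Ashok", purchases = c("Milk", "Bread"))
doc2 <- list(id = 2, name = "Priya", email = "priya@example.com",
             loyalty_points = 120)          # extra fields, no fixed schema required

toJSON(doc1, auto_unbox = TRUE, pretty = TRUE)
toJSON(doc2, auto_unbox = TRUE, pretty = TRUE)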
S3

By using Amazon S3 analytics Storage Class Analysis you can analyze storage access patterns to
help you decide when to transition the right data to the right storage class. This new Amazon S3
analytics feature observes data access patterns to help you determine when to transition less
frequently accessed STANDARD storage to the STANDARD_IA (IA, for infrequent access)
storage class.

After storage class analysis observes the infrequent access patterns of a filtered set of data over a
period of time, you can use the analysis results to help you improve your lifecycle policies. You can
configure storage class analysis to analyze all the objects in a bucket. Or, you can configure filters
to group objects together for analysis by common prefix (that is, objects that have names that begin
with a common string), by object tags, or by both prefix and tags. You'll most likely find that
filtering by object groups is the best way to benefit from storage class analysis.
You can have multiple storage class analysis filters per bucket, up to 1,000, and will receive a
separate analysis for each filter. Multiple filter configurations allow you analyze specific groups of
objects to improve your lifecycle policies that transition objects to STANDARD_IA.
Storage class analysis provides storage usage visualizations in the Amazon S3 console that are
updated daily. You can also export this daily usage data to an S3 bucket and view them in a
spreadsheet application, or with business intelligence tools, like Amazon QuickSight.
HADOOP Distributed File System
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS
• It is suitable for the distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to easily check the status of
cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.

HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the
namenode software. It is a software that can be run on commodity hardware. The system having the
namenode acts as the master server and it does the following tasks −
• Manages the file system namespace.

• Regulates client’s access to files.

• It also executes file system operations such as renaming, closing, and opening files and
directories.

Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode.
These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.

• They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.

Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be changed as needed in the HDFS configuration.

Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and
automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications
having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network traffic and
increases the throughput.

Visualization: Visual Data Analysis Techniques

Visualization is the first step to make sense of data. To translate and present data and data correlations in a simple way, data analysts use a wide range of techniques such as charts, diagrams, and maps. Choosing the right technique and its setup is often the only way to make data understandable. Vice versa, a poorly selected tactic will not let you unlock the full potential of the data, or may even make it irrelevant.

5 factors that influence data visualization choices:


1. Audience. It’s important to adjust data representation to the specific target audience. For
example, fitness mobile app users who browse through their progress can easily work with
uncomplicated visualizations. On the other hand, if data insights are intended for researchers
and experienced decision-makers who regularly work with data, you can and often have to
go beyond simple charts.
2. Content. The type of data you are dealing with will determine the tactics. For example, if
it’s time-series metrics, you will use line charts to show the dynamics in many cases. To
show the relationship between two elements, scatter plots are often used. In turn, bar charts
work well for comparative analysis.
3. Context. You can use different data visualization approaches and read data depending on the
context. To emphasize a certain figure, for example, significant profit growth, you can use
the shades of one color on the chart and highlight the highest value with the brightest one.
On the contrary, to differentiate elements, you can use contrast colors.
4. Dynamics. There are various types of data, and each type has a different rate of change. For example, financial results can be measured monthly or yearly, while time series and tracking data are changing constantly. Depending on the rate of change, you may consider dynamic (streaming) representation or static visualization techniques in data mining.
5. Purpose. The goal of data visualization affects the way it is implemented. In order to make a
complex analysis, visualizations are compiled into dynamic and controllable dashboards that
work as visual data analysis techniques and tools. However, dashboards are not necessary to
show a single or occasional data insight.

Data visualization techniques


Depending on these factors, you can choose different data visualization techniques and configure
their features. Here are the common types of visualization techniques:

Charts
The easiest way to show the development of one or several data sets is a chart. Charts vary from bar
and line charts that show the relationship between elements over time to pie charts that demonstrate
the components or proportions between the elements of one whole.
Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to show the relationship between these sets and the parameters on the plot. Plots also vary. Scatter and bubble plots are some of the most widely-used visualizations. When it comes to big data, analysts often use more complex box plots that help visualize the relationship between large volumes of data.

Maps
Maps are popular ways to visualize data used in different industries. They allow you to locate elements on relevant objects and areas such as geographical maps, building plans, and website layouts. Among the most popular map visualizations are heat maps, dot distribution maps, and cartograms.
Diagrams and matrices
Diagrams are usually used to demonstrate complex data relationships and links and include various types of data in one visualization. They can be hierarchical, multidimensional, or tree-like.
A matrix is one of the advanced data visualization techniques that helps determine the correlation between multiple constantly updating (streaming) data sets.
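A hedged base-R sketch of a few of these techniques using built-in datasets (the choice of datasets and variables is arbitrary):

# A few common visualization techniques in base R
barplot(table(mtcars$cyl), main = "Cars by cylinder count")        # bar chart
plot(AirPassengers, main = "Monthly airline passengers")           # line chart (time series)
plot(mtcars$wt, mtcars$mpg, main = "Weight vs. fuel efficiency")   # scatter plot
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by cylinder count")  # box plot by group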

Interaction Techniques

Interaction techniques essentially involve data entry and manipulation, and thus place greater
emphasis on input than output. Output is merely used to convey affordances and provide user
feedback. The use of the term input technique further reinforces the central role of input.

Interactive data visualization refers to the use of modern data analysis software that enables users to
directly manipulate and explore graphical representations of data. Data visualization uses visual
aids to help analysts efficiently and effectively understand the significance of data. Interactive data
visualization software improves upon this concept by incorporating interaction tools that facilitate
the modification of the parameters of a data visualization, enabling the user to see more detail,
create new insights, generate compelling questions, and capture the full value of the data.

Interactive Data Visualization Techniques

Deciding what the best interactive data visualization will be for your project depends on your end
goal and the data available. Some common data visualization interactions that will help users
explore their data visualizations include:

• Brushing: Brushing is an interaction in which the mouse controls a paintbrush that directly
changes the color of a plot, either by drawing an outline around points or by using the brush
itself as a pointer. Brushing scatterplots can either be persistent, in which the new
appearance is retained once the brush has been removed, or transient, in which changes only
remain visible while the active plot is enclosed or intersected by the brush. Brushing is
typically used when multiple plots are visible and a linking mechanism exists between the
plots.
• Painting: Painting refers to the use of persistent brushing, followed by subsequent
operations such as touring to compare the groups.
• Identification: Identification, also known as label brushing or mouseover, refers to the
automatic appearance of an identifying label when the cursor hovers over a particular plot
element.
• Scaling: Scaling can be used to change a plot’s aspect ratio, revealing different data features.
Scaling is also commonly used to zoom in on dense regions of a scatter plot.
• Linking: Linking connects selected elements on different plots. One-to-one linking entails
the projection of data on two different plots, in which a point in one plot corresponds to
exactly one point in the other. Elements may also be categorical variables, in which all data
values corresponding to that category are highlighted in all the visible plots. Brushing an
area in one plot will brush all cases in the corresponding category on another plot.
System and application in data visualization

A data visualization application lets you quickly create insightful data visualizations in minutes. It allows users to visualize data using drag & drop, create interactive dashboards, and customize them with a few clicks. Data visualization tools allow anyone to organize and present information intuitively. They enable users to share data visualizations with others. People can create interactive data visualizations to understand data, ask business questions, and find answers quickly.

Introduction to R
R is a command line driven program. The user enters commands at the prompt ( > by default ) and
each command is executed one at a time.
There have been a number of attempts to create a more graphical interface, ranging from code
editors that interact with R, to full-blown GUIs that present the user with menus and dialog boxes.
data import and export in R

# Use readxl package to read xls|xlsx


library("readxl")
my_data <- read_excel("my_file.xlsx")
# Use xlsx package
library("xlsx")
my_data <- read.xlsx("my_file.xlsx")

library("readr")
# Read tab separated values
read_tsv(file.choose())
# Read comma (",") separated values
read_csv(file.choose())
# Read semicolon (";") separated values
read_csv2(file.choose())

# Read tab separated values


read.delim(file.choose())
# Read comma (",") separated values
read.csv(file.choose())
# Read semicolon (";") separated values
read.csv2(file.choose())

Exporting data from R


# Loading mtcars data
data("mtcars")
# Write data to txt file: tab separated values
# sep = "\t"
write.table(mtcars, file = "mtcars.txt", sep = "\t",
row.names = TRUE, col.names = NA)
# Write data to csv files:
# decimal point = "." and value separators = comma (",")
write.csv(mtcars, file = "mtcars.csv")
# Write data to csv files:
# decimal point = comma (",") and value separators = semicolon (";")
write.csv2(mtcars, file = "mtcars2.csv")  # separate file so the first CSV is not overwritten

# Loading mtcars data


data("mtcars")
library("readr")
# Writing mtcars data to a tsv file
write_tsv(mtcars, path = "mtcars.txt")
# Writing mtcars data to a csv file
write_csv(mtcars, path = "mtcars.csv")

library("xlsx")
# Write the first data set in a new workbook
write.xlsx(USArrests, file = "myworkbook.xlsx",
sheetName = "USA-ARRESTS", append = FALSE)
# Add a second data set in a new worksheet
write.xlsx(mtcars, file = "myworkbook.xlsx",
sheetName="MTCARS", append=TRUE)
Attribute and data types in R

There are many types of R-objects. The frequently used ones are −

• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames

The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors.
Vectors
When you want to create vector with more than one element, you should use c() function which
means to combine the elements into a vector.

# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.


print(class(apple))

Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.

# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)

Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the
matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)

Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
Factors
Factors are the r-objects which are created using a vector. It stores the vector along with the distinct
values of the elements in the vector as labels. The labels are always character irrespective of
whether it is numeric or character or Boolean etc. in the input vector. They are useful in statistical
modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))

Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain different modes of data. The first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
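
A short follow-up sketch, inspecting the structure of the BMI data frame created above and accessing a single column with the $ operator:

# Inspect the structure and access individual columns
str(BMI)
BMI$height        # the height column as a numeric vector
summary(BMI$Age)  # quick numeric summary of one column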

Descriptive Statistics in R

dat <- iris # load the iris dataset and rename it dat

Below is a preview of this dataset and its structure:

head(dat) # first 6 observations
str(dat)  # structure of dataset


Minimum and maximum
Minimum and maximum can be found thanks to the min() and max() functions:
min(dat$Sepal.Length)

max(dat$Sepal.Length)

Alternatively, the range() function:

rng <- range(dat$Sepal.Length)
rng

## [1] 4.3 7.9

Range
The range can then be easily computed, as you have guessed, by subtracting the minimum from the
maximum:
max(dat$Sepal.Length) - min(dat$Sepal.Length)

## [1] 3.6
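
Since rng created above already holds the minimum and the maximum, the same value can also be obtained with diff():

diff(rng) # maximum minus minimum

## [1] 3.6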

Mean
The mean can be computed with the mean() function:
mean(dat$Sepal.Length)

## [1] 5.843333

Median
The median can be computed thanks to the median() function:
median(dat$Sepal.Length)

## [1] 5.8
First and third quartile
As the median, the first and third quartiles can be computed thanks to the quantile() function
and by setting the second argument to 0.25 or 0.75:
quantile(dat$Sepal.Length, 0.25) # first quartile

## 25%
## 5.1

quantile(dat$Sepal.Length, 0.75) # third quartile

## 75%
## 6.4

You may have seen that the results above are slightly different from the results you would find if
you computed the first and third quartiles by hand. This is normal: there are several methods to
compute quantiles (the quantile() function in R actually implements 9 different algorithms, selected
via its type argument, with type 7 being the default). However, the method presented here and in the
article "descriptive statistics by hand" is the easiest and most "standard" one. Furthermore, results
do not dramatically change between methods.
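
For illustration, the same first quartile can be requested with a different algorithm via the type argument (the value returned may differ slightly from the default):

quantile(dat$Sepal.Length, 0.25, type = 6) # first quartile, algorithm 6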

Other quantiles
As you have guessed, any quantile can also be computed with the quantile() function. For
instance, the 4th decile or the 98th percentile:
quantile(dat$Sepal.Length, 0.4) # 4th decile

## 40%
## 5.6

quantile(dat$Sepal.Length, 0.98) # 98th percentile

## 98%
## 7.7

Interquartile range
The interquartile range (i.e., the difference between the first and third quartile) can be computed
with the IQR() function:
IQR(dat$Sepal.Length)

## [1] 1.3
or alternatively with the quantile() function again:
quantile(dat$Sepal.Length, 0.75) - quantile(dat$Sepal.Length, 0.25)

## 75%
## 1.3

As mentioned earlier, when possible it is usually recommended to use the shortest piece of code to
arrive at the result. For this reason, the IQR() function is preferred to compute the interquartile
range.

Standard deviation and variance


The standard deviation and the variance are computed with the sd() and var() functions:
sd(dat$Sepal.Length) # standard deviation

## [1] 0.8280661

var(dat$Sepal.Length) # variance

## [1] 0.6856935

The standard deviation and the variance are different depending on whether we compute them for a
sample or a population (see the difference between sample and population). In R, the standard
deviation and the variance are computed as if the data represent a sample (so the denominator is
n − 1, where n is the number of observations). To my knowledge, there is no function by default in R
that computes the standard deviation or variance for a population.
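
If the population versions are needed, they can be obtained by rescaling the sample variance, as in this short sketch:

# Population variance and standard deviation (denominator n instead of n - 1)
n <- length(dat$Sepal.Length)
pop_var <- var(dat$Sepal.Length) * (n - 1) / n
pop_sd <- sqrt(pop_var)
pop_var
pop_sd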
Tip: to compute the standard deviation (or variance) of multiple variables at the same time, use
lapply() with the appropriate statistic as its second argument:
lapply(dat[, 1:4], sd)

## $Sepal.Length
## [1] 0.8280661
##
## $Sepal.Width
## [1] 0.4358663
##
## $Petal.Length
## [1] 1.765298
##
## $Petal.Width
## [1] 0.7622377
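
Alternatively, sapply() returns the same results simplified into a named numeric vector, which is often easier to read:

sapply(dat[, 1:4], sd) # same four values, returned as a named vector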

Exploratory data analysis in R


Exploratory Data Analysis (EDA) is the process of analyzing and visualizing the data to get a
better understanding of it and glean insights from it. There are various steps involved in EDA, but
the following are the common steps that a data analyst can take when performing it:

1. Import the data
2. Clean the data
3. Process the data
4. Visualize the data

Import the data


There are three ways of importing the data into R, shown below. Usually, one would go for
df.raw1 because it seems to be the most convenient way of importing the data. Let's see the
structure of the imported data:

df.raw1 <- read.csv(file ='Pisa scores 2013 - 2015 Data.csv')


str(df.raw1)

There are two problems that we can spot immediately. The last column is a 'factor' and not 'numeric'
as we desire. Secondly, the first column 'Country Name' is encoded differently from the raw
dataset.

Now let’s try the second case scenario


df.raw2 <- read.csv(file ='Pisa scores 2013 - 2015 Data.csv',na.strings = '..')

str(df.raw2)
What about the last scenario?

df.raw <- read.csv(file = 'Pisa scores 2013 - 2015 Data.csv', fileEncoding = "UTF-8-BOM", na.strings = '..')
str(df.raw)

Cleaning and Processing the data


install.packages("tidyverse")
library(tidyverse)

We want to do a few things to clean the dataset:


1. Make sure that each row in the dataset corresponds to ONLY one country: Use spread()
function in tidyverse package
2. Make sure that only useful columns and rows are kept: Use drop_na() and data subsetting
3. Rename the Series Code column for meaningful interpretation: Use rename()
df <- df.raw[1:1161, c(1, 4, 7)] %>%   # select relevant rows and cols
  spread(key = Series.Code, value = X2015..YR2015.) %>%
  rename(Maths = LO.PISA.MAT,
         Maths.F = LO.PISA.MAT.FE,
         Maths.M = LO.PISA.MAT.MA,
         Reading = LO.PISA.REA,
         Reading.F = LO.PISA.REA.FE,
         Reading.M = LO.PISA.REA.MA,
         Science = LO.PISA.SCI,
         Science.F = LO.PISA.SCI.FE,
         Science.M = LO.PISA.SCI.MA) %>%
  drop_na()

Now let's see what the clean data looks like:

view(df)
Visualizing the data
1. Barplot
install.packages("ggplot2")
library(ggplot2)#Ranking of Maths Score by
Countriesggplot(data=df,aes(x=reorder(Country.Name,Maths),y=Maths)) +
geom_bar(stat ='identity',aes(fill=Maths))+
coord_flip() +
theme_grey() +
scale_fill_gradient(name="Maths Score Level")+
labs(title = 'Ranking of Countries by Maths Score',
y='Score',x='Countries')+
geom_hline(yintercept = mean(df$Maths),size = 1, color = 'blue'
2. Boxplot

If we use the dataset above, we will not be able to draw a boxplot. This is because a boxplot needs
only 2 variables, x and y, but in the cleaned data that we have there are many more variables. So we
need to combine those into 2 variables; we name the result df2.

df2 <- df[, c(1, 3, 4, 6, 7, 9, 10)] %>%   # select relevant columns
  pivot_longer(c(2, 3, 4, 5, 6, 7), names_to = 'Score')

view(df2)

ggplot(data = df2, aes(x=Score,y=value, color=Score)) +

geom_boxplot()+
scale_color_brewer(palette="Dark2") +
geom_jitter(shape=16, position=position_jitter(0.2))+
labs(title = 'Did males perform better than females?',
y='Scores',x='Test Type')
3. Correlation Plot

df = df[,c(1,3,4,6,7,9,10)] #select relevant columns

To create a correlation plot, first compute the correlation matrix with cor():

res <- cor(df[, -1]) # -1 here means we look at all columns except the first column
res
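
The matrix itself can then be visualized. Below is a minimal sketch using the corrplot package; this package is an assumption (it is not introduced elsewhere in this material) and must be installed first:

# install.packages("corrplot") # assumed available; install if needed
library(corrplot)
corrplot(res, method = "circle", type = "upper") # plot the correlation matrix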

Analytics for Unstructured Data


Unstructured data analysis is the process of using data analytics tools to automatically organize,
structure and get value from unstructured data (information that is not organized in a pre-defined
manner). By performing unstructured data analysis, businesses can get valuable insights from their
data to perform informed decisions.

The vast majority of data that businesses deal with these days is unstructured. In fact, IDG Research
estimates that 85% of all data will be unstructured by 2025. There are huge insights to be gathered
from this data, but they’re hard to draw out.
Once you learn how to break down unstructured data and analyze it, however, you can perform
unstructured data analytics automatically, with little need for human input.
Unstructured data has no set framework or regular design. It is usually qualitative data, like images,
audio, and video, but most of it is unstructured text data: documents, social media data, emails,
open-ended surveys, etc.
Unstructured text data goes beyond just numerical values and facts, into thoughts, opinions, and
emotions. It can be analyzed to provide both quantitative and qualitative results: follow market
trends, monitor brand reputation, understand the voice of the customer (VoC), and more.

Unstructured Data Analytics Tools


Unstructured data analytics tools use machine learning to gather and analyze data that has no pre-
defined framework – like human language. Natural language processing (NLP) allows software to
understand and analyze text for deep insights, much as a human would.
Unstructured data analysis can help your business answer more than just the “What is happening?”
of numbers and statistics and go into qualitative results to understand, “Why is this happening?”
MonkeyLearn is a SaaS platform with powerful text analysis tools to pull real-world and real-time
insights from your unstructured information, whether it’s public data from the internet,
communications between your company and your customers, or almost any other source.
Among the most common and most useful tools for unstructured data analysis are:
• Sentiment analysis to automatically classify text by sentiment (positive, negative, neutral)
and read for the opinion and emotion of the writer.
• Keyword extraction to pull the most used and most important keywords from text: find
recurring themes and summarize whole pages of text.
• Intent and email classification to understand the intent of a comment or query and
automatically review emails for level of interest.

How to Analyze Unstructured Data


Tips to Analyze Unstructured Data:
1. Start with your end goal in mind
2. Collect unstructured data
3. Clean unstructured data
4. Structure unstructured data
5. Perform unstructured data analytics
6. Visualize your data analytics
7. Draw conclusions

1. Start with your end goal in mind


Are you looking to follow trends in the market or do you just need a number or statistic to assess
sales or growth? Do you need to evaluate open-ended surveys or automatically read and route
customer support tickets? Do you want to do social listening to find out what customers (and the
public at large) are saying about your brand and compare it to your competition?
Start with a solid idea of what you want to accomplish. Text analysis methods, like keyword
extraction, sentiment analysis, and topic classification, allow you to pull opinions and ideas from
text, then organize and analyze them more thoroughly for quantitative and qualitative results, so the
possibilities are vast.

2. Collect unstructured data


Once you’ve decided what you want to accomplish, you need to find your data. Make sure to use
data sources that are relevant to your topic and the goals you set, like customer surveys and online
reviews.
Whatever technique you use, make sure no data is lost. Databases and data warehouses can provide
access to structured data. But “data lakes” – repositories that store data in its raw format – offer
better access to unstructured data and retain all useful information.
Tools like MonkeyLearn allow you to connect directly to Twitter or pull data from other social
media sites, news articles, etc. As data moves fast in our current business climate, you’ll want to
learn how to collect real-time data to stay on top of your brand image.
You can use integrations with programs you may already use, like Google Sheets, Zapier, Zendesk,
Rapidminer, SurveyMonkey, and more. Or use web scraping tools, like ScrapeStorm, Content
Grabber, and Pattern.
You can collect emails, voice recordings, chatbot data, news reports, product reviews – unstructured
data is practically endless.

3. Clean unstructured data


Unstructured text data often comes with repetitive text or irrelevant text and symbols, like email
signatures, URL links, emojis, banner ads, etc. This information is unnecessary to your analysis and
will only skew the results, so it’s important you learn how to clean your data.
You can start with some simple word-processing tasks, like running a spell check, removing
repetitious words, special characters, and URL links, or giving the text a quick read to make sure
words are used correctly.
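
As a rough illustration of this kind of cleaning in R (the comments vector below is made up for the example; the patterns would need to be adapted to your own data):

# Hypothetical raw comments collected from the web
comments <- c("Great product!! https://example.com :)",
              "Terrible support... contact help@example.com")
clean <- tolower(comments)                        # lower-case everything
clean <- gsub("http[^ ]+", "", clean)             # drop URL links
clean <- gsub("[^a-z[:space:]]", " ", clean)      # keep letters and spaces only
clean <- gsub("[[:space:]]+", " ", trimws(clean)) # collapse repeated whitespace
clean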
MonkeyLearn offers several models to save time and make data cleaning easy. The email cleaner
automatically removes signatures, legal clauses, and previous replies from within a thread, so you'll
end up with only the most recent reply. The boilerplate extractor extracts only relevant text from
HTML; you can use it on websites or emails to remove clutter, like templates, navigation bars, ads,
etc. And the opinion units extractor can break sentences or entire pages into individual thoughts or
statements called "opinion units". It can automatically go to work on hundreds of pages of text in a
single go to get your data prepped and ready for analysis.

4. Structure your unstructured data


Text analysis machine learning programs use natural language processing algorithms to break down
unstructured text data. Data preparation techniques like tokenization, part-of-speech tagging,
stemming, and lemmatization effectively transform unstructured text into a format that can be
understood by machines. This is then compared to similarly prepared data in search of patterns and
deviations in order to make interpretations.
This can all be done in just seconds using machine learning tools, like MonkeyLearn.
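
As a hedged sketch of what this preparation can look like in R, the tidytext package (an assumption; it is not used elsewhere in this material) can tokenize text and strip stop words in a few lines:

# install.packages(c("dplyr", "tidytext")) # assumed available
library(dplyr)
library(tidytext)

# Hypothetical mini-corpus of two customer comments
docs <- tibble(id = 1:2,
               text = c("The delivery was fast and the packaging was great",
                        "Customer support never answered my emails"))

tokens <- docs %>%
  unnest_tokens(word, text) %>%          # tokenization: one word per row
  anti_join(stop_words, by = "word")     # remove common English stop words

count(tokens, word, sort = TRUE)         # word frequencies, ready for analysis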

5. Analyze your unstructured data


Once the data is structured, you're ready for analysis. Depending on your goals, you can calculate
whatever metrics you need. SaaS tools allow you to pick and choose from many different extraction
and classification techniques and use them in concert to get a view of the big picture or drill down
into minute details.
Maybe you’re following a new product launch or marketing campaign and you need to know how
customers feel about it. You can extract data from social media posts or online reviews relating only
to the subject you need, perform sentiment analysis on them, and follow the sentiment over time.
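
As a toy illustration of the idea behind lexicon-based sentiment scoring (the word lists here are invented for the example; real projects use full sentiment lexicons or dedicated tools such as those described above):

# Tiny made-up sentiment lexicon
positive <- c("great", "fast", "love", "good")
negative <- c("terrible", "slow", "never", "bad")

score_sentiment <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  sum(words %in% positive) - sum(words %in% negative)  # positive minus negative hits
}

sapply(c("Great product and fast delivery",
         "Terrible support, never again"), score_sentiment)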

6. Visualize your analysis results


Creating charts and graphs to visualize your data can make analyses much easier to comprehend
and compare. MonkeyLearn Studio is an all-in-one business intelligence platform where you can
perform all of the above in one single interface, and then visualize your results in striking detail for
an interactive data experience.
MonkeyLearn Studio offers templates (or you can design your own) with multiple text analyses
chained together.
1. Simply choose a template
2. Upload your data
3. Run the analysis
4. Automatically visualize the results.

7. Draw conclusions from the results


When you can see all your results together, it’s easy to make data-driven decisions. See how
customer opinions change over time to follow brand sentiment and individual campaigns. Follow
different aspects of your business in real time to find out where you excel and where you may need
some work. With machine learning text analysis you can pull data from almost anywhere for real,
actionable insights.

Start Analyzing Your Unstructured Data


Unstructured data requires more steps and more computer analysis than structured data, because it
can’t easily fit into spreadsheets and databases. However, when you learn to use machine learning
tools, the process can be pretty painless and the results formidable.
Whether it comes from social media, customer surveys, customer service interactions, emails, etc.,
our suite of machine learning tools will ensure you get the most from your data.
