
MAHARANA PRATAP GROUP OF INSTITUTIONS

KOTHI MANDHANA, KANPUR

(Approved by AICTE, New Delhi and Affiliated to Dr.AKTU, Lucknow)

Digital Notes

[Department of Computer Science Engineering]


Subject Name : Data Analytics
Subject Code : KCS-051
Course : B. Tech
Branch : CSE
Semester : V
Prepared by : Mr. Anand Prakash Dwivedi
Index

Contents
Unit-1(Introduction) ................................................................................................................. 5
1.1 Why Data Analytics ................................................................................................... 5
1.2 IMPORTANCE OF DATA ANALYTICS ............................................................................... 5
1.3 CHARACTERISTICS OF DATA ANALYTICS...................................................................... 6
1.4 CLASSIFICATION OF DATA (STRUCTURED, SEMI-STRUCTURED, UNSTRUCTURED)......... 8
1.5 WHAT COMES UNDER BIG DATA? .............................................................................. 8
1.6 Types of Data ............................................................................................................. 9
1.7 STAGES OF BIG DATA BUSINESS ANALYTICS .............................................. 12
1.8 Key Computing Resources for Big Data .................................................... 13
1.9 BENEFITS OF BIG DATA ................................................................................. 15
1.10 Big Data Technologies ......................................................................... 16
1.11 BIG DATA CHALLENGES ............................................................................... 17
1.12 ANALYSIS VS REPORTING .............................................................................. 17
1.12 DATA ANALYTICS LIFECYCLE ......................................................................... 19
1.13 Common Tools for the Data Preparation Phase ........................................ 21
1.15 COMMON TOOLS FOR THE MODEL BUILDING PHASE .................................................. 22
KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT .................................................. 23
Unit – 2 (Data Analysis) .................................................................................... 23
2.1 What is Regression Analysis? .................................................................. 24
2.2 APPLICATION OF REGRESSION ANALYSIS IN RESEARCH ............................ 25
2.3 USE OF REGRESSION IN ORGANIZATIONS ............................................... 26
2.4 MULTIVARIATE (LINEAR) REGRESSION......................................................... 26
2.4.1 Polynomial Regression ............................................................................ 26
2.4.2 NONLINEAR REGRESSION ................................................................ 27
2.5 Introduction to Bayesian Modeling........................................................... 27
2.6 WHAT IS A BAYESIAN NETWORK? ...................................................................... 29
2.7 Limitations of Bayesian Networks ..................................................................... 31
2.8 INTRODUCTION TO SVM ................................................................................ 32
2.9 TIME SERIES ANALYSIS ................................................................................. 34
2.10 Nonlinear vs linear..................................................................................... 35
2.11 RULE INDUCTION ....................................................................................... 35
2.12 WHAT ARE NEURAL NETWORKS?...................................................................... 36
2.13 Types of Neural Networks ............................................................................ 37
2.14 Principal Component Analysis ............................................................... 37
2.15 WHAT IS FUZZY LOGIC? ........................................................................... 39
2.16 BENEFITS OF USING FUZZY LOGIC ............................................................. 40
Unit-3 (Mining Data Streams) ............................................................................. 42
3.1 What is Data Stream Mining ........................................................................... 42
3.2 Characteristics of data stream model ................................................................... 42
3.3 Data Streaming Architecture .................................................................. 43
3.4 Methodologies for Stream Data Processing ............................................................ 45
3.5 Stream Data Processing Methods....................................................................... 46
3.6 FILTERING AND STREAMING............................................................................. 47
3.7 RTAP .................................................................................................... 49
3.8 BENEFITS OF REAL-TIME ANALYTICS .................................................................... 51
3.9 CHALLENGES ............................................................................................ 52
3.10 Use cases for real-time analytics in customer experience management ............................. 52
3.11 EXAMPLES OF REAL-TIME ANALYTICS INCLUDE: ...................................................... 53
3.12 TYPES OF REAL-TIME ANALYTICS ...................................................................... 54
3.13 Generic Design of an RTAP........................................................................... 54
3.14 Stock Market Predictions .............................................................................. 55
3.15 Real Time stock Predictions........................................................................... 55
Unit-4 (Frequent Itemsets and Clustering) ........................................................................ 56
4.1 MINING FREQUENT PATTERNS, ASSOCIATION AND CORRELATIONS: .................................. 56
4.2 WHY IS FREQ. PATTERN MINING IMPORTANT? ......................................................... 56
4.3 INTRODUCTION TO MARKET BASKET ANALYSIS ....................................................... 56
4.4 Market Basket Benefits ................................................................................. 57
4.5 ASSOCIATION RULE MINING ............................................................................. 58
4.6 Example Association Rule .............................................................................. 59
4.7 Definition: Frequent Itemset ............................................................................ 60
4.8 SUPPORT AND CONFIDENCE ............................................................................. 61
4.9 Mining Association Rules .............................................................................. 62
4.10 FREQUENT ITEMSET GENERATION ..................................................................... 63
4.11 APRIORI ALGORITHM ................................................................................... 63
4.12 LIMITATIONS ........................................................................................ 66
4.13 METHODS TO IMPROVE APRIORI’S EFFICIENCY ............................................. 66
4.14 WHAT IS CLUSTER ANALYSIS? ........................................................................ 66
4.15 GENERAL APPLICATIONS OF CLUSTERING ............................................................. 67
4.16 Requirements of Clustering in Data Mining .......................................................... 67
4.17 Similarity and Dissimilarity Measures ................................................................ 67
4.18 MAJOR CLUSTERING APPROACHES .................................................................... 68
4.19 PARTITIONING ALGORITHMS: BASIC CONCEPT ........................................................ 68
4.20 K-MEANS CLUSTERING ................................................................................ 69
4.21 K-means Algorithms ................................................................................... 69
4.22 Weaknesses of K-Mean Clustering ................................................................... 70
4.23 APPLICATIONS OF K-MEAN CLUSTERING ............................................................. 70
4.24 CONCLUSION ........................................................................................ 70
4.25 CLIQUE (CLUSTERING IN QUEST).................................................................... 71
4.26 Strength and Weakness of CLIQUE .................................................................. 71
4.27 FREQUENT PATTERN-BASED APPROACH .............................................................. 71
Unit-5 (Frame Works and Visualization) ......................................................................... 72
5.1 WHAT IS HADOOP? ...................................................................................... 72
5.2 HADOOP DISTRIBUTED FILE SYSTEM ................................................................... 72
5.3 MAPREDUCE ............................................................................................. 73
5.4 Why MapReduce is so popular ......................................................................... 73
5.5 Understanding Map and Reduce........................................................................ 74
5.6 Benefits of MapReduce ................................................................................. 75
5.7 HDFS: Hadoop Distributed File System ............................................................... 76
5.8 HDFS ARCHITECTURE .................................................................................. 77
5.9 WHAT IS HIVE? .......................................................................................... 81
5.10 Hive Vs Relational Databases ......................................................................... 82
5.11 HIVE ARCHITECTURE .................................................................................. 84
5.12 WHAT IS PIG ............................................................................................ 85
5.13 HBASE .................................................................................................. 87
5.14 WHAT IS NOSQL? ..................................................................................... 87
5.15 HBASE VS. HDFS ...................................................................................... 88
5.16 HBase and RDBMS ................................................................................... 89
5.17 NoSQL Databases ..................................................................................... 90
5.18 VISUAL DATA ANALYTICS ............................................................................. 91
5.19 Visual Analytics Process .............................................................................. 92
5.21 VISUALIZATION TOOLS ................................................................................. 93
5.22 VISUAL DATA ANALYTICS APPLICATIONS ............................................................ 94
6. Question Bank Unit Wise....................................................................................... 95
6. 1 QUESTION BANK OF UNIT 1 ............................................................................. 95
6. 2 Question Bank of Unit 2 ............................................................................... 95
6.3 QUESTION BANK OF UNIT 3 ............................................................................. 96
6.4 QUESTION BANK OF UNIT 4 ............................................................................. 96
6.5 QUESTION BANK OF UNIT 5 ............................................................................. 97
7.Multiple Choice Question Unit Wise ........................................................................... 97
7.1 MCQ’s of Unit 1 ........................................................................................ 97
7.2 MCQ’s of Unit 2 ......................................................................................... 98
7.3 MCQ’s of Unit 3 ....................................................................................... 100
7.4 MCQ’s of Unit 4 ....................................................................................... 101
7.5 MCQ’s of Unit 5 ...................................................................................... 102
8. Previous year Question papers ...................................................................... 103
8.1 Year -2021 ............................................................................................ 103
9. NPTEL Lectures Link ........................................................................................ 105
9.1 Link 1 Introduction to Data Analytics ................................................................ 105
9.2 Link 2 Supervised Learning .......................................................................... 106
9.3 Link 3 Logistic Regression ........................................................................... 106
9.4 Link 4 Support Vector Machines ..................................................................... 106
9.5 Link 5 Artificial Neural Network..................................................................... 106

Unit-1(Introduction)
1.1 Why Data Analytics

Analytics is the discovery, interpretation, and communication of meaningful patterns in data, and the application of those patterns to effective decision making. Analytics is an encompassing and multidimensional field that uses mathematics, statistics, predictive modeling and machine learning techniques to find meaningful patterns and knowledge in recorded data.
Data analysis is a process of inspecting, cleansing, transforming, and modeling data. Data analytics refers to the qualitative and quantitative techniques and processes used to enhance productivity and business gain.
Why Data Analytics
Data Analytics is needed in Business to Consumer (B2C) applications. Organisations collect data from customers, businesses, the economy and practical experience. The data is then processed, categorised as per the requirement, and analysed to study purchase patterns and similar behaviour.
The process of Data Analysis
Analysis refers to breaking a whole into its separate components for individual examination.
Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Several phases can be distinguished: data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, data product, and communication.
Scope of Data Analytics
Because data analytics has a bright future, many professionals and students are interested in a career in it. Anyone who likes to work with numbers, thinks logically, and can understand figures and turn them into actionable insights has a good future in this field. Proper training in the tools of data analytics is required to begin with. Since the subject requires effort to learn and get certified in, there is always a dearth of qualified professionals. Being a relatively new field, the demand for such professionals also exceeds the current supply, and higher demand means higher salaries.

1.2 IMPORTANCE OF DATA ANALYTICS


● Predict customer trends and behaviours
● Analyse, interpret and deliver data in meaningful ways
● Increase business productivity
● Drive effective decision-making
Skills required for Data Analytics:
1.) Analytical Skills
2.) Numeracy Skills
3.) Technical and Computer Skills
4.) Attention to Details
5.) Business Skills
6.) Communication Skills

The Truth About Data Analytics:


Data analytics matters for businesses that want to make good use of the data they are taking in. Businesses that can use data analytics properly are more likely than others to succeed and thrive. Among all of its advantages, the key benefits can be described in this way:
● Data analytics reduces the costs associated with running a business.
● It cuts down on the time needed to come to strategy-defining decisions.
● Data analytics helps to more accurately define customer trends.
Determining the Effectiveness of Your Analytics Program
Given the growing familiarity and popularity of data analytics, there are a number of advanced analytics programs available on the market. As such, there are certain traits to look for in any analytics solution that will help you gauge just how effective it will be in improving your business.

1.3 CHARACTERISTICS OF DATA ANALYTICS


Big data is a term that is used to describe data that is high volume, high velocity, and/or
high variety; requires new technologies and techniques to capture, store, and analyze it;
and is used to enhance decision making, provide insight and discovery, and support and
optimize processes.
• For example, every customer e-mail, customer-service chat, and social media
comment may be captured, stored, and analyzed to better understand
customers’ sentiments. Web browsing data may capture every mouse
movement in order to better understand customers’ shopping behaviors.
• Radio frequency identification (RFID) tags may be placed on every single
piece of merchandise in order to assess the condition and location of every
item.
• Volume: Machine-generated data is produced in larger quantities than non-traditional
data.
a. Data volume is increasing exponentially
b. Roughly a 44x increase from 2009 to 2020
c. From 0.8 zettabytes to 35 ZB
• Velocity: This refers to the speed of data processing.
Data is being generated fast and needs to be processed fast.
Online Data Analytics: late decisions mean missed opportunities.

Examples
• E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body flag any abnormal measurements that require an immediate reaction

• Variety: This refers to large variety of input data which in turn generates large
amount of data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences, time
series, social media data, multi-dim arrays, etc…
c. Static data vs. streaming data

A single application can be generating/collecting many types of data.
1.4 CLASSIFICATION OF DATA (STRUCTURED, SEMI-STRUCTURED, UNSTRUCTURED)
Big data has many sources. For example, every mouse click on a web site can be captured in
Web log files and analyzed in order to better understand shoppers’ buying behaviors and to
influence their shopping by dynamically recommending products.

Social media sources such as Facebook and Twitter generate tremendous amounts of
comments and tweets. This data can be captured and analyzed to understand, for
example, what people think about new product introductions.

Machines, such as smart meters, generate data. These meters continuously stream data
about electricity, water, or gas consumption that can be shared with customers and
combined with pricing plans to motivate customers to move some of their energy
consumption, such as for washing clothes, to non-peak hours. There is a tremendous
amount of geospatial (e.g., GPS) data, such as that created by cell phones, that can
be used by applications like Foursquare to help you know the locations of friends
and to receive offers from nearby stores and restaurants. Image, voice, and audio
data can be analyzed for applications such as facial recognition systems in security
systems.

1.5 WHAT COMES UNDER BIG DATA?

Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.

• Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
• Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
• Stock Exchange Data: The stock exchange data holds information about the ‘buy’
and ‘sell’ decisions made by customers on the shares of different companies.
• Power Grid Data: The power grid data holds information consumed by a particular
node with respect to a base station.
• Transport Data: Transport data includes model, capacity, distance and availability of
a vehicle.
• Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and extensible variety of data.

1.6 Types of Data

o Structured data : Relational data.


o Semi Structured data : XML data.
o Unstructured data : Word, PDF, Text, Media Logs.

Structured Data

o It can be defined as the data that has a defined repeating pattern.


o This pattern makes it easier for any program to sort, read, and process the
data.
o Processing structured data is much faster and easier than processing data
without any specific repeating pattern.
o Is organised data in a prescribed format.
o Is stored in tabular form.
o Is the data that resides in fixed fields within a record or file.
o Is formatted data in which entities and their attributes are properly mapped.
o Is used in query and report against predetermined data types.
o Sources: DBMS/RDBMS, Flat files, Multidimensional databases, Legacy
databases
Unstructured Data
• It is a set of data that might or might not have any logical or repeating patterns.
• Typically includes metadata, i.e., additional information related to the data.
• Inconsistent data (files, social media websites, satellites, etc.)
• Data in different formats (e-mails, text, audio, video or images)
• Sources: Social media, mobile data, text both internal & external to an organization
Semi-Structured Data
• Having a schema-less or self-describing structure, it refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data.
• In other words, the data is not stored consistently in rows and columns of a database.
Sources: File systems such as Web data in the form of cookies, Data exchange formats

1.7 STAGES OF BIG DATA BUSINESS ANALYTICS

The different stages of business analytics are:


1. Descriptive analytics:
o Here the information that is present in the data is obtained and summarized. It
is primarily involved in finding all the statistics that describes the data. Eg:
How many buyers bought A.C. in the month of December previous years?
2. Diagnostic/ Discovery Analytics:
o This stage involves finding out the reason for the statistics determined in the
previous analytics stage; in other words, why those statistics happened.
Eg: Why is there an increase/decrease in the sales of A.C. in the month of December?

3. Predictive Analytics:
o This stage involves predicting the possible future events based on the
information obtained from the Descriptive and/or Discovery Analytics stages.
Also, in this stage the possible risks involved can be identified. Eg: What will be
the sales improvement next year (making insights for the future)?
4. Prescriptive Analytics:
o It involves planning actions or making decisions to improve the Business
based on the predictive analytics . Eg: How much amount of material should
be procured to increase the production?
What made Big Data needed?

1.8 Key Computing Resources for Big Data


• Processing capability: CPU, processor, or node.
• Memory
• Storage
• Network

Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

Why Big Data now?

• More data are being collected and stored

• Open source code

• Commodity hardware

What’s driving Big Data

A shift from data mining techniques on structured data from typical sources and small to mid-size datasets towards:

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time analysis

1.9 BENEFITS OF BIG DATA

Big data is really critical to our lives and is emerging as one of the most important
technologies in the modern world. Following are just a few of its benefits, which are well
known to all of us:

• Using the information kept in the social network like Facebook, the marketing
agencies are learning about the response for their campaigns, promotions, and other
advertising mediums.
• Using the information in the social media like preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
• Using the data regarding the previous medical history of patients, hospitals are
providing better and quick service.

BIG DATA ANALYTICS

• Accumulation of raw data captured from various sources (i.e. discussion boards,
emails, exam logs, chat logs in e-learning systems) can be used to identify fruitful
patterns and relationships
• By itself, stored data does not generate business value; this is true of traditional
databases, data warehouses, and new technologies such as Hadoop for storing big data.
• Once the data is appropriately stored, however, it can be analyzed, which can create
tremendous value. A variety of analysis technologies, approaches, and products have
emerged that are especially applicable to big data, such as in-memory analytics,
in-database analytics, and appliances.

1.10 Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions,
and reduced risks for the business.

There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology:

o Operational Big Data

o Analytical Big Data


Operational Big Data

• This includes systems like MongoDB that provide operational capabilities for real-
time, interactive workloads where data is primarily captured and stored.
• NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations
to be run inexpensively and efficiently. This makes operational big data workloads
much easier to manage, cheaper, and faster to implement.
• Some NoSQL systems can provide insights into patterns and trends based on real-time
data with minimal coding and without the need for data scientists and additional
infrastructure.

Analytical Big Data

• This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis
that may touch most or all of the data.
• MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled
up from single servers to thousands of high- and low-end machines.
• These two classes of technology are complementary and frequently deployed
together.
1.11 BIG DATA CHALLENGES

The major challenges associated with big data are as follows:

o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation

To fulfill the above challenges, organizations normally take the help of enterprise servers.

1.12 ANALYSIS VS REPORTING


Living in the era of digital technology and big data has made organizations dependent on the
wealth of information data can bring. You might have seen how reporting and analysis are
used interchangeably, especially in the manner in which outsourcing companies market their
services. While both areas are part of web analytics (note that analytics isn’t the same as
analysis), there’s a vast difference between them, and it’s more than just spelling.

It’s important that we differentiate the two, because some organizations might be selling
themselves short in one area and not reaping the benefits which web analytics can bring to the
table. The first core component of web analytics, reporting, is merely organizing data into
summaries. On the other hand, analysis is the process of inspecting, cleaning, transforming,
and modeling these summaries (reports) with the goal of highlighting useful information.

Simply put, reporting translates data into information while analysis turns information into
insights. Also, reporting should enable users to ask “What?” questions about the information,
whereas analysis should answer “Why?” and “What can we do about it?”

Here are five differences between reporting and analysis:

1. Purpose

Reporting helps companies monitor their data even before digital technology boomed.
Various organizations have been dependent on the information it brings to their business, as
reporting extracts that and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link between cross-channels of
data, provide comparisons, and make information easier to understand (think of a dashboard,
charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this
information and provides recommendations on actions.

2. Tasks
As reporting and analysis have a very fine line dividing them, it’s sometimes easy to confuse
tasks that have analysis labeled on top of them when all they really do is reporting. Hence,
ensure that your analytics team has a healthy balance of both.

Here’s a great differentiator to keep in mind if what you’re doing is reporting or analysis:

Reporting includes building, configuring, consolidating, organizing, formatting, and
summarizing. This is very similar to what was mentioned above: turning data into charts and
graphs, and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With


big data, predicting is possible as well.

3. Outputs

Reporting and analysis have the push and pull effect from its users through their outputs.
Reporting has a push approach, as it pushes information to users and outputs come in the
forms of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst draws information to further probe and to
answer business questions. Outputs can take the form of ad hoc responses and analysis
presentations. Analysis presentations comprise insights, recommended actions, and a forecast
of their impact on the company, all in a language that’s easy to understand at the level of the
user who’ll be reading and deciding on it.

This is important for organizations to truly realize the value of data and to understand that a
standard report is not the same as meaningful analytics.

4. Delivery

Considering that reporting involves repetitive tasks, often with truckloads of data, automation
has been a lifesaver, especially now with big data. It’s not surprising that the first things
outsourced are data entry services, since outsourcing companies are perceived as data
reporting experts.

Analysis requires a more custom approach, with human minds doing superior reasoning and
analytical thinking to extract insights, and technical skills to provide efficient steps towards
accomplishing a specific goal. This is why data analysts and scientists are in demand these
days, as organizations depend on them to come up with recommendations that help leaders
and business executives make decisions about their businesses.

5. Value

This isn’t about identifying which one brings more value, rather understanding that both are
indispensable when looking at the big picture. It should help businesses grow, expand, move
forward, and make more profit or increase their value.

Reporting                      Analysis
Provides data                  Provides answers
Provides what is asked for     Provides what is needed
Is typically standardized      Is typically customized
Does not involve a person      Involves a person
Is fairly inflexible           Is extremely flexible

1.12 DATA ANALYTICS LIFECYCLE


• Big Data analysis differs from traditional data analysis primarily due to the volume,
velocity and variety characteristics of the data being processed.
• To address the distinct requirements for performing analysis on Big Data, a step-by-
step methodology is needed to organize the activities and tasks involved with
acquiring, processing, analyzing and repurposing data.

Phases of the Data Analytics Lifecycle

Phase 1: Discovery
In this phase,
• The data science team must learn and investigate the problem,
• Develop context and understanding, and
• Learn about the data sources needed and available for the project.
• In addition, the team formulates initial hypotheses that can later be tested with data.

The team should perform five main activities during this step of the discovery
phase:
• Identify data sources: Make a list of data sources the team may need to test
the initial hypotheses outlined in this phase.
o Make an inventory of the datasets currently available and those that can be
purchased or otherwise acquired for the tests the team wants to perform.
• Capture aggregate data sources: This is for previewing the data and providing
high-level understanding.
o It enables the team to gain a quick overview of the data and perform further
exploration on specific areas.
• Review the raw data: Begin understanding the interdependencies among the
data attributes.
o Become familiar with the content of the data, its quality, and its
limitations.
• Evaluate the data structures and tools needed: The data type and structure dictate
which tools the team can use to analyze the data.
• Scope the sort of data infrastructure needed for this type of problem: In addition
to the tools needed, the data influences the kind of infrastructure that's required, such
as disk storage and network capacity.
• Unlike many traditional stage-gate processes, in which the team can advance only
when specific criteria are met, the Data Analytics Lifecycle is intended to
accommodate more ambiguity.
• For each phase of the process, it is recommended to pass certain checkpoints as a way
of gauging whether the team is ready to move to the next phase of the Data Analytics
Lifecycle.

Phase 2: Data preparation

• This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
• It requires the presence of an analytic sandbox (workspace), in which the team can
work with data and perform analytics for the duration of the project.
o The team needs to execute Extract, Load, and Transform (ELT) or extract,
transform and load (ETL) to get data into the sandbox.
o In ETL, users perform processes to extract data from a datastore, perform
data transformations, and load the data back into the datastore.
o The ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and analyze it.
Rules for Analytics Sandbox
• When developing the analytic sandbox, collect all kinds of data there, as team
members need access to high volumes and varieties of data for a Big Data analytics
project.
• This can include everything from summary-level aggregated data, structured data
, raw data feeds, and unstructured text data from call logs or web logs, depending
on the kind of analysis the team plans to undertake.
• A good rule is to plan for the sandbox to be at least 5– 10 times the size of the original
datasets, partly because copies of the data may be created that serve as specific tables
or data stores for specific kinds of analysis in the project.
Performing ETLT
• As part of the ETLT step, it is advisable to make an inventory of the data and
compare the data currently available with datasets the team needs.
• Performing this sort of gap analysis provides a framework for understanding which
datasets the team can take advantage of today and where the team needs to initiate
projects for data collection or access to new datasets currently unavailable.
• A component of this subphase involves extracting data from the available sources
and determining data connections for raw data, online transaction processing
(OLTP) databases, online analytical processing (OLAP) cubes, or other data feeds.
• Data conditioning refers to the process of cleaning data, normalizing datasets, and
performing transformations on the data; a minimal sketch of such an ETL/conditioning step follows.
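
To make the ETLT / data-conditioning step concrete, here is a minimal sketch in Python (pandas + sqlite3). The file name, column names, and the SQLite sandbox file are hypothetical placeholders, not part of the notes; a real project would use the organization's own sources and sandbox.

    # Minimal ETL / data-conditioning sketch (hypothetical file and column names).
    import sqlite3
    import pandas as pd

    # Extract: read raw data from a source extract.
    raw = pd.read_csv("transactions.csv")

    # Transform (data conditioning): clean and normalize a few columns.
    raw = raw.dropna(subset=["customer_id"])               # drop rows missing a key field
    raw["amount"] = raw["amount"].astype(float)            # enforce a numeric type
    raw["order_date"] = pd.to_datetime(raw["order_date"])  # normalize dates

    # Load: write the conditioned data into the analytic sandbox for later analysis.
    with sqlite3.connect("analytics_sandbox.db") as conn:
        raw.to_sql("transactions_clean", conn, if_exists="replace", index=False)
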
1.13 Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
Hadoop can perform massively parallel ingest and custom analysis for web traffic analysis,
GPS location analytics, and combining of massive unstructured data feeds from multiple
sources.
Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as staged data-mining
techniques (for example, first select the top 100 customers, and then run descriptive statistics
and clustering).
OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for
working with messy data.” It is a GUI-based tool for performing data transformations, and it's
one of the most robust free tools currently available.
Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and
transformation. Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset.

Phase 3: Model Planning


Phase 3 is model planning , where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase.
• The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
• During this phase that the team refers to the hypotheses developed in Phase 1, when
they first became acquainted with the data and understanding the business problems
or domain area.
Common Tools for the Model Planning Phase
Here are several of the more common ones:
• R has a complete set of modeling capabilities and provides a good environment for
building interpretive models with high-quality code. In addition, it has the ability to
interface with databases via an ODBC connection and execute statistical tests.
• SQL Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models.
• SAS/ACCESS provides integration between SAS and the analytics sandbox via
multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally
used on file extracts, but with SAS/ACCESS, users can connect to relational
databases (such as Oracle or Teradata).
Phase 4: Model Building
• In this phase the data science team needs to develop data sets for training, testing, and
production purposes. These data sets enable the data scientist to develop the analytical
model and train it ("training data"), while holding aside some of the data ("holdout
data" or "test data") for testing the model.
• the team develops datasets for testing, training, and production purposes.
o In addition, in this phase the team builds and executes models based on the
work done in the model planning phase.
o The team also considers whether its existing tools will be sufficient for running
the models, or if it will need a more robust environment for executing models
and workflows (for example, fast hardware and parallel processing, if
applicable). A small sketch of creating training and holdout datasets follows below.
• Free or Open Source tools: R and PL/R, Octave, WEKA, Python
• Commercial Tools: Matlab, STATISTICA.
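
As a concrete illustration of splitting data into training and holdout ("test") sets, here is a minimal Python sketch using scikit-learn. The file name, column names, and the choice of a logistic regression model are assumptions for illustration only.

    # Sketch: training data and holdout ("test") data with scikit-learn.
    # The file, column names and model choice are hypothetical.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("prepared_data.csv")      # conditioned data from the sandbox
    X = data.drop(columns=["target"])            # independent variables
    y = data["target"]                           # dependent variable

    # Hold aside 30% of the rows for testing the trained model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("Holdout accuracy:", model.score(X_test, y_test))
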
Phase 5: Communicate Results
• In Phase 5, after executing the model, the team needs to compare the outcomes of the
modeling to the criteria established for success and failure.
• The team considers how best to articulate the findings and outcomes to the various
team members and stakeholders, taking into account caveats, assumptions, and any
limitations of the results.
• The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
• In the final phase (Phase 6, Operationalize), the team communicates the benefits of the
project more broadly and sets up a pilot project to deploy the work in a controlled way
before broadening the work to a full enterprise or ecosystem of users.
• This approach enables the team to learn about the performance and related
constraints of the model in a production environment on a small scale and make
adjustments before a full deployment.
• The team delivers final reports, briefings, code, and technical documents. In addition,
the team may run a pilot project to implement the models in a production
environment.
1.15 COMMON TOOLS FOR THE MODEL BUILDING PHASE
Free or Open Source tools:
• R and PL/R were described earlier in the model planning phase; PL/R is a
procedural language for PostgreSQL with R. Using this approach means that R
commands can be executed in-database.
• Octave , a free software programming language for computational modeling, has
some of the functionality of Matlab. Because it is freely available, Octave is used in
major universities when teaching machine learning.
• WEKA is a free data mining software package with an analytic workbench. The
functions created in WEKA can be executed within Java code.
• Python is a programming language that provides toolkits for machine learning and
analysis, such as scikit-learn, numpy, scipy, pandas, and related data visualization
using matplotlib.
• SQL in-database implementations, such as MADlib, provide an alternative to
in-memory desktop analytical tools.
• MADlib provides an open-source machine learning library of algorithms that can
be executed in-database, for PostgreSQL or Greenplum.

KEY ROLES FOR A SUCCESSFUL ANALYTICS PROJECT


• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain expertise based on deep
understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction,
supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling

Unit – 2 (Data Analysis)


2.1 What is Regression Analysis?

 Regression analysis is used to:


• Predict the value of a dependent variable based on the value of at least one
independent variable
• Explain the impact of changes in an independent variable on the dependent
variable
 Dependent variable: the variable we wish to explain

 Independent variable: the variable used to explain the dependent variable

Simple Linear Regression Model


• Only one independent variable, x
• Relationship between x and y is described by a linear function
• Changes in y are assumed to be caused by changes in x

y = β0 + β1x + ε

where β0 + β1x is the linear component and ε is the random error component.

Linear Regression Assumptions


• The underlying relationship between the x variable and the y variable is linear
• The distribution of the errors has constant variability
• Error values are normally distributed
• Error values are independent (over time)
The estimated simple linear regression equation is ŷi = b0 + b1xi, where x is the independent variable.

Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero


• b1 is the estimated change in the average value of y as a result of a one-unit change in
x
Finding the Least Squares Equation

• The coefficients b0 and b1 will be found using computer software, such as Excel’s
data analysis add-in or MegaStat
• Other regression measures will also be computed as part of computer-based
regression analysis

Simple Linear Regression Example

 A real estate agent wishes to examine the relationship between the selling price of a
home and its size (measured in square feet)
 A random sample of 10 houses is selected
• Dependent variable (y) = house price in $1000s
• Independent variable (x) = square feet (a minimal fitting sketch with illustrative numbers follows)
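
A minimal sketch of fitting ŷ = b0 + b1x by least squares with NumPy is shown below. The house-price numbers are illustrative and are not the notes' actual 10-house sample.

    # Fitting y-hat = b0 + b1*x by least squares with NumPy (illustrative data).
    import numpy as np

    square_feet = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
    price_k     = np.array([ 245,  312,  279,  308,  199,  219,  405,  324,  319,  255])

    b1, b0 = np.polyfit(square_feet, price_k, deg=1)   # slope b1 and intercept b0
    print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")

    # b1 is the estimated change in price (in $1000s) per additional square foot;
    # prediction for a 2000 square-foot house:
    print("predicted price:", b0 + b1 * 2000)
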

2.2 APPLICATION OF REGRESSION ANALYSIS IN RESEARCH

i. It helps in the formulation and determination of functional relationship between two


or more variables.
ii. It helps in establishing a cause and effect relationship between two variables in
economics and business research.
iii. It helps in predicting and estimating the value of a dependent variable such as price,
production, sales, etc.
iv. It helps to measure the variability or spread of values of a dependent variable with
respect to the regression line

2.3 USE OF REGRESSION IN ORGANIZATIONS

In the field of business, regression is widely used by businessmen in:


• Predicting future production
• Investment analysis
• Forecasting on sales etc.
It is also used in sociological studies and economic planning to find projections of
population, birth rates, death rates, etc.
So the success of a businessman depends on the correctness of the various estimates that he is
required to make.

METHODS OF STUDYING REGRESSION:

Graphically:
• Free hand curve
• Least squares

Algebraically:
• Least squares (regression equations)
• Deviation method from arithmetic mean
• Deviation method from assumed mean

2.4 MULTIVARIATE (LINEAR) REGRESSION


This is a regression model with multiple independent variables.
Here, there are several independent (regressor) variables x1, x2, ..., xn with only one dependent (response) variable y.
The model therefore assumes the following form:
yi = β0 + β1x1i + β2x2i + ... + βnxni + ε
where, for each regressor, the first index labels the variable and the second the observation.
NB: The exact values of β and ε are, and will always remain, unknown.
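
The following is a small illustrative sketch of fitting such a multivariate linear model with scikit-learn on synthetic data; the true coefficients used to generate the data are made up for the example.

    # Multivariate linear regression on synthetic data (made-up coefficients).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                  # two regressors x1, x2
    y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

    model = LinearRegression().fit(X, y)
    print("intercept b0:", model.intercept_)       # estimate of beta_0
    print("coefficients b1, b2:", model.coef_)     # estimates of beta_1, beta_2
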

2.4.1 Polynomial Regression


This is a special case of multivariate regression, with only one independent variable
x, but an x-y relationship which is clearly nonlinear (at the same time, there is no ‘physical’
model to rely on):
y = β0 + β1x + β2x² + β3x³ + ... + βnxⁿ + ε
Effectively, this is the same as having a multivariate model with x1 ≡ x, x2 ≡ x², x3 ≡ x³, and so on.
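
A brief sketch, assuming a quadratic relationship and synthetic data, showing that the polynomial fit reduces to an ordinary least-squares fit on the powers of x:

    # Polynomial regression y = b0 + b1*x + b2*x^2 + e on synthetic data.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-3, 3, 100)
    y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.3, size=x.size)

    coeffs = np.polyfit(x, y, deg=2)    # returns [b2, b1, b0]
    print("b2, b1, b0 =", coeffs)
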
2.4.2 NONLINEAR REGRESSION
This is a model with one independent variable (the results can be easily extended to several)
and ‘n’ unknown parameters, which we will call b1, b2, ..., bn:
y = f(x, b) + ε
where f(x, b) is a specific (given) function of the independent variable and the ‘n’ parameters.
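
A minimal nonlinear-regression sketch using scipy's curve_fit is given below; the exponential form chosen for f(x, b) is only an assumed example, not a form prescribed by the notes.

    # Nonlinear regression y = f(x, b) + e with an assumed exponential f.
    import numpy as np
    from scipy.optimize import curve_fit

    def f(x, b1, b2):
        return b1 * np.exp(b2 * x)      # f(x, b) with unknown parameters b1, b2

    rng = np.random.default_rng(2)
    x = np.linspace(0, 2, 50)
    y = 2.0 * np.exp(0.8 * x) + rng.normal(scale=0.1, size=x.size)

    params, _ = curve_fit(f, x, y, p0=[1.0, 0.5])   # p0 is an initial guess for b
    print("estimated b1, b2:", params)
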

2.5 Introduction to Bayesian Modeling


• From the social science researcher's point of view, the requirements of traditional
frequentistic statistical analysis are very challenging.
• For example, the assumption of normality of both the phenomena under investigation
and the data is a prerequisite for traditional parametric frequentistic calculations.

• Continuous variables: age, income, temperature, ...

• In situations where
– a latent construct cannot be appropriately represented as a
continuous variable,
– ordinal or discrete indicators do not reflect underlying continuous variables,
– the latent variables cannot be assumed to be normally distributed,
traditional Gaussian modeling is clearly not appropriate.
• In addition, normal distribution analysis sets minimum requirements for the number
of observations, and the measurement level of variables should be continuous.

Introduction to Bayesian Modeling


• Frequentistic parametric statistical techniques are designed for normally distributed
(both theoretically and empirically) indicators that have linear dependencies.
– Univariate normality
– Multivariate normality
– Bivariate linearity
• The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells
us how to update our initial probabilities P(H) if we see evidence E, in order to find
out P(H|E).

• A priori probability
• Conditional probability
• Posteriori probability
Bayes’ Theorem
Why does it matter? If 1% of a population have cancer, then for a screening test with 80% sensitivity and 95% specificity:

P[Test +ve | Cancer] = 80%
P[Test +ve] = 5.75%
P[Cancer | Test +ve] = P[Test +ve | Cancer] × P[Cancer] / P[Test +ve] ≈ 14%

... i.e. most positive results are actually false alarms.

Mixing up P[A | B] with P[B | A] is the Prosecutor’s Fallacy; a small probability of evidence given innocence need NOT mean a small probability of innocence given evidence.
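
The screening-test arithmetic above can be reproduced in a few lines of Python:

    # Bayes' theorem applied to the screening-test example.
    p_cancer    = 0.01    # prior: 1% of the population have cancer
    sensitivity = 0.80    # P[Test +ve | Cancer]
    specificity = 0.95    # P[Test -ve | No cancer]

    p_positive = sensitivity * p_cancer + (1 - specificity) * (1 - p_cancer)
    posterior  = sensitivity * p_cancer / p_positive

    print(f"P[Test +ve]          = {p_positive:.4f}")   # 0.0575, i.e. 5.75%
    print(f"P[Cancer | Test +ve] = {posterior:.3f}")    # roughly 0.14, i.e. about 14%
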

2.6 WHAT IS A BAYESIAN NETWORK?

A Bayesian network (BN) is a graphical model for depicting probabilistic relationships


among a set of variables.
• BN Encodes the conditional independence relationships between the variables in the
graph structure.
• Provides a compact representation of the joint probability distribution over the
variables
• A problem domain is modeled by a list of variables X1, …, Xn
• Knowledge about the problem domain is represented by a joint probability P(X1, …,
Xn)
• Directed links represent causal direct influences
• Each node has a conditional probability table quantifying the effects from the parents.
• No directed cycles

A Bayesian Network consists of:

 A Directed Acyclic Graph (DAG)
 A set of conditional probability tables, one for each node in the graph

So BN = (DAG, CPD)

 DAG: directed acyclic graph (BN’s structure)


• Nodes: random variables (typically binary or discrete, but methods
also exist to handle continuous variables)
• Arcs: indicate probabilistic dependencies between nodes (lack of a link signifies
conditional independence)
• CPD: conditional probability distribution (BN’s parameters)
• Conditional probabilities at each node, usually stored as a table
(conditional probability table, or CPT)
So, what is a DAG?

• Directed acyclic graphs use only unidirectional arrows to show the direction of causation.
• Following general graph principles, a node A is a parent of another node B if there is an arrow from node A to node B.
• Informally, an arrow from node X to node Y means X has a direct influence on Y.
• Each node in the graph represents a random variable.

What is Inference in BN?


— Using a Bayesian network to compute probabilities is called inference
— In general, inference involves queries of the form:

P( X | E )
where X is the query variable and E is the evidence variable.
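
A minimal sketch of such a query, computed by enumeration on a tiny two-node network A → B, is shown below; the conditional probability values are made up for illustration.

    # Inference P(A | B=True) by enumeration on a tiny network A -> B
    # (the CPT values below are made up for illustration).
    p_a = {True: 0.3, False: 0.7}              # P(A)
    p_b_given_a = {True: 0.9, False: 0.2}      # P(B=True | A)

    # P(A=a | B=True) is proportional to P(a) * P(B=True | a)
    joint = {a: p_a[a] * p_b_given_a[a] for a in (True, False)}
    posterior = joint[True] / sum(joint.values())
    print("P(A=True | B=True) =", round(posterior, 3))   # 0.27 / 0.41, about 0.659
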

2.7 Limitations of Bayesian Networks

• Typically require initial knowledge of many probabilities…quality and extent of


prior knowledge play an important role
• Significant computational cost (NP-hard task)
• Unanticipated probability of an event is not taken care of.

Representing causality in Bayesian Networks

— A causal Bayesian network, or simply causal networks, is a Bayesian network whose


arcs are interpreted as indicating cause-effect relationships
— Build a causal network:
 Choose a set of variables that describes the domain
 Draw an arc to a variable from each of its direct causes
(Domain knowledge required)

Summary
• Bayesian methods provide sound theory and framework for implementation of
classifiers
• Bayesian networks a natural way to represent conditional independence
information. Qualitative info in links, quantitative in tables.
• It is NP-complete or NP-hard to compute exact values; it is typical to make simplifying
assumptions or use approximate methods.
• Many Bayesian tools and systems exist
• Bayesian Networks: an efficient and effective representation of the joint
probability distribution of a set of random variables
• Efficient:
o Local models
o Independence (d-separation)
• Effective:
o Algorithms take advantage of structure to
o Compute posterior probabilities
o Compute most probable instantiation
o Decision making
2.8 INTRODUCTION TO SVM
Support vector machines (SVMs) are powerful yet flexible supervised machine learning
algorithms which are used both for classification and regression. But generally, they are
used in classification problems. SVMs were first introduced in the 1960s and later refined in
the 1990s. SVMs have their unique way of implementation as compared to other
machine learning algorithms. Lately, they are extremely popular because of their ability to
handle multiple continuous and categorical variables.

Working of SVM

An SVM model is basically a representation of different classes in a hyperplane in


multidimensional space. The hyperplane will be generated in an iterative manner by SVM so
that the error can be minimized. The goal of SVM is to divide the datasets into classes to
find a maximum marginal hyperplane (MMH).

The followings are important concepts in SVM −


• Support Vectors − Data points that are closest to the hyperplane are called support
vectors. The separating line is defined with the help of these data points.
• Hyperplane − A decision plane or space which divides a set of objects having
different classes.
• Margin − It may be defined as the gap between two lines on the closest data points of
different classes. It can be calculated as the perpendicular distance from the line to
the support vectors. A large margin is considered a good margin and a small margin
is considered a bad margin.
The main goal of SVM is to divide the datasets into classes to find a maximum marginal
hyperplane (MMH) and it can be done in the following two steps −
• First, SVM will generate hyperplanes iteratively that segregates the classes in best
way.
• Then, it will choose the hyperplane that separates the classes correctly.

SVM Kernels

• In practice, the SVM algorithm is implemented with a kernel that transforms an input data
space into the required form. SVM uses a technique called the kernel trick in which
kernel takes a low dimensional input space and transforms it into a higher
dimensional space. In simple words, kernel converts non-separable problems into
separable problems by adding more dimensions to it. It makes SVM more powerful,
flexible and accurate. The following are some of the types of kernels used by SVM.

Linear Kernel

• It can be used as a dot product between any two observations. The formula of linear
kernel is as below −
• K(x, xi) = sum(x ∗ xi)
• From the above formula, we can see that the product between two vectors, say x and xi,
is the sum of the multiplication of each pair of input values.
Polynomial Kernel

• It is a more generalized form of the linear kernel and can distinguish curved or nonlinear
input spaces. Following is the formula for the polynomial kernel −
• k(X, Xi) = 1 + sum(X ∗ Xi)^d
• Here d is the degree of polynomial, which we need to specify manually in the
learning algorithm.

Radial Basis Function (RBF) Kernel

• The RBF kernel, mostly used in SVM classification, maps the input space into an
indefinite-dimensional space. The following formula explains it mathematically −
• K(x, xi) = exp(−gamma ∗ sum((x − xi)^2))
• Here, gamma ranges from 0 to 1. We need to manually specify it in the learning
algorithm. A good default value of gamma is 0.1. A short comparison of these kernels in code follows.
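
Below is a short sketch, assuming scikit-learn and a synthetic toy dataset, that trains an SVC with each of the three kernels described above and compares their holdout accuracy.

    # Comparing linear, polynomial and RBF kernels with scikit-learn's SVC.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for kernel, params in [("linear", {}),
                           ("poly",   {"degree": 3}),   # the degree d is chosen manually
                           ("rbf",    {"gamma": 0.1})]: # gamma set to the 0.1 default above
        clf = SVC(kernel=kernel, **params).fit(X_train, y_train)
        print(kernel, "accuracy:", round(clf.score(X_test, y_test), 3))
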
2.9 TIME SERIES ANALYSIS
• Aim:
– To collect and analyze the past observations to develop an appropriate model which can
then be used to generate future values for the series.
• Time Series Forecasting is based on the idea that the history of occurrences over time
can be used to predict the future; a minimal forecasting sketch follows the application list below.

Application
• Business
• Economics
• Finance
• Science and Engineering
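
As a minimal illustration of forecasting from past observations, the sketch below uses a simple moving average on a made-up series; this is only one illustrative method, not one the notes prescribe.

    # Simple moving-average forecast of the next value of a made-up series.
    import numpy as np

    sales = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])
    window = 3
    forecast = sales[-window:].mean()     # average of the last 3 observations
    print("next-period forecast:", round(forecast, 1))
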

An overview of nonlinear dynamics Fundamental concepts


• System may be defined as an orderly working totality, a set of units combined by
nature, by science, or by art to form a whole.
• A system is not just a set of elements but also includes interactions, both between the
system’s elements and with the ‘external world’.
• Interactions may be static or dynamic, i.e. through an exchange of mass, energy,
electric charge or through an exchange of information.
• A living organism is an open system, supplied with free energy from biochemical
reactions. There are also effects of information interactions.
• In physics state of a system in a given moment of time is characterized by values of
state variables (at this moment).
• The minimum number of independent state variables that are necessary to
characterize the system's state is called the number of degrees of freedom of the
system. If a system has n degrees of freedom then any state of the system may be
characterized by a point in an n-dimensional space with appropriately defined
coordinates, called the system's phase space
Fundamental concepts and definitions
• Process is defined as a series of gradual changes in a system that succeed one another.
Every process exhibits a characteristic time, τ, that defines the time scale for this
process. In the system's phase space a process is represented by a series of connected
points called trajectory.
• Attractor is a subset of the system's phase space that attracts trajectories (i.e. the
system tends towards the states that belong to some attractor).
• Signal is a detectable physical quantity or impulse (as a voltage, current, magnetic
field strength) by which information can be transmitted from a given system to other
systems, e.g. to a measuring device (EEG, ECG, EMG)
• Noise is any unwanted signal that interferes with the desired signal

2.10 Nonlinear vs linear


• Linearity in science means more or less the same as proportionality or additivity. But
linearity has its limits. (Nonlinearity-nonadditivity)
• Reductionism, a methodological attitude of explaining properties of a system through
properties of its elements alone, may work only for linear systems.
• Some systems have properties that depend more on the way the elements are connected
than on what the specific properties of the individual elements are.
• Far from equilibrium vs equilibrium: Thermodynamic equilibrium means a complete
lack of differences between different parts of the system and, as a consequence, a
complete lack of changes in the system – all processes are stopped. ‘Living’ states of
any system are nonequilibrium states.
• Equilibrium, the unique state when all properties are equally distributed, is the state of
‘death’. This is true not just for a single cell or an organism. In systems close to
equilibrium one can observe linear processes, while in systems far from equilibrium
the processes are nonlinear. Life appears to be a nonlinear phenomenon.
2.11 RULE INDUCTION
• Rule induction is one of the most important techniques of machine learning. Since
regularities hidden in data are frequently expressed in terms of rules, rule induction is
one of the fundamental tools of data mining at the same time. Usually rules are
expressions of the form

• if (attribute − 1, value − 1) and (attribute − 2, value − 2) and ···

• and (attribute − n, value − n) then (decision, value).

• Some rule induction systems induce more complex rules, in which values of attributes
may be expressed by negation of some values or by a value subset of the attribute
domain

• Data from which rules are induced are usually presented in a form similar to a table
in which cases (or examples) are labels (or names) for rows and variables are labeled
as attributes and a decision. We will restrict our attention to rule induction which
belongs to supervised learning:
• all cases are preclassified by an expert. In other words, the decision value is assigned
by an expert to each case. Attributes are independent variables and the decision is a
dependent variable.

• A very simple example of such a table is presented as Table 1.1, in which the attributes
are Temperature, Headache, Weakness, Nausea, and the decision is Flu. The set of all
cases labeled by the same decision value is called a concept. For Table 1.1, the case set
{1, 2, 4, 5} is a concept of all cases affected by flu (for each case from this set the
corresponding value of Flu is yes).

Table 1.1 Attributes and decision

Case   Temperature   Headache   Weakness   Nausea   Flu (decision)
1      41.6          yes        yes        no       yes
2      39.8          yes        no         yes      yes
3      36.8          no         no         no       no
4      37.0          yes        yes        yes      yes
5      38.8          no         yes        no       yes
6      40.2          no         no         no       no
7      36.6          no         yes        no       no
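
The following minimal sketch represents Table 1.1 as a list of Python dictionaries and checks how many cases a hypothetical rule, (Headache, yes) and (Weakness, yes) then (Flu, yes), covers and covers correctly; the rule is only an illustration, not one produced by a particular rule induction system:

# Table 1.1 as data, plus a coverage check for one hypothetical rule.
cases = [
    {"Temperature": 41.6, "Headache": "yes", "Weakness": "yes", "Nausea": "no",  "Flu": "yes"},
    {"Temperature": 39.8, "Headache": "yes", "Weakness": "no",  "Nausea": "yes", "Flu": "yes"},
    {"Temperature": 36.8, "Headache": "no",  "Weakness": "no",  "Nausea": "no",  "Flu": "no"},
    {"Temperature": 37.0, "Headache": "yes", "Weakness": "yes", "Nausea": "yes", "Flu": "yes"},
    {"Temperature": 38.8, "Headache": "no",  "Weakness": "yes", "Nausea": "no",  "Flu": "yes"},
    {"Temperature": 40.2, "Headache": "no",  "Weakness": "no",  "Nausea": "no",  "Flu": "no"},
    {"Temperature": 36.6, "Headache": "no",  "Weakness": "yes", "Nausea": "no",  "Flu": "no"},
]

matched = [c for c in cases if c["Headache"] == "yes" and c["Weakness"] == "yes"]
correct = [c for c in matched if c["Flu"] == "yes"]
print(len(matched), len(correct))   # cases covered by the rule, and covered correctly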

2.12 WHAT ARE NEURAL NETWORKS?

• Models of the brain and nervous system


• Highly parallel
o Process information much more like the brain than a serial computer
• Learning
o Very simple principles
o Very complex behaviours
• Applications
o As powerful problem solvers
o As biological models

A method of computing, based on the interaction of multiple connected processing elements.


• A powerful technique to solve many real world problems.
• The ability to learn from experience in order to improve their performance.
• Ability to deal with incomplete information

Basics Of Neural Network


• Biological approach to AI
• Developed in 1943
• Comprised of one or more layers of neurons
• Several types; we'll focus on feed-forward and feedback networks

2.13 Types of Neural Networks

Neural Network types can be classified based on the following attributes:
•Connection Type
- Static (feedforward)
- Dynamic (feedback)
• Topology
- Single layer
- Multilayer
- Recurrent
• Learning Methods
- Supervised
- Unsupervised
- Reinforcement
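
A minimal NumPy sketch of the forward pass of a static (feed-forward) network with one hidden layer, of the kind listed above; the weights are random placeholders, so training (e.g. supervised learning with backpropagation) is not shown:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)   # input layer (3 units) -> hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # hidden layer -> output (1 unit)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    h = sigmoid(x @ W1 + b1)      # hidden-layer activations
    return sigmoid(h @ W2 + b2)   # network output

print(forward(np.array([0.5, -1.0, 2.0])))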
Neural Network Applications
Pattern recognition
• Investment analysis
• Control systems & monitoring
• Mobile computing
• Marketing and financial applications
• Forecasting – sales, market research, meteorology
Advantages:
• A neural network can perform tasks that a linear program can not.
• When an element of the neural network fails, it can continue without any problem by
their parallel nature.
• A neural network learns and does not need to be reprogrammed.
• It can be implemented in any application.
• It can be implemented without any problem
Disadvantages:
•The neural network needs training to operate.
•The architecture of a neural network is different from the architecture of microprocessors
therefore needs to be emulated.
•Requires high processing time for large neural networks.
Conclusions
• Neural networks provide the ability to build more human-like AI
• Takes rough approximation and hard-coded reactions out of AI design (i.e. Rules and
FSMs)
• Still require a lot of fine-tuning during development

2.14 Principal Component Analysis


• The PCA method is a statistical method for Feature Selection and Dimensionality
Reduction.
• Feature Selection is a process whereby a data space is transformed into a feature space.
In principle, both spaces have the same dimensionality.
• However, in the PCA method, the transformation is designed in such a way that the data
set is represented by a reduced number of “effective” features and yet retains most of
the intrinsic information contained in the data; in other words, the data set undergoes a
dimensionality reduction.
• Suppose that we have a vector x of dimension m and we wish to transmit it using l
numbers, where l < m. If we simply truncate the vector x, we will cause a mean-square
error equal to the sum of the variances of the elements eliminated from x.
• So, we ask: Does there exist an invertible linear transformation T such that the
truncation of Tx is optimum in the mean-squared sense?
• Clearly, the transformation T should have the property that some of its components
have low variance.
• Principal Component Analysis maximises the rate of decrease of variance and is the
right choice.
• Before we present neural network, Hebbian-based, algorithms that do this we first
present the statistical analysis of the problem.

• Let X be an m-dimensional random vector representing the environment of interest.


We assume that the vector X has zero mean:

• E[X]=0

• Where E is the statistical expectation operator. If X does not have zero mean, we first
subtract the mean from X before we proceed with the rest of the analysis.

• Let q denote a unit vector, also of dimension m, onto which the vector X is to be
projected. This projection is defined by the inner product of the vectors X and q:

• A=XTq=qTX

• Subject to the constraint:

• ||q|| = (qTq)^½ = 1

• The projection A is a random variable with a mean and variance related to the
statistics of vector X. Assuming that X has zero mean we can calculate the mean
value of the projection A:

• E[A]=qTE[X]=0

• The variance of A is therefore the same as its mean-square value, and so we can
write:
• σ2 = E[A2] = E[(qTX)(XTq)] = qT E[XXT] q = qT R q
• The m-by-m matrix R is the correlation matrix of the random vector X, formally
defined as the expectation of the outer product of the vector X with itself, as shown:
• R=E[XXT]
• We observe that the matrix R is symmetric, which means that RT = R.
• Let q1, q2, …, qm denote the eigenvectors of R. The projections of x onto these
eigenvectors, aj = xTqj, may be collected into a single vector:
• a = [a1, a2, …, am]T
• = [xTq1, xTq2, …, xTqm]T
• = QTx
• where Q is the matrix constructed from the (column) eigenvectors of R.
• From the above we see that:
• x=Q a
• This is nothing more than a coordinate
transformation from the input space, of vector x, to the feature space of the vector a.
• From the perspective of the pattern recognition the usefulness of the PCA method is
that it provides an effective technique for dimensionality reduction.
• In particular we may reduce the number of features needed for effective data
representation by discarding those linear combinations in the previous formula that
have small variances and retain only these terms that have large variances.
• Let λ1, λ2, …, λl denote the largest l eigenvalues of R. We may then approximate the
vector x by keeping only the l terms associated with these eigenvalues:
• x ≈ a1q1 + a2q2 + … + alql
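
A minimal NumPy sketch of the analysis above on synthetic data: estimate the correlation matrix R from zero-mean samples, take its eigenvectors, project each sample onto the l leading eigenvectors, and measure the mean-squared error of the resulting approximation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # 500 samples of an m = 5 dimensional vector
X = X - X.mean(axis=0)                   # subtract the mean, as required above

R = (X.T @ X) / len(X)                   # sample estimate of R = E[X X^T]
eigvals, Q = np.linalg.eigh(R)           # eigh: eigen-decomposition of a symmetric matrix
order = np.argsort(eigvals)[::-1]        # sort eigenvalues in decreasing order
Q = Q[:, order]

l = 2                                    # keep the l components with the largest variance
A = X @ Q[:, :l]                         # feature-space representation a = Q^T x
X_hat = A @ Q[:, :l].T                   # approximation of x from the retained terms
print(np.mean((X - X_hat) ** 2))         # mean-squared reconstruction error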

2.15 WHAT IS FUZZY LOGIC?

Definition of fuzzy
Fuzzy – “not clear, distinct, or precise; blurred”
Definition of fuzzy logic
A form of knowledge representation suitable for notions that
cannot be defined precisely, but which depend upon their
contexts.

FUZZY LOGIC IN CONTROL SYSTEMS


Fuzzy Logic provides a more efficient and resourceful way to solve Control Systems.
Some Examples
Temperature Controller
Anti-Lock Brake System (ABS)
TEMPERATURE CONTROLLER
 The problem
 Change the speed of a heater fan, based off the room temperature and
humidity.
 A temperature control system has four settings
 Cold, Cool, Warm, and Hot
 Humidity can be defined by:
 Low, Medium, and High
 Using this we can define the fuzzy set.
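
A minimal sketch of one possible way to define this fuzzy set in code, using triangular membership functions for the four temperature settings; the breakpoint values are illustrative assumptions, not values taken from the text:

def triangular(x, a, b, c):
    # membership rises from 0 at a to 1 at b, then falls back to 0 at c
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

temperature_sets = {
    "Cold": lambda t: triangular(t, -5, 5, 15),
    "Cool": lambda t: triangular(t, 10, 18, 24),
    "Warm": lambda t: triangular(t, 20, 27, 33),
    "Hot":  lambda t: triangular(t, 30, 38, 50),
}

t = 22  # a room temperature in degrees Celsius
print({name: round(mu(t), 2) for name, mu in temperature_sets.items()})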
2.16 BENEFITS OF USING FUZZY LOGIC

ANTI-LOCK BRAKE SYSTEM (ABS)


Nonlinear and dynamic in nature
Inputs for Intel Fuzzy ABS are derived from
Brake
4 WD
Feedback
Wheel speed
Ignition
Outputs
Pulsewidth
Error lamp

Stochastic search

Stochastic search and optimization techniques are used in a vast number of areas,
including aerospace, medicine, transportation, and finance, to name but a few. Whether
the goal is refining the design of a missile or aircraft, determining the effectiveness of a
new drug, developing the most efficient timing strategies for traffic signals, or making
investment decisions in order to increase profits, stochastic algorithms can help
researchers and practitioners devise optimal solutions to countless real-world problems.

Introduction to Stochastic Search and Optimization: Estimation, Simulation, and


Control is a graduate-level introduction to the principles, algorithms, and practical aspects
of stochastic optimization, including applications drawn from engineering, statistics, and
computer science. The treatment is both rigorous and broadly accessible, distinguishing
this text from much of the current literature and providing students, researchers, and
practitioners with a strong foundation for the often-daunting task of solving real-world
problems.

Most widely used stochastic algorithms include:

• Random search
• Recursive linear estimation
• Stochastic approximation
• Simulated annealing
• Genetic and evolutionary algorithms
• Machine (reinforcement) learning
• Model selection
• Simulation-based optimization
• Markov chain Monte Carlo
• Optimal experimental design
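
As a minimal sketch of the simplest entry in this list, random search, the following code repeatedly samples candidate solutions for an illustrative one-dimensional objective and keeps the best one found; methods such as simulated annealing and evolutionary algorithms refine this basic sample-and-keep loop:

import random

def objective(x):
    return (x - 3.2) ** 2 + 1.0          # illustrative objective, minimum at x = 3.2

best_x, best_f = None, float("inf")
for _ in range(10000):
    x = random.uniform(-10, 10)          # random candidate from the search space
    f = objective(x)
    if f < best_f:                       # keep the best candidate seen so far
        best_x, best_f = x, f

print(best_x, best_f)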
Unit-3 (Mining Data Streams)

3.1 What is Data Stream Mining

Data Stream Mining (also known as stream learning) is the process of extracting
knowledge structures from continuous, rapid data records. A data stream is an ordered
sequence of instances that in many applications of data stream mining can be read only once
or a small number of times using limited computing and storage capabilities.
In many data stream mining applications, the goal is to predict the class or value of new
instances in the data stream given some knowledge about the class membership or values of
previous instances in the data stream. Machine learning techniques can be used to learn this
prediction task from labeled examples in an automated fashion. Often, concepts from the
field of incremental learning are applied to cope with structural changes, on-line learning and
real-time demands. In many applications, especially those operating within non-stationary
environments, the distribution underlying the instances or the rules underlying their labeling
may change over time, i.e. the goal of the prediction, the class to be predicted or the target
value to be predicted, may change over time. This problem is referred to as concept drift.
Detecting concept drift is a central issue in data stream mining. Other challenges that arise
when applying machine learning to streaming data include: partially and delayed labeled
data, recovery from concept drifts, and temporal dependencies.
Real-time analytics on data streams is needed to manage the data currently generated, at an
ever-increasing rate, by these applications.
•Examples:
•Financial
•Network monitoring
•Security
•Telecommunications data management
•Web applications
•Manufacturing
•Sensor networks
•Email
•blogging

Data Stream Model


• A data stream is a real time, continuous and ordered sequence of items.
• It is not possible to control the order in which the items arrive, nor is it feasible to locally
store a stream in its entirety in any memory device.

3.2 Characteristics of data stream model


•Data model and query processor must allow both order-based and time-based operations
•Inability to store a complete stream indicates that some approximate summary structures
must be used.
•Streaming query plans must not use any operators that require the entire input before any
results are produced.
•Any query that requires backtracking over a data stream is infeasible. This is due to storage
and performance constraints imposed by a data stream
•Applications that monitor streams in real-time must react quickly to unusual data values.
•Scalability requirements dictate that parallel and shared execution of many continuous
queries must be possible.
Data stored into 3 partitions
•Temporary working storage
•Summary storage
•Static storage for meta-data
Examples of Data Stream Applications
•Sensor networks
•Network Traffic analysis
•Financial Applications
•Transaction Log Analysis
3.3 Data Streaming Architecture

Data-streaming architectures are used to process data that's continuously produced as streams
of events over time, instead of static datasets.
▪ Compared to the traditional centralized "state of the world" databases and data warehouses,
data streaming applications work on the streams of events and on
application-specific local state that is an aggregate of the history of events. Some of
the advantages of streaming data processing are:
▪ Decreased latency from signal to decision.
▪ Unified way of handling real-time and historic data.
▪ Time travel queries.
▪ Real-time analysis of streaming data can empower you to react to events and insights as
they happen.
▪ Streaming data does not need to be discarded: data persistence pays off in a variety of ways.

With the right technologies, it’s possible to replicate streaming data to geo-distributed data
centers.
▪ An effective message-passing system is much more than a queue for a real-time application:
it is the heart of an effective design for an overall big data architecture.
▪ The most disruptive idea presented here is that streaming architecture should not be limited
to specialized real-time applications.
▪ Lambda architecture is a data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch- and stream-processing methods.
▪ This approach to architecture attempts to balance latency, throughput, and fault tolerance by
using batch processing to provide comprehensive and accurate views of batch data, while
simultaneously using real-time stream processing to provide views of online data.
▪ Lambda architecture describes a system consisting of three layers: batch processing, speed
(or real-time) processing, and a serving layer for responding to queries
▪ The processing layers ingest from an immutable master copy of the entire data set.

▪ Batch layer:
▪ The batch layer precomputes results using a distributed processing system that can handle
very large quantities of data.
▪ The batch layer aims at perfect accuracy by being able to process all available data when
generating views.
▪ This means it can fix any errors by re-computing based on the complete data set, then
updating existing views.
▪ Output is typically stored in a read-only database, with updates completely replacing
existing precomputed views.
▪ Apache Hadoop is the de facto standard batch-processing system used in most
high-throughput architectures.

Figure: Flow of data through the batch, speed, and serving layers of a generic lambda architecture.

Speed layer:
▪ The speed layer processes data streams in real time and without the requirements of
fix-ups or completeness.
▪ This layer sacrifices throughput as it aims to minimize latency by providing real-time
views into the most recent data.
▪ Essentially, the speed layer is responsible for filling the "gap" caused by the batch
layer's lag in providing views based on the most recent data.
▪ This layer's views may not be as accurate or complete as the ones eventually
produced by the batch layer, but they are available almost immediately after data is
received, and can be replaced when the batch layer's views for the same data become
available.
▪ Stream-processing technologies typically used in this layer include Apache
Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL
databases.
▪ Serving layer:
▪ Output from the batch and speed layers are stored in the serving layer, which responds to
ad-hoc queries by returning precomputed views or building views from the processed data.
▪ Examples of technologies used in the serving layer include Druid, which provides a single
cluster to handle output from both layers.
▪ Dedicated stores used in the serving layer include Apache Cassandra or Apache HBase for
speed-layer output, and Elephant DB or Cloudera Impala for batch-layer output.
▪ Criticism of lambda architecture has focused on its inherent complexity and its
limiting influence.
▪ The batch and streaming sides each require a different code base that must be
maintained and kept in sync so that processed data produces the same result from
both paths.

3.4 Methodologies for Stream Data Processing

 Major challenges
 Keep track of a large universe, e.g., pairs of IP address, not ages
 Methodology
 Synopses (trade-off between accuracy and storage)
 Use synopsis data structures, much smaller (O(log^k N) space) than their base
data set (O(N) space)
 Compute an approximate answer within a small error range (factor ε of the
actual answer)
 Major methods
 Random sampling
 Histograms
 Sliding windows
 Multi-resolution model
 Sketches
 Randomized algorithms

3.5 Stream Data Processing Methods

• Random sampling(but without knowing the total length in advance)


o Reservoir sampling: maintain a set of s candidates in the reservoir, which form
a true random sample of the elements seen so far in the stream. As the data
stream flows, every new element has a certain probability (s/N) of replacing an
old element in the reservoir (see the sketch after this list).
• Sliding windows
o Make decisions based only on recent data of sliding window size w
o An element arriving at time t expires at time t + w
• Histograms
o Approximate the frequency distribution of element values in a stream
o Partition data into a set of contiguous buckets
o Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)
• Multi-resolution models
Popular models: balanced binary trees, micro-clusters, and wavelets
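
A minimal sketch of the reservoir sampling method described above (often called Algorithm R): after n elements have arrived, each element of the stream has had the same chance of being kept in the s-element reservoir:

import random

def reservoir_sample(stream, s):
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)            # fill the reservoir first
        else:
            j = random.randrange(n)           # with probability s/n ...
            if j < s:
                reservoir[j] = item           # ... replace a random old element
    return reservoir

print(reservoir_sample(range(1, 101), s=5))   # 5-element sample of a 100-item stream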
Stream Data Processing Methods (2)

 Sketches
Histograms and wavelets require multiple passes over the data, but sketches can
operate in a single pass
Frequency moments of a stream A = {a1, …, aN}:
Fk = sum over i = 1..v of (mi)^k
where v is the universe or domain size and mi is the frequency of value i in the
sequence
Given N elements and v values, sketches can approximate F0, F1, F2 in
O(log v + log N) space
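
To make the definition concrete, the following sketch computes F0, F1 and F2 exactly for a small stream; sketch algorithms (e.g. the AMS sketch) approximate these quantities in a single pass using much less space:

from collections import Counter

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
m = Counter(stream)                       # m[i] = frequency of value i

F0 = len(m)                               # number of distinct values
F1 = sum(m.values())                      # length of the stream
F2 = sum(c * c for c in m.values())       # "surprise number", indicates skew
print(F0, F1, F2)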

Stream Data Processing Methods (3)

 Randomized algorithms
Monte Carlo algorithm: bound on running time but may not return the correct
result

Chebyshev’s inequality: Let X be a random variable with mean μ and standard
deviation σ. Then
P(|X − μ| > k) ≤ σ^2 / k^2
Chernoff bound:
• Let X be the sum of independent Poisson trials X1, …, Xn, and δ in (0, 1]
• The probability decreases exponentially as we move away from the mean:
P[X < (1 − δ)μ] < e^(−μδ^2 / 4)

3.6 FILTERING AND STREAMING


• The randomized algorithms and data structures we have seen so far always produce the
correct answer but have a small probability of being slow.
• In this lecture, we will consider randomized algorithms that are always fast, but have a
small probability of returning the wrong answer.
• More generally, we are interested in tradeoffs between the (likely) efficiency of the
algorithm and the (likely) quality of its output.

Bloom Filters
Whenever a list or set is used and space is a consideration, a Bloom filter should be
considered. When using a Bloom filter, consider the potential effects of false positives.
• It is a randomized data structure that is used to represent a set.
• It answers membership queries.
• It can give a FALSE POSITIVE when answering a membership query (with a very
small probability).
• But it can never return a FALSE NEGATIVE. Its two possible answers are:
o POSSIBLY IN SET
o DEFINITELY NOT IN SET
• Space efficient
• Bloom filters are a natural variant of hashing proposed by Burton Bloom in 1970 as a
mechanism for supporting membership queries in sets.
• Applications:
• Example: Email spam filtering
o We know 1 billion “good” email addresses
o If an email comes from one of these, it is NOT spam

• To motivate the Bloom-filter idea, consider a web crawler.


• It keeps, centrally, a list of all the URL's it has found so far.
• It assigns these URL's to any of a number of parallel tasks; these tasks stream back the
URL's they find in the links they discover on a page.
• It needs to filter out those URL's it has seen before.

Role of the Bloom Filter


• A Bloom filter placed on the stream of URL's will declare that certain URL's have been
seen before.
• Others will be declared new, and will be added to the list of URL's that need to be crawled.
• Unfortunately, the Bloom filter can have false positives.
• It can declare a URL has been seen before when it hasn't.
• But if it says “never seen”, then it is truly new.

How a Bloom Filter Works?


• A Bloom filter is an array of bits, together with a number of hash functions.
• The argument of each hash function is a stream element, and it returns a position in the
array.
• Initially, all bits are 0.
• When input x arrives, we set the bit h(x) to 1, for each hash function h.
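
A minimal Python sketch of this structure: a bit array plus k hash functions, simulated here by salting Python's built-in hash; a production filter would use independent hash functions such as MurmurHash and size the array from the expected number of elements and the target false-positive rate:

class BloomFilter:
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # derive num_hashes array positions from the item
        return [hash((seed, item)) % self.size for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1                # set one bit per hash function

    def might_contain(self, item):
        # True means "possibly in set" (may be a false positive)
        # False means "definitely not in set"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("http://example.com/page1")
print(bf.might_contain("http://example.com/page1"))   # True
print(bf.might_contain("http://example.com/other"))   # almost certainly False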
Counting Distinct Elements

Definition
•Data stream consists of a universe of elements chosen from a set of N
• Maintain a count of number of distinct items seen so far.
Let us consider a stream

32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

Elements occur multiple times; we want to count the number
of distinct elements.
The number of distinct elements is n (= 6 in this example).
The total number of elements in this example is 11.

Why do we count distinct elements?


• Number of distinct queries issued
• Unique IP addresses passing packages through a router
• Number of unique users accessing a website per month
• Number of different people passing through a traffic hub (airport, etc.)
• How many unique products we sold tonight?
• How many unique requests on a website came in today?
• How many different words did we find on a website?
• Unusually large number of words could be indicative of spam

Now, let’s constrain ourselves with limited storage…

• How to estimate (in an unbiased manner) the number of distinct elements seen?

• Flajolet-Martin (FM) Approach


• The FM algorithm approximates the number of unique objects in a stream or a database in
one pass.
• If the stream contains n elements with m of them unique, the algorithm runs in O(n) time
and needs O(log(m)) memory.
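
A minimal single-hash sketch of the FM idea: hash each element, track the longest run R of trailing zero bits seen, and estimate the number of distinct elements as roughly 2^R; practical implementations average or take medians over many hash functions, so a single-hash estimate like this one can be quite rough:

import hashlib

def h32(item):
    # 32-bit hash of the element's string representation
    return int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF

def trailing_zeros(x):
    # number of trailing zero bits in x (0 is treated as having none)
    if x == 0:
        return 0
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

def fm_estimate(stream):
    R = 0
    for item in stream:
        R = max(R, trailing_zeros(h32(item)))   # longest run of trailing zeros so far
    return 2 ** R

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
print(fm_estimate(stream))   # rough estimate of the 6 distinct elements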

3.7 RTAP
Definition
• Use of all available data and resources when they are needed.
• Consists of dynamic analysis and reporting based on data entered into a system less
than one minute before the actual time of use.

Real-time analytics software has three basic components:


• an aggregator that gathers data event streams (and perhaps batch files) from a variety
of data sources;

• a broker that makes data available for consumption; and

• an analytics engine that analyzes the data, correlates values and blends streams
together.

The system that receives and sends data streams and executes the application and real-time
analytics logic is called the stream processor.

How real-time analytics works


Real-time analytics often takes place at the edge of the network to ensure that data analysis is
done as close to the data's origin as possible. In addition to edge computing, other
technologies that support real-time analytics include:

• Processing in memory -- a chip architecture in which the processor is integrated into a


memory chip to reduce latency.

• In-database analytics -- a technology that allows data processing to be conducted


within the database by building analytic logic into the database itself.

• Data warehouse appliances -- a combination of hardware and software products


designed specifically for analytical processing. An appliance allows the purchaser to
deploy a high-performance data warehouse right out of the box.

• In-memory analytics -- an approach to querying data when it resides in random access


memory, as opposed to querying data that is stored on physical disks.

• Massively parallel programming -- the coordinated processing of a program by


multiple processors that work on different parts of the program, with each processor
using its own operating system and memory.

In order for the real-time data to be useful, the real-time analytics applications being used
should have high availability and low response times. These applications should also feasibly
manage large amounts of data, up to terabytes. This should all be done while returning
answers to queries within seconds.

The term real-time also includes managing changing data sources -- something that may arise
as market and business factors change within a company. As a result, the real-time analytics
applications should be able to handle big data. The adoption of real-time big data analytics
can maximize business returns, reduce operational costs and introduce an era where machines
can interact over the internet of things using real-time information to make decisions on their
own.

Different technologies exist that have been designed to meet these demands, including the
growing quantities and diversity of data. Some of these new technologies are based on
specialized appliances -- such as hardware and software systems. Other technologies utilize a
special processor and memory chip combination, or a database with analytics capabilities
embedded in its design

3.8 BENEFITS OF REAL-TIME ANALYTICS


Real-time analytics enables businesses to react without delay, quickly detect and respond to
patterns in user behavior, take advantage of opportunities that could otherwise be missed and
prevent problems before they arise.

Businesses that utilize real-time analytics greatly reduce risk throughout their company since
the system uses data to predict outcomes and suggest alternatives rather than relying on the
collection of speculations based on past events or recent scans -- as is the case with historical
data analytics. Real-time analytics provides insights into what is going on in the moment.

Other benefits of real-time analytics include:

• Data visualization. Real-time data can be visualized and reflects occurrences


throughout the company as they occur, whereas historical data can only be placed into a
chart in order to communicate an overall idea.

• Improved competitiveness. Businesses that use real-time analytics can identify


trends and benchmarks faster than their competitors who are still using historical data.
Real-time analytics also allows businesses to evaluate their partners' and competitors'
performance reports instantaneously.

• Precise information. Real-time analytics focuses on instant analyses that are


consistently useful in the creation of focused outcomes, helping ensure time is not wasted
on the collection of useless data.
• Lower costs. While real-time technologies can be expensive, their multiple and
constant benefits make them more profitable when used long term. Furthermore, the
technologies help avoid delays in using resources or receiving information.

• Faster results. The ability to instantly classify raw data allows queries to more
efficiently collect the appropriate data and sort through it quickly. This, in turn, allows for
faster and more efficient trend prediction and decision making.
3.9 CHALLENGES
One major challenge faced in real-time analytics is the vague definition of real time and the
inconsistent requirements that result from the various interpretations of the term. As a result,
businesses must invest a significant amount of time and effort to collect specific and detailed
requirements from all stakeholders in order to agree on a specific definition of real time, what
is needed for it and what data sources should be used.

Once the company has unanimously decided on what real time means, it faces the challenge
of creating an architecture with the ability to process data at high speeds. Unfortunately, data
sources and applications can cause processing-speed requirements to vary from milliseconds
to minutes, making creation of a capable architecture difficult. Furthermore, the architecture
must also be capable of handling quick changes in data volume and should be able to scale up
as the data grows.

The implementation of a real-time analytics system can also present a challenge to a


business's internal processes. The technical tasks required to set up real-time analytics -- such
as creation of the architecture -- often cause businesses to ignore changes that should be made
to internal processes. Enterprises should view real-time analytics as a tool and starting point
for improving internal processes rather than as the ultimate goal of the business.

Finally, companies may find that their employees are resistant to the change when
implementing real-time analytics. Therefore, businesses should focus on preparing their staff
by providing appropriate training and fully communicating the reasons for the change to real-
time analytics.

3.10 Use cases for real-time analytics in customer experience management


In customer relations management and customer experience management, real-time analytics
can provide up-to-the-minute information about an enterprise's customers and present it so
that better and quicker business decisions can be made -- perhaps even within the time span
of a customer interaction.

Here are some examples of how enterprises are tapping into real-time analytics:

• Fine-tuning features for customer-facing apps. Real-time analytics adds a level of


sophistication to software rollouts and supports data-driven decisions for core feature
management.

• Managing location data. Real-time analytics can be used to determine what data
sets are relevant to a particular geographic location and signal the appropriate updates.

• Detecting anomalies and frauds. Real-time analytics can be used to identify


statistical outliers caused by security breaches, network outages or machine failures.

• Empowering advertising and marketing campaigns. Data gathered from ad


inventory, web visits, demographics and customer behavior can be analyzed in real time
to uncover insights that hopefully will improve audience targeting, pricing strategies
and conversion rates.
Examples
3.11 EXAMPLES OF REAL-TIME ANALYTICS INCLUDE:

• Real-time credit scoring. Instant updates of individuals' credit scores allow financial
institutions to immediately decide whether or not to extend the customer's credit.

• Financial trading. Real-time big data analytics is being used to support decision-
making in financial trading. Institutions use financial databases, satellite weather stations
and social media to instantaneously inform buying and selling decisions.

• Targeting promotions. Businesses can use real-time analytics to deliver promotions


and incentives to customers while they are in the store and surrounded by the
merchandise to increase the chances of a sale.

• Healthcare services. Real-time analytics is used in wearable devices -- such


as smartwatches -- and has already proven to save lives through the ability to monitor
statistics, such as heart rate, in real time.

• Emergency and humanitarian services. By attaching real-time analytical engines


to edge devices -- such as drones -- incident responders can combine powerful
information, including traffic, weather and geospatial data, to make better informed and
more efficient decisions that can improve their abilities to respond to emergencies and
other events.

3.12 TYPES OF REAL-TIME ANALYTICS


• On-Demand Real Time Analytics
Reactive , waits for users to request a query and then
delivers the analytics.
• Continuous Real Time Analytics
Proactive and alerts users with continuous updates in
real time.

Example Applications

• Financial services
• Government
• E-Commerce Sites
• Insurance Industry

3.13 Generic Design of an RTAP


Three aspects of data flows to system
Input
Process and Store Input
Output
Key capabilities
• Delivering In-memory Transaction speeds
• Quickly moving unneeded data to disk for long term storage
• Distributing data and processing for speeds
• Supporting continuous queries for Real Time events
• Embedding data into Apps or Apps into DB
• Additional Requirements
Technologies
• Processing in Memory (PIM)
• In-database analytics
• Data warehouse appliances
• In-memory analytics
• Massively parallel programming (MPP)

3.14 Stock Market Predictions


• Check historical stock prices and try to predict the future using different models.
• Real time – stock market trends change continuously under economic forces, new
products, competition, events and regulations.
Three basic components
• Incoming, real time trading data must be captured and stored, becoming historical
data.
• System – able to learn from historical trends in the data and recognize patterns and
probabilities to inform decisions.
• System need to do a real time comparison of new, incoming trading data with the
learned patterns and probabilities based on historical data.

Steps
• Live data, e.g. from Yahoo Finance, is read and processed
• Data is stored in memory for fast access
• Using the live data, a Spark MLlib application creates and trains a model
• Results of the machine learning model are pushed to other interested applications
• As data ages and starts to become cool, it is moved from Apache Geode to Apache
HAWQ and eventually lands in Apache Hadoop

3.15 Real Time stock Predictions

Data Analytics/Stock Market Predictions


Unit-4 (Frequent Itemsets and Clustering)

4.1 MINING FREQUENT PATTERNS, ASSOCIATION AND CORRELATIONS:

Basic Concepts and Methods

 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that


occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent
itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis.

4.2 WHY IS FREQ. PATTERN MINING IMPORTANT?


 Freq. pattern: An intrinsic and important property of datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications

4.3 INTRODUCTION TO MARKET BASKET ANALYSIS


• Market Basket Analysis (Association Analysis) is a mathematical modeling technique
based upon the theory that if you buy a certain group of items, you are likely to buy
another group of items.
• It is used to analyze the customer purchasing behavior and helps in increasing the
sales and maintain inventory by focusing on the point of sale transaction data.
• Consider shopping cart filled with several items
• Market basket analysis tries to answer the following questions:
• Who makes purchases?
• What items tend to be purchased together
• obvious: steak-potatoes; beer-pretzels
• What items are purchased sequentially
• obvious: house-furniture; car-tires
• In what order do customers purchase items?
• What items tend to be purchased by season
• It is also about what customers do not purchase, and why.
• If customers purchase baking powder, but no flour, what are they
baking?
• If customers purchase a mobile phone, but no case, are you missing an
opportunity?
• Categorize customer purchase behavior
• identify actionable information
• purchase profiles
• profitability of each purchase profile
• use for marketing
• layout or catalogs
• select products for promotion
• space allocation, product placement

4.4 Market Basket Benefits

• Selection of promotions, merchandising strategy


 sensitive to price: Italian entrees, pizza, pies, Oriental entrees, orange
juice
• Uncover consumer spending patterns
 correlations: orange juice & waffles
• Joint promotional opportunities

• A database of customer transactions
• Each transaction is a set of items
• Example: the transaction with TID 111 contains the items {Pen, Ink, Milk, Juice}

TID CID Date Item Qty


111 201 5/1/99 Pen 2
111 201 5/1/99 Ink 1
111 201 5/1/99 Milk 3
111 201 5/1/99 Juice 6
112 105 6/3/99 Pen 1
112 105 6/3/99 Ink 1
112 105 6/3/99 Milk 1
113 106 6/5/99 Pen 1
113 106 6/5/99 Milk 1
114 201 7/1/99 Pen 2
114 201 7/1/99 Ink 2
114 201 7/1/99 Juice 4

Coocurrences
• 80% of all customers purchase items X, Y and Z together.

Association rules
• 60% of all customers who purchase X and Y also buy Z.

Sequential patterns
• 60% of customers who first buy X also purchase Y within three weeks.

4.5 ASSOCIATION RULE MINING

• Proposed by Agrawal et al in 1993.

• It is an important data mining model studied extensively by the database and data
mining community.

• Assume all data are categorical.

• No good algorithm for numeric data.

• Initially used for Market Basket Analysis to find how items purchased by customers
are related.

Bread → Milk [sup = 5%, conf = 100%]


• Motivation: finding regularities in data

• What products were often purchased together? — Beer and diapers

• What are the subsequent purchases after buying a PC?

• What kinds of DNA are sensitive to this new drug?

• Can we automatically classify web documents?

• Given a set of transactions, find rules that will predict the occurrence of an item based
on the occurrences of other items in the transaction.

• Market-Basket transactions


TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example of Association Rules

{Diaper}→{Beer},
{Milk,Bread}→{Eggs,Coke},
{Beer, Bread} → {Milk},
Basket Data
Retail organizations, e.g., supermarkets, collect and store massive amounts of sales data,
called basket data.
A record consists of
- transaction date
- items bought
Or, basket data may consist of items bought by a customer over a period.
• Items frequently purchased together:
Bread ⇒ PeanutButter

4.6 Example Association Rule

90% of transactions that purchase bread and butter also purchase milk
“IF” part = antecedent
“THEN” part = consequent
“Item set” = the items (e.g., products) comprising the antecedent or consequent
• Antecedent and consequent are disjoint (i.e., have no items in common)
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%

Transaction data: supermarket data

• Market basket transactions:


t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID (transaction ID)
• A transactional dataset: A set of transactions
4.7 Definition: Frequent Itemset

• Itemset
• A collection of one or more items
• Example: {Milk, Bread, Diaper}
• k-itemset
• An itemset that contains k items
• Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread,Diaper}) = 2
• Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
• An itemset whose support is greater than or equal to a minsup threshold

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
The model: data

• I = {i1, i2, …, im}: a set of items.


• Transaction t: a set of items such that t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
• I: itemset
{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
• T: set of transactions
1. {cucumber, parsley, onion, tomato, salt, bread}
2. {tomato, cucumber, parsley}
3. {tomato, cucumber, olives, onion, parsley}
4. {tomato, cucumber, onion, bread}
5. {tomato, salt, onion}
6. {bread, cheese}
7. {tomato, cheese, cucumber}
8. {bread, butter}

The model: Association rules


• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:

X → Y, where X, Y ⊂ I, and X ∩Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset

Rule strength measures


• Support: The rule holds with support sup in T (the transaction data set) if sup% of
transactions contain X ∪ Y.
• sup = probability that a transaction contains Pr(X ∪ Y)
(Percentage of transactions that contain X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions that
contain X also contain Y.
• conf = conditional probability that a transaction having X also contains Y
Pr(Y | X)
(Ratio of number of transactions that contain X ∪ Y to the number that contain X)
• An association rule is a pattern that states when X occurs, Y occurs with certain
probability.

4.8 SUPPORT AND CONFIDENCE

• Support count: The support count of an itemset X, denoted by X.count, in a data set
T is the number of transactions in T that contain X. Assume T has n transactions.
Then,

support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).

Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example:
{Milk, Diaper} → {Beer}

Rule Evaluation Metrics


Support (s)
Fraction of transactions that contain both X and Y
Confidence (c)
Measure how often items in Y appear in transactions that contain X

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
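
The same support and confidence values can be verified with a minimal sketch that computes them directly from the five transactions above:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X = {"Milk", "Diaper"}
Y = {"Beer"}

count_X = sum(1 for t in transactions if X <= t)          # transactions containing X
count_XY = sum(1 for t in transactions if (X | Y) <= t)   # transactions containing X and Y

support = count_XY / len(transactions)
confidence = count_XY / count_X
print(support, confidence)        # 0.4 and 0.666...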
Can minimum support and minimum confidence be automatically determined when
mining association rules?

• For the minimum support, it all depends on the dataset. Usually, one may start with a
high value and then decrease it until a value is found that generates enough patterns.
• For the minimum confidence, it is a little bit easier because it represents the
confidence that you want in the rules. So usually, something like 60% is used. But it
also depends on the data.
• In terms of performance, when minsup is higher you will find fewer patterns and the
algorithm is faster. For minconf, when it is set higher there will be fewer patterns, but it
may not be faster because many algorithms don't use minconf to prune the search
space. So obviously, setting these parameters also depends on how many rules you
want.

4.9 Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
if an itemset is frequent, each of its subsets is frequent as well.
 This property belongs to a special category of properties called
antimonotonicity in the sense that if a set cannot pass a test, all of its supersets
will fail the same test as well.
 Rule Generation
– Generate high confidence rules from each frequent itemset, where each
rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive

4.10 FREQUENT ITEMSET GENERATION

• An itemset X is closed in a data set D if there exists no proper super-itemset Y* such


that Y has the same support count as X in D.
*(Y is a proper super-itemset of X if X is a proper sub-itemset of Y, that is, if X ⊂ Y.
In other words, every item of X is contained in Y but there is at least one item of Y that
is not in X.)
• An itemset X is a closed frequent itemset in set D if X is both closed and frequent in
D.
An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X is frequent,
and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D

Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)


• Complete search: M=2d
• Use pruning techniques to reduce M
• Reduce the number of transactions (N)
• Reduce size of N as the size of itemset increases
• Used by DHP (Direct Hashing & Pruning) and vertical-based mining
algorithms
• Reduce the number of comparisons (NM)
• Use efficient data structures to store the candidates or transactions
No need to match every candidate against every transaction

Many mining algorithms

• There are a large number of them!!


• They use different strategies and data structures.
• Their resulting sets of rules are all the same.
• Given a transaction data set T, and a minimum support and a minimum
confident, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules although their computational
efficiencies and memory requirements may be different.
• We study only one: the Apriori Algorithm

4.11 APRIORI ALGORITHM


• The algorithm uses a level-wise search, where k-itemsets are used to explore (k+1)-
itemsets
• In this algorithm, frequent subsets are extended one item at a time (this step is known
as candidate generation process)
• Then groups of candidates are tested against the data.
• It identifies the frequent individual items in the database and extends them to larger
and larger item sets as long as those itemsets appear sufficiently often in the database.
• Apriori algorithm determines frequent itemsets that can be used to determine
association rules which highlight general trends in the database.
• The Apriori algorithm takes advantage of the fact that any subset of a frequent itemset
is also a frequent itemset.
• i.e., if {l1,l2} is a frequent itemset, then {l1} and {l2} should be frequent
itemsets.
• The algorithm can therefore, reduce the number of candidates being considered by
only exploring the itemsets whose support count is greater than the minimum support
count.
• All infrequent itemsets can be pruned if it has an infrequent subset.
How do we do that?
• So we build a Candidate list of k-itemsets and then extract a Frequent list of
k-itemsets using the support count.
• After that, we use the Frequent list of k-itemsets in determining the Candidate and
Frequent lists of (k+1)-itemsets.
• We use pruning to do that.
• We repeat until we have an empty Candidate or Frequent list of k-itemsets.
• Then we return the list of (k-1)-itemsets.
KEY CONCEPTS
• Frequent Itemsets: All the itemsets that satisfy the minimum support (the set of frequent
i-itemsets is denoted by Li).
• Apriori Property: Any subset of a frequent itemset must be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with
itself.

The Apriori Algorithm : Pseudo Code


Goal: find all itemsets I s.t. supp(I) > minsupp
• For each item X check if supp(X) > minsupp then retain I1 = {X}
• K=1
• Repeat

• For every itemset Ik, generate all itemsets Ik+1 s.t. Ik ⊂ Ik+1
• Scan all transactions and compute supp(Ik+1) for all itemsets Ik+1
• Drop itemsets Ik+1 with support < minsupp
• Until no new frequent itemsets are found
Association Rules
Finally, construct all rules X → Y s.t.
• XY has high support
• Supp(XY)/Supp(X) > min-confidence
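
A minimal Python sketch of the frequent-itemset part of this pseudo code (join, prune, and support counting); rule generation from the resulting frequent itemsets is omitted, and the transactions reuse the market-basket example from earlier in this unit:

from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= minsup}
    frequent = set(Lk)
    k = 2
    while Lk:
        # join step: candidate k-itemsets from frequent (k-1)-itemsets
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in candidates if support_count(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
for itemset in sorted(apriori(transactions, minsup=3), key=len):
    print(set(itemset))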
4.12 LIMITATIONS
• Apriori algorithm can be very slow and the bottleneck is candidate generation.
• For example, if the transaction DB has 10^4 frequent 1-itemsets, it will generate about 10^7
candidate 2-itemsets even after employing the downward closure.
• To compute those with support no less than min_sup, the database needs to be scanned at
every level. This requires (n + 1) scans, where n is the length of the longest pattern.
4.13 METHODS TO IMPROVE APRIORI’S EFFICIENCY
• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is
below the threshold cannot be frequent
• Transaction reduction: A transaction that does not contain any frequent k-itemset is useless
in subsequent scans
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one
of the partitions of DB.
• Sampling: mining on a subset of given data, lower support threshold + a method to
determine the completeness
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets are
estimated to be frequent.
4.14 WHAT IS CLUSTER ANALYSIS?
•Cluster: a collection of data objects
•Similar to one another within the same cluster

•Dissimilar to the objects in other clusters


•Cluster analysis
•Grouping a set of data objects into clusters
•Clustering is unsupervised classification: no predefined classes

•Typical applications
•As a stand-alone tool to get insight into data distribution

•As a preprocessing step for other algorithms


•Given a collection of objects, put objects into groups based on similarity.
•Used for “discovery-based” science, to find unexpected patterns in data.
•Also called “unsupervised learning” or “data mining”
•Inherently an ill-defined problem

4.15 GENERAL APPLICATIONS OF CLUSTERING


Pattern Recognition
•Spatial Data Analysis
•create thematic maps in GIS by clustering feature spaces
•detect spatial clusters and explain them in spatial data mining
•Image Processing
•Economic Science (especially market research)
•WWW
•Document classification
•Cluster Weblog data to discover groups of similar access patterns

4.16 Requirements of Clustering in Data Mining


•Scalability
•Ability to deal with different types of attributes
•Discovery of clusters with arbitrary shape
•Minimal requirements for domain knowledge to determine input parameters
•Able to deal with noise and outliers
•Insensitive to order of input records
•High dimensionality
•Incorporation of user-specified constraints
•Interpretability and usability

4.17 Similarity and Dissimilarity Measures


•In clustering techniques, similarity (or dissimilarity) is an important measurement.
•Informally, similarity between two objects (e.g., two images, two documents, two records,
etc.) is a numerical measure of the degree to which two objects are alike.
•The dissimilarity on the other hand, is another alternative (or opposite) measure of the
degree to which two objects are different.
•Both similarity and dissimilarity also termed as proximity.
•Usually, similarity and dissimilarity are non-negative numbers and may range from zero
(highly dissimilar, i.e. no similarity) to some finite or infinite value (highly similar, i.e. no
dissimilarity).
Note:
•Frequently, the term distance is used as a synonym for dissimilarity
•In fact, it is used to refer to a special case of dissimilarity.

Measure the Quality of Clustering


• Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function,
which is typically metric: d(i, j)
• There is a separate “quality” function that measures the “goodness” of a cluster.
• The definitions of distance functions are usually very different for interval-scaled,
boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and data
semantics.
• It is hard to define “similar enough” or “good enough”
• the answer is typically highly subjective.

4.18 MAJOR CLUSTERING APPROACHES

• Partitioning algorithms: Construct various partitions and then evaluate them by


some criterion

• Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or


objects) using some criterion

• Density-based: based on connectivity and density functions

• Grid-based: based on a multiple-level granularity structure

• Model-based: A model is hypothesized for each of the clusters and the idea is to find
the best fit of that model to each other

• Important distinction between partitional and hierarchical sets of clusters

• Partitional Clustering

• A division data objects into non-overlapping subsets (clusters) such that each
data object is in exactly one subset

• Hierarchical clustering

• A set of nested clusters organized as a hierarchical tree

4.19 PARTITIONING ALGORITHMS: BASIC CONCEPT


• Partitioning method: Construct a partition of a database D of n objects into a set of k
clusters
• Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion

• Global optimal: exhaustively enumerate all partitions


• Heuristic methods: k-means and k-medoids algorithms

• k-means (MacQueen’67): Each cluster is represented by the center of the


cluster

• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87):


Each cluster is represented by one of the objects in the cluster

4.20 K-MEANS CLUSTERING


•Simple Clustering: K-means
•Given k, the k-means algorithm consists of four steps
(the basic version works with numeric data only):
1) Select initial centroids at random - pick a number (K) of cluster centers (centroids) at
random
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2-3 until convergence (change in cluster assignments less than a threshold)

4.21 K-means Algoritms


•Initialization
•Arbitrarily choose k objects as the initial cluster centers (centroids)
•Iterate until no change
•For each object Oi
•Calculate the distances between Oi and the k centroids
•(Re)assign Oi to the cluster whose centroid is the closest to Oi
•Update the cluster centroids based on the current assignment
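
A minimal plain-Python sketch of these steps for 2-D points (random initial centroids, nearest-centroid assignment, centroid update, repeat); the sample points are illustrative:

import random

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)              # step 1: random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: assign to the nearest centroid
            distances = [(p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        for i, cluster in enumerate(clusters):        # step 3: move centroid to the mean
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centroids, clusters = kmeans(points, k=2)
print(centroids)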
4.22 Weaknesses of K-Mean Clustering

1. When the number of data points is small, the initial grouping determines the clusters
significantly.
2. The number of clusters, K, must be determined beforehand. A disadvantage is that the
algorithm does not yield the same result with each run, since the resulting clusters depend on
the initial random assignments.
3. We never know the real clusters: using the same data, if it is input in a different order it
may produce different clusters when the number of data points is small.
4. It is sensitive to the initial condition. Different initial conditions may produce different
clustering results, and the algorithm may be trapped in a local optimum.

4.23 APPLICATIONS OF K-MEAN CLUSTERING

• It is relatively efficient and fast. It computes the result in O(tkn) time, where n is the number
of objects or points, k is the number of clusters and t is the number of iterations.
• k-means clustering can be applied to machine learning or data mining
• Used on acoustic data in speech understanding to convert waveforms into one of k
categories (known as Vector Quantization or Image Segmentation).
• Also used for choosing color palettes on old fashioned graphical display devices and Image
Quantization.

4.24 CONCLUSION

• The K-means algorithm is useful for undirected knowledge discovery and is relatively simple.
• K-means has found widespread use in many fields, ranging from unsupervised learning of neural networks, pattern recognition, classification analysis, and artificial intelligence to image processing, machine vision, and many others.

4.25 CLIQUE (CLUSTERING IN QUEST)


• Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
• Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
• CLIQUE can be considered both density-based and grid-based
– It partitions each dimension into the same number of equal-length intervals
– It partitions an m-dimensional data space into non-overlapping rectangular units
– A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
– A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
• Partition the data space and find the number of points that lie inside each cell of the partition (a small sketch of this step follows below).
• Identify the subspaces that contain clusters using the Apriori principle.
• Identify clusters:
– Determine dense units in all subspaces of interest.
– Determine connected dense units in all subspaces of interest.
• Generate a minimal description for the clusters:
– Determine maximal regions that cover a cluster of connected dense units for each cluster.
– Determine a minimal cover for each cluster.
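The first step (gridding the space and counting points per cell) can be sketched as follows for the 1-dimensional subspaces; the grid resolution xi and density threshold tau are illustrative parameter names and values, not settings from the original CLIQUE paper. Higher-dimensional candidate units would then be generated from these 1-D dense units in Apriori fashion.

import numpy as np
from collections import Counter

def dense_units_1_dim(X, xi=10, tau=0.05):
    """Sketch of CLIQUE's gridding step for 1-dimensional subspaces:
    partition each dimension into xi equal-length intervals and keep the
    'dense' cells whose fraction of points exceeds the threshold tau."""
    n, d = X.shape
    dense = {}
    for dim in range(d):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        width = (hi - lo) / xi or 1.0
        # map every point to its interval index along this dimension
        cells = np.minimum(((X[:, dim] - lo) / width).astype(int), xi - 1)
        counts = Counter(cells)
        dense[dim] = [c for c, cnt in counts.items() if cnt / n > tau]
    return dense

# Example usage on random 3-dimensional data
X = np.random.default_rng(0).normal(size=(500, 3))
print(dense_units_1_dim(X))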

4.26 Strength and Weakness of CLIQUE


• Strength
– automatically finds subspaces of the highest dimensionality such that high
density clusters exist in those subspaces
– insensitive to the order of records in input and does not presume some canonical
data distribution
– scales linearly with the size of input and has good scalability as the number of
dimensions in the data increases
• Weakness
– The accuracy of the clustering result may be degraded in exchange for the simplicity of the method
4.27 FREQUENT PATTERN-BASED APPROACH
• Clustering high-dimensional spaces (e.g., clustering text documents, microarray data)
– Projected subspace clustering: which dimensions should be projected on?
• CLIQUE, PROCLUS
– Feature extraction: costly and may not be effective
– Using frequent patterns as “features”
• “Frequent” patterns are inherent features
• Mining frequent patterns may not be so expensive
Unit-5 (Frameworks and Visualization)

5.1 WHAT IS HADOOP?

• The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models.
• It is developed by the Apache Software Foundation (Hadoop 1.0 was released in 2011).
• It is written in Java.
o Hadoop is open source software.
o Framework
o Massive storage
o Processing power
• Big data is a term used to describe the very large amounts of unstructured and semi-structured data a company creates.
• The term is typically used when talking about petabytes and exabytes of data.
• That much data would take too much time and cost to load into a relational database for analysis.
• Facebook has almost 10 billion photos, taking up about 1 petabyte of storage.
So what is the problem?
1. Processing such large data in a relational database is very difficult.
2. It would take too much time and cost too much to process the data.

We can solve this problem by Distributed Computing.

But the problems with distributed computing are:

• Hardware failure
The chance of hardware failure is always present.
• Combining the data after analysis
Data has to be combined from all the disks after analysis, which is difficult to manage.

Hadoop was designed to solve both of these problems.


It has two main parts –
1. Hadoop Distributed File System (HDFS),
2. Data Processing Framework & MapReduce

5.2 HADOOP DISTRIBUTED FILE SYSTEM

It ties many small, reasonably priced machines together into a single cost-effective computer cluster.

Data and application processing are protected against hardware failure.

If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computation does not fail.
It automatically stores multiple copies of all data.
It provides a simplified programming model which allows users to quickly read from and write to the distributed system.

5.3 MAPREDUCE

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
It is a programming model and an associated implementation for processing and generating large data sets.
The MAP function processes a key/value pair to generate a set of intermediate key/value pairs.
The REDUCE function merges all intermediate values associated with the same intermediate key.

• Now that we have described how Hadoop stores data, let's turn our attention to how it processes data.
• We typically process data in Hadoop using MapReduce.
• MapReduce is not a language; it's a programming model.
• MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node.
• MapReduce consists of two functions:
 map (K1, V1) -> list(K2, V2)
 reduce (K2, list(V2)) -> list(K3, V3)

5.4 Why MapReduce is so popular

 Automatic parallelization and distribution (The biggest advantage).


 Fault-tolerance (individual tasks can be retried)
 Hadoop comes with standard status and monitoring tools.
 A clean abstraction for developers.
 MapReduce programs are usually written in Java (possibly in other languages using
streaming)
The map function always runs first
• Typically used to “break down” the input
• Filter, transform, or parse data, e.g. parse the stock symbol, price, and time from a data feed
• The output from the map function (eventually) becomes the input to the reduce function
The reduce function
• Typically used to aggregate data from the map function
• e.g. compute the average hourly price of the stock
• Not always needed and therefore optional
• You can run something called a “map-only” job

5.5 Understanding Map and Reduce


Between these two tasks there is typically a hidden phase known as the “Shuffle and Sort”,
• which organizes the map output for delivery to the reducers. Each individual piece is simple, but collectively they are quite powerful.
• Analogous to a pipe/filter chain in Unix.

Terminology
• The client program submits a job to Hadoop.
• The job consists of a mapper, a reducer, and a list of inputs.
• The job is sent to the JobTracker process on the Master Node.
• Each Slave Node runs a process called the TaskTracker.
• The JobTracker instructs TaskTrackers to run and monitor tasks.
• A Map or Reduce over a piece of data is a single task.
• A task attempt is an instance of a task running on a slave node.
MapReduce Failure Recovery

 Task processes send heartbeats to the TaskTracker.


 TaskTrackers send heartbeats to the JobTracker.
 Any task that fails to report in 10 minutes is assumed to have failed- its JVM is killed
by the TaskTracker.
 Any task that throws an exception is said to have failed.
 Failed tasks are reported to the JobTracker by the TaskTracker.
 The JobTracker reschedules any failed tasks - it tries to avoid rescheduling the task on
the same TaskTracker where it previously failed.
 If a task fails more than 4 times, the whole job fails.
TaskTracker Recovery

 Any TaskTracker that fails to report in 10 minutes is assumed to have crashed.


 All tasks on the node are restarted elsewhere
 Any TaskTracker reporting a high number of failed tasks is blacklisted, to
prevent the node from blocking the entire job.
 There is also a “global blacklist”, for TaskTrackers which fail on multiple
jobs.
 The JobTracker manages the state of each job and partial results of failed tasks are
ignored.
Example: Word Count
• We have a large file of words, one word to a line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
MapReduce
• Input: a set of key/value pairs
• User supplies two functions:
• map(k, v) -> list(k1, v1)
• reduce(k1, list(v1)) -> v2
• (k1,v1) is an intermediate key/value pair
• Output is the set of (k1,v2) pairs

MapReduce: Word Count
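The word-count example can be written out in plain Python to show the shape of the map and reduce functions and of the shuffle step between them; this is only a single-machine simulation of the data flow, not Hadoop API code.

from collections import defaultdict

def map_fn(_, line):
    # map(k, v) -> list(k1, v1): emit (word, 1) for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce(k1, list(v1)) -> v2: sum the counts for each word
    return word, sum(counts)

def run_mapreduce(lines):
    # Shuffle and Sort: group all intermediate values by key
    grouped = defaultdict(list)
    for i, line in enumerate(lines):
        for word, one in map_fn(i, line):
            grouped[word].append(one)
    # Reduce phase
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(run_mapreduce(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}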

5.6 Benefits of MapReduce

• Simplicity (via fault tolerance)


• Particularly when compared with other distributed programming models
• Flexibility
• Offers more analytic capabilities and works with more data types than
platforms like SQL
• Scalability
• Because it works with
• Small quantities of data at a time
• Running in parallel across a cluster
• Sharing nothing among the participating nodes

5.7 HDFS: Hadoop Distributed File System

• Based on Google's GFS (Google File System)


• Provides inexpensive and reliable storage for massive amounts of data
• Optimized for a relatively small number of large files
• Each file likely to exceed 100 MB, multi-gigabyte files are common
• Store file in hierarchical directory structure
• e.g. , /sales/reports/asia.txt
• Cannot modify files once written
• Need to make changes? remove and recreate
• Data is distributed across all nodes at load time
• Provides for efficient Map Reduce processing
• Use Hadoop specific utilities to access HDFS
HDFS Design
• Runs on commodity hardware
• Assumes high failure rates of the components
• Works well with lots of large files
• Hundreds of gigabytes or terabytes in size
• Built around the idea of “write-once, read many-times”
• Large streaming reads
• Not random access
• Responsible for storing data on the cluster
• Data files are split into blocks and distributed across the nodes in the cluster (Each
block is replicated multiple times)
• High throughput is more important than low latency
HDFS and Unix File System
• In some ways, HDFS is similar to a UNIX filesystem
• Hierarchical, with UNIX-style paths (e.g. /sales/reports/asia.txt)
• UNIX-style file ownership and permissions
• There are also some major deviations from UNIX
• No concept of a current directory
• Cannot modify files once written
• You can delete them and recreate them, but you can't modify them
• Must use Hadoop specific utilities or custom code to access HDFS
5.8 HDFS ARCHITECTURE

Hadoop has a master/slave architecture


HDFS master daemon: Name Node
• It stores metadata and manages access
• Manages namespace (file to block mappings) and metadata (block to machine
mappings)
• Monitors slave nodes
HDFS slave daemon: Data Node
• Reads and writes the actual data
Provides reliability through replication
• Each Block is replicated across several Data Nodes
How are Files Stored

• Generally the user data is stored in the files of HDFS.


• Files are split into blocks
• Blocks are split across many machines at load time
• Different blocks from the same file will be stored on different machines
• Blocks are replicated across multiple machines
• The NameNode keeps track of which blocks make up a file and where they are stored
• In other words, the minimum amount of data that HDFS can read or write is called a block.
• The default block size is 64 MB (in older Hadoop versions), but it can be changed as needed in the HDFS configuration.

Example:
The NameNode holds metadata for the two files
• Foo.txt (300 MB) and Bar.txt (200 MB)
• Assume HDFS is configured for 128 MB blocks
The DataNodes hold the actual blocks
• Each block is at most 128 MB in size (the last block of a file may be smaller)
• Each block is replicated three times on the cluster
• Block reports are periodically sent to the NameNode
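A quick back-of-the-envelope check of this example (the file names and sizes are taken from the slide above; the replication factor of three is the default mentioned earlier):

import math

BLOCK_MB = 128      # configured block size
REPLICATION = 3     # default HDFS replication factor

files = {"Foo.txt": 300, "Bar.txt": 200}  # sizes in MB, from the example above

total_blocks = 0
for name, size_mb in files.items():
    blocks = math.ceil(size_mb / BLOCK_MB)
    total_blocks += blocks
    print(f"{name}: {size_mb} MB -> {blocks} blocks")

print("Total blocks:", total_blocks)                              # 3 + 2 = 5
print("Block replicas on the cluster:", total_blocks * REPLICATION)  # 15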

HDFS Architecture
Role of NameNode
 The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software.
 It is a software that can be run on commodity hardware.
 The system having the namenode acts as the master server.
‒ It stores all metadata: filenames, locations of each block on Data Nodes, file
attributes, etc...
‒ Block and Replica management
‒ Health of Data Nodes through block reports
‒ Keeps metadata in RAM for fast lookup
‒ Regulates client’s access to files.
‒ It also executes file system operations such as renaming, closing, and opening
files and directories
Functionalities of NameNode
 Running on a single machine, the NameNode daemon determines and tracks where
the various blocks of a data file are stored.
 If a client application wants to access a particular file stored in HDFS, the
application contacts the NameNode.
 NameNode provides the application with the locations of the various blocks for
that file.
 For performance reasons, the NameNode keeps this metadata in a machine’s memory.
 Because the NameNode is critical to the operation of HDFS, any unavailability or
corruption of the NameNode results in a data unavailability event on the cluster.
 Thus, the NameNode is viewed as a single point of failure in the Hadoop
environment.
 To minimize the chance of a NameNode failure and to improve performance, the
NameNode is typically run on a dedicated machine.
Role of DataNode
 The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software.
 For every node (Commodity hardware/System) in a cluster, there will be a datanode.
‒ The DataNode daemon manages the data stored on each machine.
‒ It stores file contents as blocks.
‒ Different blocks of the same file are stored on different Datanodes
‒ The same block is replicated across several DataNodes for redundancy

Functionalities of DataNode
• Datanodes perform read-write operations on the file systems, as per client request.

• They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

• Each DataNode periodically builds a report about the blocks stored on the
DataNode and sends the report to the NameNode.

Role of Secondary NameNode


 A helper node for Namenode

 Performs memory-intensive administrative functions for the Namenode

 Maintains a checkpoint of the file system (HDFS) metadata

 Not a Backup Node

 Recommended to run on a separate machine

• It requires as much RAM as the primary NameNode

Functionalities of Secondary NameNode


• A third daemon, the Secondary NameNode.
• It provides the capability to perform some of the NameNode tasks to reduce the
load on the NameNode.

• Such tasks include updating the file system image with the contents of the file system
edit logs.

• In the event of a NameNode outage, the NameNode must be restarted and initialized
with the last file system image file and the contents of the edits logs.

• Periodically combines a prior file system snapshot and the edit logs into a new snapshot.
The new snapshot is sent back to the NameNode.

NameNode Failure
• Losing the NameNode is equivalent to losing all the files on the filesystem

• Hadoop provides two options:

• Back up the files that make up the persistent state of the file system (local or NFS
mount)

• Run a Secondary NameNode

DataNode Failure and Recovery


• DataNodes exchange heartbeats with the NameNode

• If no heartbeat is received within a certain time period, the DataNode is assumed to be lost.

• NameNode determines which blocks were on the lost node

• NameNode finds other copies of these 'lost' blocks and replicates them to other
nodes.

• Block replication is actively maintained.

5.9 WHAT IS HIVE?


Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive makes it easy to perform operations like
• Data encapsulation
• Ad-hoc queries
• Analysis of huge datasets
Important characteristics of Hive
• In Hive, tables and databases are created first and then data is loaded into these tables.

• Hive is a data warehouse built on top of Hadoop, designed for managing and querying only structured data that is stored in tables.

• While dealing with structured data, MapReduce doesn't have the optimization and usability features that the Hive framework does.
• Query optimization refers to an effective way of executing a query in terms of performance.

• Hive's SQL-inspired language separates the user from the complexity of MapReduce programming.

• It reuses familiar concepts from the relational database world, such as tables, rows, columns and schemas, for ease of learning.

• Hadoop's programming works on flat files.

• So, Hive can use directory structures to "partition" data to improve performance on certain queries.

5.10 Hive Vs Relational Databases


• Relational databases follow "schema on write" (as well as schema on read): a table is created first and then data is inserted into that table. On relational database tables, operations like insertions, updates, and modifications can be performed.

• Hive is "schema on read" only. So operations like update and modification generally do not work, because a Hive query in a typical cluster runs on multiple Data Nodes, and it is not possible to update and modify data across multiple nodes.

• Also, Hive supports a "write once, read many" pattern.

• This means that a table is typically written once and read many times; updates to an existing table are supported only in the latest Hive versions.

Hive Components
Two main components:
• High-level language (HiveQL): a set of commands
• Two execution modes:
• Local: reads/writes to the local file system
• MapReduce: connects to a Hadoop cluster and reads/writes to HDFS

Two modes of use:
• Interactive mode: console
• Batch mode: submit a script
Hive deals with Structured Data
• Hive Data Models:

• The Hive data models contain the following components:

 Databases: 3 levels: Tables -> Partitions -> Buckets

 Tables: map to an HDFS directory

 Partitions: map to sub-directories under the table directory

 Buckets (or clusters): map to files under each partition

Partitions:
• Partitioning means dividing a table into coarse-grained parts based on the value of a partition column such as ‘date’. This makes it faster to run queries on slices of the data.

• The partition keys determine how data is stored. Each unique value of the partition key defines a partition of the table. The partitions are often named after dates for convenience. It is similar to ‘block splitting’ in HDFS.

• Allows users to efficiently retrieve rows

• Buckets:

• Buckets give extra structure to the data that may be used for efficient queries.
 Split data based on the hash of a column – mainly for parallelism

 Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table (a small sketch of this layout follows below).
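To make the mapping concrete, the following sketch builds the HDFS-style path under which a row would be stored for a hypothetical sales table partitioned by date and bucketed on customer_id; the warehouse root, table name, and column names are invented for illustration and are not fixed Hive conventions.

# Sketch: where a row lands in a table partitioned by 'date' and bucketed
# by hash(customer_id). Paths, table and column names are hypothetical.
WAREHOUSE = "/user/hive/warehouse"   # assumed warehouse root
TABLE = "sales"
NUM_BUCKETS = 4

def storage_path(row):
    partition_dir = f"date={row['date']}"             # partition -> sub-directory
    bucket = hash(row["customer_id"]) % NUM_BUCKETS   # bucket -> file within the partition
    return f"{WAREHOUSE}/{TABLE}/{partition_dir}/bucket_{bucket:05d}"

row = {"date": "2021-01-15", "customer_id": "C042", "amount": 99.0}
print(storage_path(row))
# e.g. /user/hive/warehouse/sales/date=2021-01-15/bucket_00002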

5.11 HIVE ARCHITECTURE


Hive Consists of Mainly 3 core parts
1. Hive Clients
2. Hive Services
3. Hive Storage and Computing

Hive Clients
• Hive provides different drivers for communication with different types of applications. For Thrift-based applications, it provides a Thrift client for communication.

• For Java-related applications, it provides JDBC drivers.

• For other types of applications, ODBC drivers are provided.

• These Clients and drivers in turn again communicate with Hive server in the Hive
services.

Hive Services
• Client interactions with Hive can be performed through Hive Services.

• If the client wants to perform any query related operations in Hive, it has to
communicate through Hive Services.

• The CLI (command line interface) acts as a Hive service for DDL (Data Definition
Language) operations.

• All drivers communicate with Hive server and to the main driver in Hive services as
shown in above architecture diagram.

• The driver present in the Hive services is the main driver, and it communicates with all types of Thrift, JDBC, ODBC, and other client-specific applications.

• The driver passes requests from the different applications to the metastore and file systems for further processing.

Hive Storage and Computing


Hive services such as the metastore, file system, and job client in turn communicate with Hive storage and perform the following actions:
• Metadata information of tables created in Hive is stored in the Hive "metastore database".

• Query results and data loaded into the tables are stored in the Hadoop cluster on HDFS.

5.12 WHAT IS PIG


 A platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs.

 Compiles down to MapReduce jobs

 Developed by Yahoo!

 Open-source language

 Roughly 30% of Hadoop jobs run at Yahoo! are Pig jobs.


Pig Latin
 Pig provides a higher level language, Pig Latin, that:

– Increases productivity. In one test


• 10 lines of Pig Latin ≈ 200 lines of Java.
• What took 4 hours to write in Java took 15 minutes in Pig Latin.
– Opens the system to non-Java programmers.
– Provides common operations like join, group, filter, sort.

Pig Engine
• Pig provides an execution engine at Hadoop

– Removes need for users to tune Hadoop for their needs.


– Insulates users from changes in Hadoop interfaces.
Why a New Language?
• Pig Latin is a Data Flow Language rather than procedural or declarative.

• User code and existing binaries can be included almost anywhere.

• Metadata not required, but used when available.

• Support for nested types.

• Operates on files in HDFS.

Pig Components
Two main components:
• High-level language (Pig Latin): a set of commands
• Two execution modes:
• Local: reads/writes to the local file system
• MapReduce: connects to a Hadoop cluster and reads/writes to HDFS

Two modes of use:
• Interactive mode: console
• Batch mode: submit a script

Pig: Language Features


• Keywords

• Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, …
• Aggregations

• Count, Avg, Sum, Max, Min

• Schema

• Defined at query time, not when files are loaded

• UDFs

• Packages for common input/output formats

5.13 HBASE
• HBase is a distributed column-oriented data store built on top of HDFS
• HBase is an Apache open source project whose goal is to provide storage for the Hadoop
Distributed Computing
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to
data in the Hadoop File System.
• Data is logically organized into tables, rows and columns

Difference
• Hive and HBase are two different Hadoop based technologies –
• Hive is an SQL-like engine that runs MapReduce jobs, and
• HBase is a NoSQL key/value database on Hadoop.
• Just like Google can be used for search and Facebook for social networking, Hive can be
used for analytical queries while HBase for real-time querying.

5.14 WHAT IS NOSQL?


• Stands for Not Only SQL
• Key features (advantages):
• non-relational
• don’t require schema
• data are replicated to multiple nodes (so, identical & fault-tolerant) and can be partitioned:
• down nodes are easily replaced
• no single point of failure
• horizontally scalable
• cheap, easy to implement (open-source)
• massive write performance
• fast key-value access
HBase is NOT
• An SQL database – no joins, no query engine, no datatypes, no SQL
• It has no schema
• No DBA is needed

HBase: Part of Hadoop’s Ecosystem


5.15 HBASE VS. HDFS

• Both are distributed systems that scale to hundreds or thousands of nodes
• HDFS is good for batch processing (scans over big files)
• Not good for record lookup
• Not good for incremental addition of small batches
• Not good for updates
• HBase is designed to efficiently address the above points
• Fast record lookup
• Support for record-level insertion
• Support for updates (not in place)
• HBase updates are done by creating new versions of values
5.16 HBase and RDBMS
Storage Model in HBase

HBase is a column-oriented database and the tables in it are sorted by row.


• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.
• Each cell value has a timestamp (a small sketch of this nesting follows below).
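The nesting described above can be pictured as nested maps. The following sketch is only a conceptual model of the logical layout (table -> row key -> column family -> column -> timestamped versions), not HBase client code; the row key, families, and columns are made up.

import time

# Conceptual model only: table -> row key -> column family -> column -> {timestamp: value}
table = {
    "user123": {                         # row key
        "info": {                        # column family
            "name":  {int(time.time()): "Asha"},
            "email": {int(time.time()): "asha@example.com"},
        },
        "stats": {                       # another column family
            "logins": {int(time.time()): "42"},
        },
    },
}

def get_latest(table, row, family, column):
    """Return the most recent version of a cell (several versions are kept, keyed by timestamp)."""
    versions = table[row][family][column]
    return versions[max(versions)]

print(get_latest(table, "user123", "info", "name"))   # Asha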

5.17 NoSQL Databases


• NoSQL refers to non-relational database management systems, which differ from traditional relational database management systems in some significant ways.
• It is designed for distributed data stores with very large-scale data storage needs (for example Google or Facebook, which collect terabytes of data every day for their users).
• This type of data store may not require a fixed schema, avoids join operations, and typically scales horizontally.
• It provides a mechanism for storage and retrieval of data other than the tabular relations model used in relational databases.
• A NoSQL database doesn't use tables for storing data. It is generally used for big data and real-time web applications.
Why NoSQL
• NoSQL databases first started out as in-house solutions to real problems in companies such as Amazon (with Dynamo), Google, and others. These companies found that SQL didn’t meet their requirements.
• In particular, these companies faced three primary issues: unprecedented transaction
volumes, expectations of low-latency access to massive datasets, and nearly perfect service
availability while operating in an unreliable environment.
• Initially, companies tried the traditional approach: they added more hardware or upgraded
to faster hardware as it became available.
• When that didn’t work, they tried to scale existing relational solutions by simplifying their database schema, de-normalizing the schema, relaxing durability and referential integrity, introducing various query caching layers, separating read-only from write-dedicated replicas, and, finally, partitioning the data in an attempt to address these new requirements.
• None of this fundamentally addressed the core limitations, and it all introduced additional overhead and technical tradeoffs.

CAP Theorem – Consistency, Availability, Partition Tolerance


In 2000 Eric Brewer proposed the idea that in a distributed system you can’t continually maintain perfect consistency, availability, and partition tolerance simultaneously.
CAP is defined as:
• Consistency: all nodes see the same data at the same time.
• Availability: a guarantee that every request receives a response about whether it was
successful or failed.
• Partition tolerance: the system continues to operate despite arbitrary message loss.
NoSQL Advantages :
• High scalability
• Distributed Computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated Relationships
Disadvantages
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for

NoSQL Categories
There are four general types (most common categories) of NoSQL databases. Each of these
categories has its own specific attributes and limitations.
There is not a single solution which is better than all the others; however, there are some databases that are better suited to solving specific problems.
To clarify the NoSQL landscape, let's discuss the most common categories:
• Key-value stores (a toy sketch follows below)
• Column-oriented
• Graph
• Document-oriented
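The simplest of these categories, the key-value store, can be illustrated in a few lines; this is a toy in-memory sketch of the put/get/delete interface only, not how any particular NoSQL product is implemented.

class KeyValueStore:
    """Toy in-memory key-value store: opaque values addressed only by key."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("session:42", {"user": "asha", "cart": ["book", "pen"]})
print(store.get("session:42"))
store.delete("session:42")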

5.18 VISUAL DATA ANALYTICS


Visual analytics is the science of analytical reasoning supported by interactive visual
interfaces.
Visual analytics methods allow decision makers to combine their flexibility, creativity, and
background knowledge with the enormous storage and processing capacities of today’s
computers to gain insight into complex problems.
Visual analytics evolved from information visualization and automatic data analysis. It combines both formerly independent fields and strongly encourages human interaction in the analysis process.
Visualization is the communication of data through the use of interactive interfaces and has
three major goals:
a) presentation to efficiently and effectively communicate the results of an analysis,
b) confirmatory analysis as a goal-oriented examination of hypotheses, and
c) exploratory data analysis as an interactive and usually undirected search for structures and
trends

Visual analytics is more than only visualization. It can rather be seen as an integral approach
combining visualization, human factors, and data analysis. Visualization and visual analytics
both integrate methodology from information analytics, geospatial analytics, and scientific
analytics.
5.19 Visual Analytics Process
The visual analytics process is a combination of automatic and visual analysis methods with a
tight coupling through human interaction in order to gain knowledge from data.
In many visual analytics scenarios, heterogeneous data sources need to be integrated before
visual or automatic analysis methods can be applied.
Therefore, the first step is often to preprocess and transform the data in order to extract
meaningful units of data for further processing. Typical preprocessing tasks are data cleaning,
normalization, grouping, or integration of heterogeneous data into a common schema.
Continuing with this meaningful data, the analyst can select between visual and automatic analysis methods. After mapping the data, the analyst may obtain the desired knowledge directly, but it is more likely that an initial visualization is not sufficient for the analysis.
In contrast to traditional information visualization, findings from the visualization can be
reused to build a model for automatic analysis.
Once a model is created the analyst has the ability to interact with the automatic methods by
modifying parameters or selecting other types of analysis algorithms.
Model visualization can then be used to verify the findings of these models. Alternating
between visual and automatic methods is characteristic for the visual analytics process and
leads to a continuous refinement and verification of preliminary results.
Misleading results in an intermediate step can thus be discovered at an early stage, which
leads to more confidence in the final results.
5.20 DATA VISUALIZATION METHODS
Many conventional data visualization methods are often used. They are: table, histogram,
scatter plot, line chart, bar chart, pie chart, area chart, flow chart, bubble chart, multiple data
series or combination of charts, time line, Venn diagram, data flow diagram, and entity
relationship diagram, etc.
In addition, some data visualization methods are used that are less well known than the above methods. These additional methods are: parallel coordinates, treemap, cone tree, and semantic network.
• Parallel coordinates are used to plot individual data elements across many dimensions. Parallel coordinates are very useful for displaying multidimensional data (a short plotting sketch appears at the end of this section).
• A treemap is an effective method for visualizing hierarchies. The size of each sub-rectangle represents one measure, while color is often used to represent another measure of the data.
• A cone tree is another method for displaying hierarchical data, such as an organizational body, in three dimensions. The branches grow in the form of a cone.
• A semantic network is a graphical representation of the logical relationships between different concepts. It generates a directed graph: a combination of nodes or vertices, edges or arcs, and a label over each edge.

Visualizations are not only static; they can be interactive. Interactive visualization can be performed through approaches such as zooming (zoom in and zoom out), overview and detail, zoom and pan, and focus and context (fish eye).
The steps for interactive visualization are as follows
1. Selecting: Interactive selection of data entities or subset or part of whole data or whole
data set according to the user interest.
2. Linking: It is useful for relating information among multiple views.
3. Filtering: It helps users adjust the amount of information for display. It decreases
information quantity and focuses on information of interest.
4. Rearranging or Remapping: Because the spatial layout is the most important visual
mapping, rearranging the spatial layout of the information is very effective in producing
different insights.
Big data visualization can be performed through a number of approaches, such as more than one view per representation display, dynamic changes in the number of factors, and filtering (dynamic query filters, star-field display, and tight coupling).
Several visualization methods were analyzed and classified according to data criteria: (1)
large data volume, (2) data variety, and (3) data dynamics.
• Treemap: It is based on space-filling visualization of hierarchical data.
• Circle Packing: It is a direct alternative to the treemap; as its primitive shape it uses circles, which can themselves be nested inside circles from a higher hierarchy level.
• Sunburst: It uses the treemap idea converted to a polar coordinate system. The main difference is that the variable parameters are not width and height, but radius and arc length.
• Parallel Coordinates: It allows visual analysis to be extended with multiple data factors for different objects (see the sketch below).
• Streamgraph: It is a type of stacked area graph that is displaced around a central axis, resulting in a flowing, organic shape.
• Circular Network Diagram: Data objects are placed around a circle and linked by curves based on the degree of their relatedness. Different line widths or color saturations are usually used to indicate object relatedness.
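As a small illustration of one of these methods, pandas ships a parallel-coordinates helper; the DataFrame below is an invented toy dataset used only to show the call.

import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# Toy multidimensional data: each row is one object, 'group' is the class column
df = pd.DataFrame({
    "group":  ["A", "A", "B", "B"],
    "height": [1.2, 1.4, 2.1, 2.3],
    "width":  [0.5, 0.6, 0.9, 1.0],
    "weight": [3.0, 3.2, 5.5, 5.8],
})

# One vertical axis per dimension; each object becomes a polyline across the axes
parallel_coordinates(df, "group", color=["#1f77b4", "#ff7f0e"])
plt.title("Parallel coordinates (toy data)")
plt.show()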

5.21 VISUALIZATION TOOLS


Traditional data visualization tools are often inadequate to handle big data. Methods for
interactive visualization of big data were presented. First, a design space of scalable visual
summaries that use data reduction approaches (such as binned aggregation or sampling) was
described to visualize a variety of data types.
Methods were then developed for interactive querying (e.g., brushing and linking) among
binned plots through a combination of multivariate data tiles and parallel query processing.
The developed methods were implemented in imMens, a browser-based visual analysis
system that uses WebGL for data processing and rendering on the GPU.
A lot of big data visualization tools run on the Hadoop platform. The common modules in
Hadoop are: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN,
and Hadoop MapReduce.
They analyze big data efficiently, but lack adequate visualization. Some software with the
functions of visualization and interaction for visualizing data has been developed.
• Pentaho: It supports the spectrum of BI functions such as analysis, dashboard, enterprise-
class reporting, and data mining.
• Flare: An ActionScript library for creating data visualization that runs in Adobe Flash
Player.
• JasperReports: It has a novel software layer for generating reports from the big data
storages.
• Dygraphs: It is quick and elastic open source JavaScript charting collection that helps
discover and understand opaque data sets.
• Datameer Analytics Solution and Cloudera: Datameer and Cloudera have partnered to
make it easier and faster to put Hadoop into production and help users to leverage the power
of Hadoop.
• Platfora: Platfora converts raw big data in Hadoop into interactive data processing engine.
It has modular functionality of in-memory data engine.
• ManyEyes: It is a visualization tool launched by IBM. Many Eyes is a public website where
users can upload data and create interactive visualization.
• Tableau: It is a business intelligence (BI) software tool that supports interactive and visual
analysis of data. It has an in-memory data engine to accelerate visualization.
Tableau has three main products to process large-scale datasets: Tableau Desktop, Tableau Server, and Tableau Public. Tableau also embeds Hadoop infrastructure. It uses Hive
to structure queries and cache information for in-memory analytics. Caching helps reduce the
latency of a Hadoop cluster. Therefore, it can provide an interactive mechanism between
users and Big Data applications.
At present, big data processing tools include Hadoop, High Performance Computing and
Communications, Storm, Apache Drill, RapidMiner, and Pentaho BI.
Data visualization tools include NodeBox, R, Weka, Gephi, Google Chart API, Flot, D3, and
Visual.ly, etc.
A big data visualization algorithm analysis integrated model based on RHadoop has been proposed. The integrated model can process ZB- and PB-scale data and show valuable results via visualization.

5.22 VISUAL DATA ANALYTICS APPLICATIONS


Visual analytics is essential in application areas where large information spaces have to be
processed and analyzed. Major application fields are physics and astronomy. Especially the
field of astrophysics offers many opportunities for visual analytics techniques:
Massive volumes of unstructured data, originating from different directions of space and
covering the whole frequency spectrum, form continuous streams of terabytes of data that can
be recorded and analyzed.
Monitoring climate and weather is also a domain which involves huge amounts of data
collected by sensors throughout the world and from satellites in short time intervals.
• A visual approach can help to interpret these massive amounts of data and to gain insight
into the dependencies of climate factors and climate change scenarios that would otherwise
not be easily identified.
• Besides weather forecasts, existing applications visualize the global warming, melting of
the poles, the stratospheric ozone depletion, as well as hurricane and tsunami warnings.
Emergency management
• In the domain of emergency management, visual analytics can help determine the ongoing progress of an emergency and identify the next countermeasures (e.g., construction of physical countermeasures or evacuation of the population) that must be taken to limit the damage.
• Such scenarios can include natural or meteorological catastrophes like floods or tidal waves, volcanoes, storms, fires, or the epidemic growth of diseases (e.g. bird flu), but also human-made technological catastrophes like industrial accidents, transport accidents, or pollution.
Security and Geographics
• Visual analytics for security and geographics is an important research topic. The application
field in this sector is wide, ranging from terrorism informatics, border protection, path
detection to network security.
• Visual analytics supports investigation and detection of similarities and anomalies in large
data sets, like flight customer data, GPS tracking or IP traffic data.
Biology and Medicine
• In biology and medicine, computed tomography and ultrasound imaging for 3-dimensional digital reconstruction and visualization produce gigabytes of medical data and have been widely used for years.
• The application area of bio-informatics uses visual analytics techniques to analyze large
amounts of biological data. Since the early days of sequencing, scientists in these areas have faced unprecedented volumes of data, as in the Human Genome Project with its three billion base pairs per human genome.
• Other new areas like Proteomics (studies of the proteins in a cell), Metabolomics
(systematic study of unique chemical fingerprints that specific cellular processes leave
behind) or combinatorial chemistry with tens of millions of compounds even enlarge the
amount of data every day.
• A brute-force computation of all possible combinations is often not possible, but interactive
visual approaches can help to identify the main regions of interest and exclude areas that are
not promising.
Business Intelligence.
• Another major application domain for visual analytics is business intelligence. The financial
market with its hundreds of thousands of assets generates large amounts of data every day,
which accumulate to extremely high data volumes throughout the years.
• The main challenge in this area is to analyze the data under multiple perspectives and assumptions to understand historical and current situations, and then to monitor the market to forecast trends or to identify recurring situations.

6. Question Bank Unit Wise


6. 1 QUESTION BANK OF UNIT 1
1. Define Data analytics in detail.
2. Define different types of data analytics.
3. What are the characteristics of data analytics?
4. What are difference between Report and Analytics?
5. What are the applications of data analytics explain in detail?
6. Explain classification of Data Analytics.
7. What are the key roles for successful analytic projects? Explain various phases
of data analytics lifecycle

6. 2 Question Bank of Unit 2


1. What are the problems faced if clustering exists in a non-Euclidean space?
2. State the use of the Apriori algorithm in data mining.
3. Write short notes on:
a. CLIQUE
b. FREQUENT PATTERN BASED CLUSTERING METHOD
c. Hierarchical clustering
d. Fuzzy logic
4. What is a Bayesian network? With an example, explain how this network can
be used for analyzing data.
5. Describe the steps involved in support vector space based inference methodology.
6. List out the various steps of the PROCLUS clustering algorithm and its
significance.
7. Explain the K-Means algorithm. When would you use k-means? State whether
the statement “K-Means has an assumption that each cluster has a roughly equal
number of observations” is true or false. Justify your answer.

8. Distinguish between supervised and unsupervised learning.


9. Given data = {2, 3, 4, 5, 6, 7; 1, 5, 3, 6, 7, 8}. Compute the
principal component using PCA Algorithm.
10. Explain how fuzzy logic is implemented for image processing.

6.3 QUESTION BANK OF UNIT 3

1. What are the key roles for successful analytic project?


2. Illustrate the various Real Time Analytic Platform (RTAP) applications.
3. Explain in detail about the Market-Basket Model with example.
4. Identify the major issues in Data stream query processing.
5. Discriminate the concept of sampling data in a stream.
6. Explain in detail real-time sentiment analysis and stock prediction
with an example.
7. List out the various type of technologies in (RTAP).
8. List out the various application of data stream.
9. Differentiate between data stream mining and traditional data mining.

6.4 QUESTION BANK OF UNIT 4


1. Define frequent pattern based clustering method.
2. Define any four type of data analysis.
3. What are the key roles for successful analytic project?
4. What are the problems faced if clustering exists in non-Euclidean space.
5. State the use of Apriori algorithm in data mining
6. List out the various steps of PROCLUS clustering algorithm and its
significances.
7. Write short notes on
e. CLIQUE
f. FREQUENT PATTERN BASED CLUSTERING METHOD
g. Hierarchical, K-means clustering
h. fuzzy logic
6.5 QUESTION BANK OF UNIT 5

1. Describe the architecture of HIVE with its feature.


2. Brief about the main components of Map Reduce.
3. Draw and explain the architecture of Data Stream Model.
4. Illustrate the Hadoop distributed file system architecture with neat
diagram.
5. How is Hadoop different from other parallel computing systems?
6. What are the advantages of NoSQL over traditional RDBMS?
7. What do you understand by NoSQL databases? Explain
8. Describe characteristics of a NoSQL database.
9. Describe the structure of HDFS in a Hadoop ecosystem using a
diagram.
10. Write a short note on NoSQL databases. List the differences between
NoSQL and relational databases
11. What is HBase? List out and explain the basic concepts of HBase in
detail
12. Differentiate: Apache pig Vs Map Reduce
13. Define HDFS. Discuss the HDFS architecture and HDFS commands in
brief.
7.Multiple Choice Question Unit Wise
7.1 MCQ’s of Unit 1
1. Which of the following is not an example of Social Media?
1. Twitter
2. Google
3. Instagram
4. Youtube
2. Data Analysis is a process of
1. inspecting data
2. cleaning data
3. transforming data
4. All of Above
3. Which of the following is not a major data analysis approaches?
1. Data Mining
2. Predictive Intelligence
3. Business Intelligence
4. Text Analytics
4. The Process of describing the data that is huge and complex to store and process is known as
1. Analytics
2. Data mining
3. Big data
4. Data warehouse
5. ____ have a structure but cannot be stored in a database.
1. Structured
2. Semi Structured
3. Unstructured
4. None of these
6. ____ refers to the ability to turn your data useful for business
1. Velocity
2. variety
3. Value
4. Volume
7. Files are divided into ____ sized Chunks.
1. Static
2. Dynamic
3. Fixed
4. Variable
8. ____ is a factor considered before adopting Big Data technology
1. Validation
2. Verification
3. Data
4. Design
9. Which type of analytics is used for improving supply chain management to optimize stock management, replenishment, and forecasting?
1. Descriptive
2. Diagnostic
3. Predictive
4. Prescriptive
10. which among the following is not a Data mining and analytical applications?
1. profile matching
2. social network analysis
3. facial recognition
4. Filtering

7.2 MCQ’s of Unit 2
1. Which of the following is true about regression analysis?
a. answering yes/no questions about the data
b. estimating numerical characteristics of the data
c. modeling relationships within the data
d. describing associations within the data
2. What is a hypothesis?
1. A statement that the researcher wants to test through the data collected in a study.
2. A research question the results will answer.
3. A theory that underpins the study.
4. A statistical method for calculating the extent to which the results could have
happened by chance.
3. What is the cyclical process of collecting and analysing data during a single research study
called?
1. Interim Analysis
2. Inter analysis
3. inter item analysis
4. constant analysis
4. Which of the following is not a major data analysis approaches?
1. Data Mining
2. Predictive Intelligence
3. Business Intelligence
4. Text Analytics
5. The Process of describing the data that is huge and complex to store and process is known
as
1. Analytics
2. Data mining
3. Big data
4. Data warehouse
6 . Which of the following is true about regression analysis?
1. answering yes/no questions about the data
2. estimating numerical characteristics of the data
3. modeling relationships within the data
4. describing associations within the data

7. Which of the following is a widely used and effective machine learning algorithm based on
the idea of bagging?
A. Decision Tree
B. Regression
C. Classification
D. Random Forest
8. PCA is a ________.

A. Non linear method


B. Linear method
C. Continuous method
D. Repeated method
9. _________ is non-zero vector that stays parallel after matrix multiplication.

A. Eigen value
B. Eigen vector
C. Linear value
D. None of these
10. __________is a dimensionality reduction technique which is commonly used for the
supervised classification problems.

A. Value analysis
B. Function Analysis
C. Pure analysis
D. None of these
11.The predictions for generative learning algorithms are made using _______ .

A. Naive Theorem
B. Bayes Theorem
C. Naive Bayes Theorem
D. None of these
12. ________is an important factor in predictive modeling

A. Dimensionality Reduction
B. feature selection
C. feature extraction
D. None of these
13 . What would you do in PCA to get the same projection as SVD?
A. transform data to zero mean
B. transform data to zero median
C. not possible
D. none of these
14. . What is the full form of BN in Neural Networks?
A. Bayesian Networks
B. Belief Networks
C. Bayes Nets
D. All of the above
.

15. . What is Neuro software?

A. A software used to analyze neurons


B. It is powerful and easy neural network
C. Designed to aid experts in real world
D. It is software used by Neurosurgeon
7.3 MCQ’s of Unit 3

1. Which of the following are example(s) of


Real Time Big Data Processing?
a. Complex Event Processing (CEP)
platforms
b. Stock market data analysis
c. Bank fraud transactions detection
d. both (a) and (c)
2. Which of the following is not an example
of Social Media?
a. Twitter
b. Google
c. Instagram
d. Youtube

3. In Filtering Streams____________
a. Accept those tuples in the stream that meet a criterion.
b. Accept data in the stream that meet a criterion.
c. Accept those class in the stream that meet a criterion.
d. Accept rows in the stream that meet a criterion

4. A Bloom filter consists of_________


a. An array of n bits, initially all 0’s.
b. An array of 1 bits, initially all 0’s.
c. An array of 2 bits, initially all 0’s.
d. An array of n bits, initially all 1’s.

5. The purpose of the Bloom filter is to allow____________


a. through all stream elements whose keys are in Set
b. through all stream elements whose keys are in class
c. through all data elements whose keys are in Set
d. through all touple elements whose keys are in Set

6. Which attribute is not indicative for data streaming?


a. Limited amount of memory
b. Limited amount of processing time
c. Limited amount of input data
d. Limited amount of processing power

7. In DGIM, whenever forming a bucket, then _____

a. Every bucket should have at least one 1, else no bucket can be formed
b. Every bucket should have at least two 1s, else no bucket can be formed
c. Every bucket should have at least three 1s, else no bucket can be formed
d. Every bucket should have at least four 1s, else no bucket can be formed

8. Which of the following statements about standard Bloom filters is correct?


a. It is possible to delete an element from a Bloom filter.
b. A Bloom filter always returns the correct result.
c. It is possible to alter the hash functions of a full Bloom filter to create more space.
d. A Bloom filter always returns TRUE when testing for a previously added element.

9. Real-time data stream is _______


a. sequence of data items that arrive in some order and may be seen only once.
b. sequence of data items that arrive in some order and may be seen twice.
c. sequence of data items that arrive in same order
d. sequence of data items that arrive in different order

10. What are DGIM’s maximum error boundaries?


a. DGIM always underestimates the true count; at most by 25%
b. DGIM either underestimates or overestimates the true count; at most by 50%
c. DGIM always overestimates the count; at most by 50%
d. DGIM either underestimates or overestimates the true count; at most by 25%

7.4 MCQ’s of Unit 4
1. The number of iterations in apriori _
1. increases with the size of the data
2. decreases with the increase in size of the data
3. increases with the size of the maximum frequent set
4. decreases with increase in size of the maximum frequent set
2. Which of the following are interestingness measures for association rules?
1. Recall ‘
2. Lift
3. Accuracy
4. All of Above
3. _______ is an example of case-based learning
1. Decision trees
2. Neural networks
3. Genetic algorithm
4. K-nearest neighbor
4. Which of the following is finally produced by Hierarchical Clustering?
a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned
5. Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

6. _______ is the task of dividing the population or data points into a number of groups.
A) Unsupervised learning
B) clustering
C) semi supervised
D) classification

7. Which of the following is not a clustering method?


A) Density-Based
B) Hierarchical Based
C) Grid-based
D) Project Based

8. Agglomerative clustering follows a _________ approach


A) top down
B) bottom up
C) down up
D) None of these

9. _____ is a model in which we fit the data based on the probability of how it may belong to
the same distribution.
A) Centroid based methods
B) distribution based model
C) Connectivity based methods
D) None of these
10. _______ is basically a type of unsupervised learning method
A) Unsupervised learning
B) clustering
C) semi supervised
D) classification
7.5 MCQ’s of Unit 5
1.What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
2. Which of the following platforms does Hadoop run on?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
3. IBM and ________ have announced a major initiative to use Hadoop to support university
courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
4. HDFS works in a __________ fashion.
a) master-slave
b) worker/slave
c) None of the above
d) all of the mentioned
5. Which of the following scenario may not be a good fit for HDFS?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
6. The need for data replication can arise in various scenarios like ____________
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
7. The minimum amount of data that HDFS can read or write is called a _____________.
a) Datanode
b) Namenode
c) Block
d) None of the above

8. Which of the following is not Features Of HDFS?


a) It is suitable for the distributed storage and processing.
b) Streaming access to file system data.
c) HDFS provides file permissions and authentication.
d) Hadoop does not provide a command interface to interact with HDFS

9. In HDFS the files cannot be


a) read
b) deleted
c) executed
d) Archived

10. Which of the following is the most popular high-level Java API in Hadoop Ecosystem?
a) Cascading
b) Scalding
c) Hcatalog
d) Cascalog

8. Previous year Question papers


8.1 Year -2021
Subject Code: KCS051
Roll No:

B TECH (SEM-V) THEORY EXAMINATION 2020-21
DATA ANALYTICS
Time: 3 Hours                                          Total Marks: 100
Note: 1. Attempt all Sections. If you require any missing data, then choose it suitably.
SECTION A
1. Attempt all questions in brief. 2 x 10 = 20

Q no.    Question    Marks    CO
a. What are the different types of data? 2 1
b. Explain decision tree. 2 1
c. Give the full form of RTAP. 2 3
d. List various phases of data analytics lifecycle. 2 1
e. Explain the role of Name Node in Hadoop. 2 5
f. Discuss heartbeat in HDFS. 2 5
g. Differentiate between an RDBMS and Hadoop. 2 5
h. Write names of two visualization tools. 2 4
i. How can you deal with uncertainty? 2 3
j. Data sampling is very crucial for data analytics. Justify the statement. 2 3
SECTION B
2. Attempt any three of the following:

Q no.    Question    Marks    CO
a. Explain the K-Means algorithm. When would you use k-means? State
whether the statement “K-Means has an assumption that each cluster has a
roughly equal number of observations” is true or false. Justify your
answer. 10 4
b. Illustrate and explain the steps involved in Bayesian data analysis. 10 2
c. Suppose that A, B, C, D, E and F are all items. For a particular support
threshold, the maximal frequent itemsets are {A, B, C} and {D, E}.
What is the negative border? 10 1
d. Discuss any two techniques used for multivariate analysis. 10 2
e. Design and explain the architecture of data stream model. 10 3
SECTION C
3. Attempt any one part of the following:
Q no.    Question    Marks    CO
a. Describe the architecture of HIVE with its features. 10 5
b. Brief about the main components of MapReduce 10 5
4. Attempt any one part of the following:

Q no.    Question    Marks    CO
a. Describe any two data sampling techniques. 10 1
b. Explain any one algorithm to count the number of distinct elements in a
data stream. 10 3

5. Attempt any one part of the following:

Q no.    Question    Marks    CO
a. Brief about the working of CLIQUE algorithm. 10 4
b. Cluster the following eight points (with (x, y) representing locations)
into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5),
A6(6, 4), A7(1, 2), A8(4, 9).
Initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The
distance function between two points a = (x1, y1) and b = (x2, y2) is
defined as ρ(a, b) = |x2 – x1| + |y2 – y1|.
Use the K-Means algorithm to find the three cluster centers after the
second iteration. 10 4
6. Attempt any one part of the following:

Q no.    Question    Marks    CO
a. What is prediction error? State and explain prediction error in
regression and classification with suitable examples. 10 4
b. Given data = {2, 3, 4, 5, 6, 7; 1, 5, 3, 6, 7, 8}, compute the principal
component using the PCA algorithm. 10 2
7. Attempt any one part of the following:

Q no.    Question    Marks    CO
a. Develop and explain the data analytics life cycle 10 1
b. Distinguish between supervised and unsupervised learning with an
example. 10 1

9. NPTEL Lectures Link


9.1 Link 1 Introduction to Data Analytics
https://www.youtube.com/watch?v=La-NZ6jOfoQ&list=PLRueFtKLr0QN7MmQ8pdpQerOe_s8vGJG4&index=1
9.2 Link 2 Supervised Learning
https://www.youtube.com/watch?v=JU7pT7efEzQ&list=PLRueFtKLr0QN7MmQ8pdpQerOe_s8vGJG4&index=18
9.3 Link 3 Logistic Regression
https://www.youtube.com/watch?v=kfFT4iTCDjg&list=PLRueFtKLr0QN7MmQ8pdpQerOe_s8vGJG4&index=25
9.4 Link 4 Support Vector Machines
https://www.youtube.com/watch?v=FG0TcQrWN5k&list=PLRueFtKLr0QN7MmQ8pdpQerOe_s8vGJG4&index=31
9.5 Link 5 Artificial Neural Network
https://www.youtube.com/watch?v=ssvdhOMzO_A&list=PLRueFtKLr0QN7MmQ8pdpQerOe_s8vGJG4&index=36
