Data Mining First Draft

WEEK 1
Lecture 1:
1. What are the foundations of data mining?
Data mining is the result of a long process of research and product development. This evolution began when business data was first stored on computers and has now reached the point where users can navigate through their data in real time. Data mining is also popular in the business community because it is supported by three technologies that are now mature: massive data collection, powerful multiprocessor computers, and data mining algorithms.
2. What is Data Mining?
Data mining may be defined as the application of statistical methods to potentially quite diverse data sets in order to clarify relationships, perhaps including some previously unknown, to estimate their strength, or to accumulate data about real or hypothetical entities such as a person, a group of persons, a commercial enterprise, or other entities or events. The results may then be used to make statements about the real or estimated characteristics of these entities, or to test hypotheses related to one or more of the systems with which they interact.
Data mining relies on several assumptions: (1) that one has access to a sufficient amount of data to be useful for one’s purposes, often associated with business, government, or research interests, but sometimes just idle curiosity; (2) that there is reason to believe much of the data can be regarded analytically as “noise” but one or more “signals” of interest can be found by intelligent searching or “mining”; (3) that the use of various analytical tools, predominantly statistical tools, can extract and amplify these signals and distinguish them in some reliable manner from the noise; and (4) that the uncertainties surrounding the conclusions drawn from any such analysis are examined and deemed acceptable.
The signals are then applied to the problem at hand, such as identifying potentially profitable customers or business locations, risk factors for diseases, unauthorized use of credit cards, incidents of bioterrorism, or terrorist suspects.
3. What types of data are being mined?
Data that are mined pertain to individuals, businesses, or natural
events or conditions
such as weather patterns or contamination. The types of personal
data that are mined
include age, race, sex, marital status, income, education, medical
history, genetic
information, employment, travel itinerary, and buying patterns. The
data pertaining to
individuals may be specific to an identified person; may be
anonymized by removing
direct identifiers such as name, address, or social security number; or
may be
aggregated over geographic, demographic, or other variables. These
types of data
come from sources such as internal government records supporting a
program or
activity, government records classified as public and open to review,
and customer transaction records obtained by businesses.
Lecture 2:
1. What is a data warehouse?
A data warehouse is a collection of data primarily used for reporting
and analysis. It’s a
way to provide business analysts and other users with a centralized
repository of
accurate, enterprise data from which to glean timely analytic insights
that guide
everyday business decisions.
Data typically flows into the data warehouse from transaction systems, databases, and other internal and external data sources. Data warehouses usually
include historical
data derived from transactions, but can also include real-time
streaming data as well as
other sources.
2. What is a data mart?
A data mart is a subset of data (typically a subset of a data warehouse) that is focused on a specific area of the business. When users need data about only one subject area or department, a data mart is often the answer. Think of it as a boutique store rather than a giant warehouse for their data needs. Compared with a full data warehouse, a data mart can make it easier for business users and analysts to locate the answers they need more quickly.
3. What is big data?
Big data is the term used to describe extremely large or complex data
sets that require
advanced analytical tools to analyze. Unlike traditional data, big data
typically exhibits
five characteristics that set it apart from other data sets: volume,
variety, velocity,
variability and veracity.
It’s these characteristics that make big data so valuable for innovation
and insight. For
instance, it can feed machine-learning algorithms to make artificial
intelligence-driven
processes smarter. Big data can also be integrated with smaller
enterprise data sets for
more comprehensive, granular insights.
4. What is metadata?
Metadata is data about data. It describes the structure of the data
warehouse, data
vault or data mart. Metadata captures all the information necessary
to move and
transform data from source systems into your data infrastructure and
helps users
understand and interpret the data in the data infrastructure. It is also
the foundation for
aspects such as documentation and lineage.
Lecture 3:
1. What is OLAP?
OLAP stands for Online Analytical Processing. It is a way of analyzing large amounts of multi-dimensional data in real time by performing various operations on data dimensions.
2. Are there any 3-D cubes or hyper-cubes on the screen?
Not per se. An OLAP cube is merely a conceptual notion for organizing and displaying data that can be viewed “by something”: by time period, by geography, by type, etc. The word “by” is an indication that the data is dimensional.
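To make the “viewed by” idea concrete, here is a minimal sketch using pandas and made-up sales figures (not any particular OLAP server): it rolls the data up by time period and geography, then slices by product.

```python
import pandas as pd

# Hypothetical sales records; the "cube" is simply this data viewed BY several dimensions.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024],
    "region":  ["EU", "US", "EU", "US", "US"],
    "product": ["A", "A", "B", "A", "B"],
    "amount":  [100, 150, 120, 200, 90],
})

# Roll-up by time period and geography (an OLAP-style aggregation).
cube = sales.pivot_table(index="year", columns="region", values="amount", aggfunc="sum")
print(cube)

# Slice along one dimension: look only at product A, summarized by year.
print(sales[sales["product"] == "A"].groupby("year")["amount"].sum())
```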
3. What types of data can be stored?
Kinetica organizes data in a manner similar to a standard relational
database. Each
database consists of tables, each defined by a schema. Data is
strongly typed for each
field in the schema and can be double, float, int, long, string, or
bytes.
If you are using our native API, your interface to the system is that of
an object-based
datastore, with each object corresponding to a row in the table.
4. Where is the data stored?
Kinetica stores data in-memory. It is able to utilize both system RAM and vRAM (the memory available on the GPU card itself).
The benefit of storing data in vRAM is that the transfer time is very, very fast. The downside is that vRAM is expensive and limited in capacity, currently 24 GB on an NVIDIA K80.
For larger datasets, system RAM allows the database to scale to much
larger volumes
and scale out across many nodes of standard hardware. Data stored
in main memory
can be efficiently fed to the GPU, and this process is even more
efficient with the
NVLink architecture.
Lecture 4:
1. Where is the big data trend going?
Eventually the big data hype will wear off, but studies show that big
data adoption will
continue to grow. With a projected $16.9B market by 2015, it is clear
that big data is
here to stay. However, the big data talent pool is lagging behind and
will need to catch
up to the pace of the market. McKinsey & Company estimated in May
2011 that by
2018, the US alone could face a shortage of 140,000 to 190,000
people with deep
analytical skills as well as 1.5 million managers and analysts with the
know-how to use
the analysis of big data to make effective decisions.
The emergence of big data analytics has permanently altered many
businesses’ way of
looking at data. Big data can take companies down a long road of
staff, technology, and
data storage augmentation, but the payoff – rapid insight into never-
before-examined
data – can be huge. As more use cases come to light over the coming
years and
technologies mature, big data will undoubtedly reach critical mass
and will no longer be
labeled a trend. Soon it will simply be another mechanism in the BI
ecosystem.
2. Who are some of the BIG DATA users?
From cloud companies like Amazon to healthcare companies to
financial firms, it seems
as if everyone is developing a strategy to use big data. For example,
every mobile
phone user has a monthly bill which catalogs every call and every
text; processing the
sheer volume of that data can be challenging. Software logs, remote
sensing
technologies, information-sensing mobile devices all pose a challenge
in terms of the
volumes of data created. The size of Big Data can be relative to the
size of the
enterprise. For some, it may be hundreds of gigabytes; for others, it takes tens or hundreds of terabytes before size becomes a significant consideration.
3. What is Hadoop?
The Apache Hadoop software library allows for the distributed
processing of large data
sets across clusters of computers using a simple programming model.
The software library is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
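As a rough, single-machine illustration of the MapReduce programming model commonly associated with Hadoop, the toy sketch below counts words with an explicit map phase and reduce phase. It does not use the Hadoop API itself, and the documents are invented.

```python
from collections import defaultdict

documents = ["big data big insight", "data mining finds patterns in data"]

def map_phase(doc):
    # Emit (key, value) pairs: one (word, 1) pair per word in the document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group the pairs by key and sum the values for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

mapped = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(mapped))   # e.g. {'big': 2, 'data': 3, ...}
```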
WEEK 2
Lecture 1:
1. What is data warehousing?
A data warehouse is a repository of data used for management decision support systems. It contains a wide variety of data that presents a coherent picture of business conditions at a single point in time.
In a single sentence, it is a repository of integrated information which is available for queries and analysis.
2. What is Data Mining in Healthcare?
The industry collects a dazzling array of data, much of it in the form of electronic health records (EHRs) collected by HIPAA-covered health care facilities. According to a survey
According to a survey
published by PubMed, data mining is becoming increasingly popular
in healthcare, if not
increasingly essential. The huge amounts of data generated by
healthcare EDI
transactions cannot be processed and analyzed using traditional
methods because of
the complexity and volume of the data.
Data mining provides the methodology and technology for
healthcare organizations to:
evaluate treatment effectiveness
save lives of patients using predictive medicine
manage healthcare at different levels
3. What is Fraud detection?
There are many definitions of fraud, depending on the point of view considered. According to The American Heritage Dictionary, Second College Edition, fraud is defined as ’a deception deliberately practiced in order to secure unfair or unlawful gain’. Davia et al. (2000) paraphrase this as a number of items that must be identified when articulating a case of fraud:
a victim
details of the deceptive act thought to be fraudulent
4. What is fraud detection software?
Fraud detection models are oriented toward detecting fraudulent transactions in a financial data stream (e.g., one related to credit cards).
Lecture 2:
1. Explain about data warehouse architecture?
The data warehouse architecture is based on a relational database
management
system. In the data warehouse architecture, operational data and
processing are
completely separate from data warehouse processing. This central
information
repository is surrounded by a number of key components designed
to make the entire
environment functional, manageable and accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.
2. What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL software reads data from a specified data source and extracts a desired subset of the data. Next, it transforms the data using rules and lookup tables, converting it to the desired state.
Finally, the load function is used to load the resulting data into the target database.
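A minimal ETL sketch, assuming invented source rows, column names, and a SQLite database standing in for the target system; each function corresponds to one of the three steps.

```python
import sqlite3

# Hypothetical source rows, standing in for an export from an operational system.
SOURCE_ROWS = [
    {"customer_id": "101", "country": " us ", "amount": "250.00"},
    {"customer_id": "",    "country": "de",   "amount": "99.00"},   # bad row: missing key
    {"customer_id": "102", "country": "DE",   "amount": "80.50"},
]

def extract():
    """Extract: read raw rows from the source (here an in-memory stand-in)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: apply rules and conversions to reach the desired state."""
    cleaned = []
    for row in rows:
        if not row["customer_id"]:          # rule: skip rows missing the business key
            continue
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "country": row["country"].strip().upper(),   # standardize the value
            "amount": float(row["amount"]),
        })
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(customer_id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer_id, :country, :amount)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM sales").fetchall())
```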
3. What is Datamart?
A data mart is a specialized version of a data warehouse. It contains a snapshot of operational data that helps business people make decisions based on analysis of past trends and experiences. A data mart emphasizes easy access to relevant information.
4. What is called data cleaning?
As the name implies, data cleaning is largely self-explanatory: it is the removal or correction of orphan records, data that breaches business rules, inconsistent data, and missing information in a database.
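A small, hypothetical example of these cleaning steps using pandas (the order data and rules are invented for illustration):

```python
import pandas as pd

# Invented order data with the usual problems: a duplicate, an orphan row, a rule breach.
orders = pd.DataFrame({
    "order_id":    [1, 2, 3, 4, 4],
    "customer_id": [10, None, 12, 13, 13],
    "amount":      [250.0, 99.0, -5.0, 80.0, 80.0],
})

orders = orders.drop_duplicates()                   # remove duplicate records
orders = orders.dropna(subset=["customer_id"])      # drop orphan rows with no customer
orders = orders[orders["amount"] >= 0]              # drop rows breaching a business rule
orders["customer_id"] = orders["customer_id"].astype(int)
print(orders)
```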
5. What is Metadata?
Metadata is defined as data about the data. Metadata contains information such as the number of columns used, fixed or limited column widths, the ordering of fields, and the data types of the fields.
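As a toy illustration, a metadata record for a hypothetical sales table (names invented) might capture exactly those details:

```python
# A hypothetical metadata record describing one warehouse table.
sales_metadata = {
    "table": "sales",
    "source_system": "orders_oltp",
    "columns": [
        {"name": "customer_id", "position": 1, "type": "INTEGER", "width": 10},
        {"name": "country",     "position": 2, "type": "TEXT",    "width": 2},
        {"name": "amount",      "position": 3, "type": "REAL",    "width": 12},
    ],
}

# Such records also support documentation and lineage questions, for example:
print(len(sales_metadata["columns"]), "columns, loaded from", sales_metadata["source_system"])
```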
6. What are the languages used in Data cleansing?
R (programming language), SQL (Structured Query Language), and advanced Excel macros.
Lecture 3:
1. What are the tools available for ETL?
Following are some of the ETL tools available:
Informatica
DataStage
Oracle Warehouse Builder
Ab Initio
Data Junction
2. What are the fundamental skills of a Data Architect?
The fundamental skills of a Data Architect are as follows:
1. The individual should possess knowledge about data modeling in
detail
2. Physical data modeling concepts
3. Should be familiar with ETL process
4. Should be familiar with Data warehousing concepts
5. Hands-on experience with data warehouse tools and related software
6. Experience in developing data strategies
7. Ability to build data policies and plans for their execution
3. How to become a data architect?
The following are the prerequisites for an individual to start a career as a data architect:
1) A bachelor's degree is essential, preferably with a computer science background.
2) No predefined certifications are necessary, but it is always good to have a few certifications related to the field because some companies might expect them. It is advisable to pursue the CDMP (Certified Data Management Professional).
3) Should have at least 3-8 years of IT experience.
4) Should be creative, innovative and good at problem-solving.
5) Should have good programming knowledge and understand data modeling concepts.
6) Should be well versed with technologies and tools such as SOA, ETL, ERP and XML.
4. Differentiate between dimension and attribute?
In short, dimensions represent qualitative data. For example, plan, product and class are all considered dimensions.
An attribute is a subset of a dimension. Within a dimension table we have attributes, which can be textual or descriptive. For example, product name and product category are attributes of the product dimension.
5. Explain the different data models that are available in detail?
There are three different kinds of data models that are available and
they are as follows:
1. Conceptual
2. Logical
3. Physical
Conceptual data model:
As the name implies, this data model depicts the high-level design of the available physical data.
Logical data model:
Within the logical model, the entity names, entity relationships, attributes, primary keys and foreign keys show up.
Physical data model:
This data model gives the most detail and showcases how the model is implemented in the database. All the primary keys, foreign keys, table names and column names show up here.
Lecture 4:
1. Explain about data warehouse architecture?
The data warehouse architecture is based on a relational database
management
system. In the data warehouse architecture, operational data and
processing are
completely separate from data warehouse processing. This central
information
repository is surrounded by a number of key components designed
to make the entire
environment functional, manageable and accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.
2. How can the Banking sector get benefited from data warehouse
technologies?
The banking industry benefits greatly from data warehouse systems when processing share and investment reports. These financial reports can be collected from different sources and stored in a data warehouse, where investors can track share performance and improvements in financial growth. Data warehouse technology is also essential for storing and tracking reports for risk management and fraud management, and for supporting decisions such as offering loans and credit cards that earn interest for the banking sector and industry.
3. How can a political leader get benefited from data warehouse
technologies?
Political organizations also benefit from data warehouse systems, for example when processing voter reports. Voter reports can be collected from different sources and stored in a data warehouse, where members can track performance and economic improvements. Data warehouse technology is essential for storing and tracking reports for risk management, fraud management, and planning the facilities to be provided across the country.
WEEK 3
Lecture 1:
1. What is Business Intelligence?
The term ‘Business Intelligence’ (BI) refers to providing users with the data and tools needed to answer the decision-making questions that are important to an organization, whether they relate to running the whole business or just a part of it. In short, business intelligence is used for reporting the business data that matters most, on the basis of which the higher management of an organization takes decisions for the growth of the business.
Typically, an organization can make the following kinds of decisions with a Business Intelligence tool:
BI is used to determine whether a business is running as per plan.
BI is used to identify which things are actually going wrong.
BI is used to take and monitor corrective actions.
BI is used to identify the current trends of their business.
2. What are different stages and benefits of Business Intelligence?
There are five stages of Business Intelligence:
Data Source: extracting data from multiple data sources.
Data Analysis: providing a proper analysis report based on useful knowledge derived from a collection of data.
Decision-Making Support: using the information in the proper way; it aims to provide timely views of important events such as takeovers, market changes, and poor staff performance.
Situation Awareness: filtering out irrelevant information and setting the remaining information in the context of the business and its environment.
Risk Management: discovering what corrective actions might be taken, or decisions made, at different times.
The benefits of Business Intelligence include:
Improved decision-making.
Faster decision-making.
Optimized internal business processes.
Increased operational efficiency.
New revenue opportunities.
A competitive advantage over close competitors.
3. What are different Business Intelligence tools available in the
market?
There are many Business Intelligence tools available in the market; among them, the following are the most popular:
Lecture 2:
1. What are the differences between an RDBMS schema and a data warehouse schema?
RDBMS Schema:
Used for OLTP systems
Highly normalized
Difficult to understand and navigate
Difficult to extract data from and to solve complex problems
DWH Schema:
Used for OLAP systems
De-normalized
Easy to understand and navigate
Relatively easy to extract data from and to solve complex problems
2. How are dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some instances, they are further normalized to 3NF.
3. What is a fact table?
A fact table contains the measurements, metrics, or facts of a business process.
4. What is a dimension table?
Dimension tables contain attributes that describe the fact records in the fact table. They contain hierarchies of attributes that aid in summarization. Examples: the time dimension and the product dimension.
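A minimal sketch of how fact and dimension tables work together, using pandas and invented table contents rather than a real warehouse: the fact table holds the measures, and the dimension tables supply the descriptive attributes used for summarization.

```python
import pandas as pd

# Hypothetical star-schema tables: one fact table plus two dimension tables.
fact_sales = pd.DataFrame({
    "time_id": [1, 1, 2], "product_id": [10, 11, 10],
    "units": [5, 2, 7], "revenue": [50.0, 40.0, 70.0],
})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024], "quarter": ["Q4", "Q1"]})
dim_product = pd.DataFrame({"product_id": [10, 11],
                            "product_name": ["Widget", "Gadget"],
                            "category": ["Hardware", "Hardware"]})

# Join the facts to their dimensions, then summarize along dimension attributes.
report = (fact_sales.merge(dim_time, on="time_id")
                    .merge(dim_product, on="product_id")
                    .groupby(["year", "category"])[["units", "revenue"]].sum())
print(report)
```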
5. What are the different types of schema used in data warehousing?
A schema defines the type of structure that is used by the database to hold data that can be related or unrelated.
There are three types of schema in a data warehouse, as follows:
- BUS schema: a schema composed of a master suite of conformed dimensions and standardized definitions, including the facts.
- Star schema: a schema that defines how the tables are organized and is used to retrieve results from the database quickly in a controlled environment.
- Snowflake schema: a schema in which a primary dimension table includes one or more further dimension tables that can be joined to it. Only the primary dimension table can join to the fact table.
6. What are the steps involved in designing a fact table?
A fact table allows the data to be represented at a detailed level and is associated with the dimension tables.
The steps required to design a fact table include:
- Identify the business process for analysis, so that the whole process can be defined and used with the complete details provided by the business process.
- Identify the measures, constraints, facts and keys before designing the fact table. Questions have to be asked during the business process about the table to be created and its purpose.
- Identify the dimensions for the facts, such as product, location and time. This phase also includes analysis of the components that need to be involved.
- List the columns that describe each dimension, and determine the lowest level of detail to be used in the fact table.
Lecture 3:
1. What are the business benefits? Have you figured out what you
can do with this?
Can you quantify and measure the benefits? Have you really worked
out what the
actual business problem is you are trying to solve?
This is an iterative set of questions and, to my mind, fits well with the CRISP-DM methodology laid out back in 1996 but still regularly used in real-life projects.
2. What technical know-how do I need? How technical do my users
need to be? Can
business users make sense of this without an in-depth background in
statistics? Is
the interface easy to use without programming skills?
These questions were being asked before the data scientist job title
had been invented
and I find that they're still common concerns when I talk to clients
today. I am convinced
that intelligent, data-literate individuals in your business can learn to
be excellent
analysts with a little expert hand holding.
3. How clear will the results be?
The results of your analysis should be presented to business users in
plain English,
accompanied with graphs using terms that business users can easily
comprehend and
use. You may substitute 'management' for 'business users' here too...
4. How effective is the data handling? How much data can the system
deal with? Can
we mine against the database directly?
The debate probably still rages on in the face of big data, but these
are very real
considerations. How do organisations deal with massive volumes of
data that they're
collecting and make sure they are using it as effectively as possible?
But what
happened to the sampling methods of classical statistics? Have we
moved to a whole
new way of thinking about data analysis?
5. What support will be available? How will I manage once the
system is installed? Will
I need specific maintenance staff? Another database administrator?
This is much less of a concern these days than it was thirteen years
ago. Analytics
today is folded into day-to-day system management and shouldn’t
really need a
dedicated support infrastructure.
6. What about the business users?
The original article was making a point here about having numerous
people doing
analysis and the importance of ensuring that lots of different groups
with different
perspectives could mine the data. I’m not sure this has worked out as
expected. These
days I tend to see organisations with separated groups of insight
analysts and data
scientists who liaise with the business users but mine the data on
their behalf. It will be
interesting to see how this changes as the predictive analytics and big
data fields further
mature.
Lecture 4:
1. What is the difference between classification and prediction?
A decision tree is a classification model built from existing data. If you apply it to new data, for which the class is unknown, you also get a prediction of the class. The assumption is that the new data comes from a similar distribution to the data you used to build your decision tree. In many cases this is a correct assumption, and that is why you can use the decision tree to build a predictive model.
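A short sketch of this distinction using scikit-learn's decision tree and invented loan-applicant data: the model is fit on labeled cases (classification) and then applied to new, unlabeled cases (prediction).

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical applicants: [age, income]; known labels: 1 = high risk, 0 = low risk.
X_known = [[25, 20000], [40, 90000], [35, 60000], [50, 120000], [23, 15000], [45, 80000]]
y_known = [1, 0, 0, 0, 1, 0]

# Classification: a model learned from existing data whose classes are known.
tree = DecisionTreeClassifier(random_state=0).fit(X_known, y_known)

# Prediction: the same model applied to new cases whose class is unknown,
# assuming they come from a similar distribution to the training data.
X_new = [[30, 25000], [48, 110000]]
print(tree.predict(X_new))
```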
2. What is Data Mining?
Data Mining is the process of finding new and potentially useful
knowledge from data.
A simpler definition is:
Data Mining is the art and science of finding interesting and useful
patterns in data.
See a basic introductory article "From Data Mining to Knowledge
Discovery in
Databases", U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, AI
Magazine 1996.
3. What are advantages of data mining?
Banks and financial institutions use data mining to find probable defaulters, based on past transactions, user behaviour and data patterns.
It also helps advertisers push the right advertisements to internet surfers on web pages, based on machine learning algorithms; in this way data mining benefits both prospective buyers and sellers of various products.
Retail malls and grocery stores also use it, for example to arrange and keep the most saleable items in the most attention-grabbing positions.
4. What are the cons of data mining?
Security: users are online at all times for various purposes, yet many systems do not have adequate security in place to protect their data. Some data mining analytics software is difficult to operate, so users require knowledge-based training. In addition, data mining techniques are not 100% accurate, which may cause serious consequences in certain conditions.
5. What is data mining process?
Data mining is the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in different databases. Because data mining is such an important process, it has become an advantage for various industries.
WEEK 4:
Lecture 1:
1. What are issues in data mining?
A number of issues need to be addressed by any serious data mining package:
Uncertainty handling
Dealing with missing values
Dealing with noisy data
Efficiency of algorithms
Constraining discovered knowledge to only what is useful
Incorporating domain knowledge
Size and complexity of data
Data selection
Understandability of discovered knowledge
Consistency between data and discovered knowledge
2. What are major elements of data mining, explain?
The major elements of data mining:
Extract, transform and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data with application software.
Present the data in a useful format, such as a graph or table.
3. What are some major data mining methods and algorithms?
Data mining is the process of extracting useful data, trends and patterns from a large amount of unstructured data. Some of the top data mining methods are classification, association, clustering and regression.
Lecture 2:
1. Why does loosely coupled architecture help to scale some types of
systems?
A loosely coupled architecture is generally helpful in scaling many
kinds of hardware
and software systems. This is one of the primary benefits of this type
of build. First,
loosely coupled systems are systems in which different components
or elements have
relatively little knowledge or interactive dependency on other parts
of the system. That
means they don't need as much close coordination – they may not
need to operate by
the same protocols, or be controlled by the same languages or
operating systems. All of
this can make for easier scaling or other changes where companies
need to make
alterations to the overall build of the system. For example,
companies may source
hardware parts in different ways, instead of having to order
everything from one
branded manufacturer.
Loosely coupled architectures can also allow for more independent
scaling. For
example, in a loosely coupled network, engineers could work on
improving the capacity
or performance of one node with less effect on the other nodes in
the system. The
rough idea is that these parts all work toward the same goals and
coordinate workflows,
but because they are less dependent, they can be scaled or adjusted
individually. Some
professionals refer to this as “horizontal scaling” or scaling at a
particular granular level.
This kind of functionality and versatility is important in modern
systems because
scalability is so much of a concern over time. Companies generally
start small and
grow. Their data needs grow as well. Whether they are utilizing cloud
providers or
working on scaling up a virtualized network system, they need to
understand how to
manage the growing pains that will inevitably occur. Even in a
modern hyperconverged
system where storage, compute and network elements are all
bundled together, similar
philosophies may still guide corporate planners in promoting better
scalability and a
more flexible hardware/software infrastructure.
2. What does Loose Coupling mean?
Loose coupling describes a coupling technique in which two or more hardware and software components are attached or linked together in such a way that the services they provide do not depend on one another. The term is used to describe the degree and intent of interconnected but non-dependent components within an information system.
Loose coupling is also known as low or weak coupling.
3. What does Tight Coupling mean?
Tight coupling is a coupling technique in which hardware and
software components are
highly dependent on each other. It is used to refer to the state/intent
of interconnectivity
between two or more computing instances in an integrated system.
Tight coupling is
also known as high coupling and strong coupling.
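A small Python illustration of the contrast (the classes and the store are invented): the tightly coupled report constructs one concrete dependency itself, while the loosely coupled version accepts any component that offers the same service and can therefore be swapped or scaled independently.

```python
# Tightly coupled: the report builder constructs a specific store class itself,
# so swapping or scaling the storage layer means changing this code.
class MySqlStore:
    def rows(self):
        return [("alice", 3), ("bob", 5)]

class TightReport:
    def __init__(self):
        self.store = MySqlStore()          # hard dependency on one concrete component

    def total(self):
        return sum(n for _, n in self.store.rows())

# Loosely coupled: the report builder only relies on "something with a rows() method",
# so any store (RAM, file, cloud service) can be plugged in without touching this class.
class LooseReport:
    def __init__(self, store):
        self.store = store                 # dependency injected, not constructed here

    def total(self):
        return sum(n for _, n in self.store.rows())

print(TightReport().total())
print(LooseReport(MySqlStore()).total())
```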
4. What is OLAP?
OLAP (Online Analytical Processing):
In the multidimensional model, data is organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. OLAP provides a user-friendly environment for interactive data analysis.
5. Explain tiers in the tight-coupling data mining architecture?
The data layer can be defined as the database or data warehouse system; this layer is an interface for all data sources.
The data mining application layer retrieves data from the database, and any required transformation routines are performed here.
The front-end layer provides an intuitive and friendly user interface for the end user.
Lecture 3:
Lecture 4:
1. What is data reduction?
It consists of the following three tasks:
Dimensionality reduction – attribute subset selection
Numerosity reduction – tuple subset selection
Discretization – reducing the cardinality of the active domain
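A brief sketch of the three tasks using pandas on invented customer data:

```python
import pandas as pd

# Invented customer data used to illustrate the three reduction tasks.
df = pd.DataFrame({
    "age":    [22, 35, 47, 51, 63, 29, 41, 58],
    "income": [20, 40, 55, 60, 70, 30, 50, 65],
    "customer_id": range(8),               # an attribute with no analytic value
})

# 1. Dimensionality reduction: attribute subset selection (drop an unneeded column).
reduced = df.drop(columns=["customer_id"])

# 2. Numerosity reduction: tuple subset selection (keep only a sample of the rows).
sample = reduced.sample(frac=0.5, random_state=0)

# 3. Discretization: shrink the active domain of a numeric attribute by binning it.
sample["income_band"] = pd.qcut(sample["income"], q=2, labels=["low", "high"])
print(sample)
```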
WEEK 5
Lecture 1:
1. What is Visualization?
Visualization is the depiction of data, used to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives and data representation schemas.
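For instance, a quick histogram (a matplotlib sketch with made-up transaction amounts) is often enough to gain initial intuition about a variable's distribution:

```python
import matplotlib.pyplot as plt

# Invented transaction amounts; a histogram gives quick intuition about the data.
amounts = [12, 15, 15, 18, 22, 25, 30, 30, 31, 45, 60, 120]
plt.hist(amounts, bins=6)
plt.xlabel("Transaction amount")
plt.ylabel("Frequency")
plt.title("Distribution of transaction amounts")
plt.show()
```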
Some of the application areas of data mining are:
DNA analysis
Market analysis
Financial data analysis
Banking industry
Retail Industry
Health care analysis
Telecommunication industry
2. What is Concept hierarchy?
A concept hierarchy reduces the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
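A tiny example of climbing this hierarchy with pandas, binning invented ages into the higher-level concepts mentioned above:

```python
import pandas as pd

# Replace low-level numeric ages with higher-level concepts.
ages = pd.Series([23, 37, 45, 61, 70, 29])
age_concepts = pd.cut(ages, bins=[0, 35, 55, 120],
                      labels=["young", "middle-aged", "senior"])
print(age_concepts.value_counts())
```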
3. What is data mining query language?
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. It is based on the Structured Query Language (SQL). Such query languages are designed to support ad hoc and interactive data mining and provide commands for specifying primitives. We can use DMQL to work with databases and data warehouses as well, and to define data mining tasks. In particular, we examine how to define data warehouses and data marts in DMQL.
4. Explain useful data mining queries?
Useful data mining queries can:
Apply the model to new data to make single or multiple predictions; input values can be provided as parameters or in a batch.
Get a statistical summary of the data used for training.
Extract patterns and rules, or the typical case representing a pattern in the model.
Extract regression formulas and other calculations that explain patterns.
Get the cases that fit a particular pattern.
Retrieve details about individual cases used in the model, including data not used in the analysis.
Retrain a model by adding new data, or perform cross-prediction.
Lecture 2:
1. What is Association rule?
Association rule mining finds interesting association or correlation relationships among a large set of data items, which can be used in decision-making processes. Association rules analyze buying patterns, that is, items that are frequently associated or purchased together.
2. What are the Applications of Association rule mining?
Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, clustering,
classification, etc.
3. What is support and confidence in Association rule mining.
Support s is the percentage of transactions in D that contain A ∪ B. Confidence c is the percentage of transactions in D containing A that also contain B.
Support(A => B) = P(A ∪ B)
Confidence(A => B) = P(B | A)
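A small sketch that computes these two measures over a handful of invented market baskets:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the set (i.e. A ∪ B)."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(B | A) = support(A ∪ B) / support(A)."""
    return (support(transactions, set(antecedent) | set(consequent))
            / support(transactions, antecedent))

# Invented market baskets.
baskets = [{"milk", "bread"}, {"milk", "diapers"}, {"milk", "bread", "beer"}, {"bread"}]
print(support(baskets, {"milk", "bread"}))        # 0.5
print(confidence(baskets, {"milk"}, {"bread"}))   # 0.666...
```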
4. What is the purpose of Apriori algorithm?
The name of the algorithm comes from the fact that it uses prior knowledge of frequent itemset properties to find frequent itemsets.
5. What are the two steps of Apriori algorithm?
Join step
Prune step
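A compact, illustrative implementation of the two steps on invented baskets (a teaching sketch, not an optimized library routine): the join step builds k-item candidates from frequent (k-1)-itemsets, and the prune step uses the prior knowledge that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Toy Apriori sketch: return all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {item for t in transactions for item in t}
    current = [s for s in (frozenset([i]) for i in items) if support(s) >= min_support]
    frequent, k = list(current), 2
    while current:
        prev = set(current)
        # Join step: form k-item candidates by unioning pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop any candidate that has an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "diapers", "beer"},
           {"bread", "diapers", "beer"}, {"milk", "bread", "diapers"}]
print(apriori(baskets, min_support=0.5))
```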
Lecture 3:
1. Apriori Algorithm – Pros and Cons
Pros:
Easy to understand and implement.
Can be used on large itemsets.
Cons:
At times a large number of candidate rules is needed, which can become computationally expensive.
It is also expensive to calculate support, because the calculation has to go through the entire database.
2. How to Improve the Efficiency of the Apriori Algorithm?
Use the following methods to improve the efficiency of the Apriori algorithm:
Transaction reduction – a transaction that does not contain any frequent k-itemset becomes useless in subsequent scans.
Hash-based itemset counting – a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent and can be excluded.
There are other methods as well, such as partitioning, sampling, and dynamic itemset counting.
WEEK 6
Lecture 1 (Coming Soon)
1. What is SQL?
SQL stands for Structured Query Language, and it is used to communicate with a database. It is a standard language used to perform tasks such as retrieval, update, insertion and deletion of data from a database.
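A minimal sketch of those four task types, run through Python's built-in sqlite3 module against an in-memory database (the table and values are invented):

```python
import sqlite3

# In-memory SQLite database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade TEXT)")

conn.execute("INSERT INTO students (name, grade) VALUES (?, ?)", ("Asha", "B"))   # insertion
conn.execute("UPDATE students SET grade = 'A' WHERE name = 'Asha'")               # update
print(conn.execute("SELECT id, name, grade FROM students").fetchall())            # retrieval
conn.execute("DELETE FROM students WHERE name = 'Asha'")                          # deletion
```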
2. What is a Database?
A database is an organized collection of data that allows easy access, storage, retrieval and management of the data. It is also known as a structured form of data which can be accessed in many ways.
Examples: a school management database, a bank management database.
3. What is the acronym of ACID?
ACID stands for Atomicity, Consistency, Isolation, and Durability. These properties, commonly known as the ACID properties, ensure the accuracy, completeness, and integrity of data.
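A small sqlite3 sketch of atomicity, using an invented transfer between two accounts: because the transfer runs in one transaction, a failure part-way through rolls back the update that already happened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 0.0)])
conn.commit()

# Atomicity: both updates of the transfer succeed together, or neither does.
try:
    with conn:                                   # opens a transaction; commits or rolls back
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'A'")
        row = conn.execute("SELECT balance FROM accounts WHERE name = 'A'").fetchone()
        if row[0] < 0:
            raise ValueError("insufficient funds")   # forces a rollback of the transaction
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'B'")
except ValueError:
    pass

print(conn.execute("SELECT * FROM accounts").fetchall())   # unchanged: [('A', 100.0), ('B', 0.0)]
```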
1. Define Classification?
Classification is a function that classifies items in a
collection to target categories or
classes. The main goal of classification is to accurately
predict the target class labels for
each case in the data. For example, a classification
model could be used to identify loan applicants as low, medium, or high credit risks.
2. What is a classifier?
A classifier is a supervised function where the learned attribute is categorical. It is used to classify new records (data) by assigning them the best target attribute (prediction) after the learning process. The target attribute can be one of k classes.
3. Different types of classifiers?
Classifiers are of the following types:
Perceptron
Naive Bayes
Decision Tree
Logistic Regression
K-Nearest Neighbor
Artificial Neural Networks / Deep Learning
Support Vector Machine
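As a quick illustration, two of these classifier types can be trained on the same labeled data with scikit-learn (the Iris dataset is used here purely as an example):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Two classifiers from the list above, fit on the same labeled data.
for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))   # training accuracy, for illustration only
```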
4. Why do we need classification in datamining?
Classification is a data mining function that assigns the items in a collection to target categories or classes. The main aim of classification is to accurately predict the target class for each case in the data. Continuous, floating-point values indicate a numerical target rather than a categorical one.
5. What is prediction?
Prediction identifies data values based purely on the description of other, related data items. It is not necessarily related to future events, but the variables involved are unknown. In data mining this is known as numeric prediction, and regression analysis is used for it.