
Data Mining E-text

WEEK 1
Lecture 1:
1. What are the foundations of data mining?
Data mining is the product of a long process of research and product development, an evolution that began when business data was first stored on computers and continued as organizations gained the ability to navigate their data in real time. Data mining is now popular in the business community because it is supported by three technologies that have matured: massive data collection, powerful multiprocessor computers, and data mining algorithms.
2. What is Data Mining?
Data mining may be defined as the application of statistical methods
to potentially quite
diverse data sets, in order to clarify relationships perhaps including
some previously
unknown to estimate their strength, or to accumulate data about real
or hypothetical
entities such as a person, a group of persons, a commercial
enterprise, or other entities or events. The results may then be used to make statements about
the real or estimated
characteristics of these entities, or to test hypotheses related to one
or more of the
systems with which they interact. Data mining relies on several
assumptions. These
include: (1) that one has access to a sufficient amount of data to be
useful for one’s
purposes—often associated with business, government, or research
interests, but
sometimes just idle curiosity; (2) that there is reason to believe much
of the data can be
regarded analytically as “noise” but one or more “signals” of interest
can be found by
intelligent searching or “mining;” (3) that the use of various analytical
tools,
predominantly statistical tools, can extract and amplify these signal(s)
and distinguish
them in some reliable manner from the noise, and (4) that the
uncertainties surrounding
the conclusions drawn from any such analysis are examined and
deemed acceptable.
The signals, of course, are then applied to the problem at hand, such
as identifying
potentially profitable customers or business locations, risk factors for
diseases,
unauthorized use of credit cards, incidents of bioterrorism, or
terrorist suspects.
3. What types of data are being mined?
Data that are mined pertain to individuals, businesses, or natural
events or conditions
such as weather patterns or contamination. The types of personal
data that are mined
include age, race, sex, marital status, income, education, medical
history, genetic
information, employment, travel itinerary, and buying patterns. The
data pertaining to
individuals may be specific to an identified person; may be
anonymized by removing
direct identifiers such as name, address, or social security number; or
may be
aggregated over geographic, demographic, or other variables. These
types of data
come from sources such as internal government records supporting a
program or
activity, government records classified as public and open to review,
and customer
transaction records obtained by business.

In determining whether data are sufficient to be mined for a specific purpose, those planning or implementing any data mining activity must consider the reliability of the data, estimates, and linkages involved in relation to the intended uses and the possible consequences to the individual should the results turn out to be wrong.
4. What are the beneficial uses of data mining and what potential
threats do these
activities pose?
Beneficial uses of data mining serve the public interest--for example,
in the form of more
efficient provision of goods and services by the government, by not-
for-profit
organizations, and by the private sector. In the federal government,
the three most
common applications of data mining are for improvements in service
and performance;
detecting fraud, waste, and abuse; and analyzing scientific and
research information.
Understanding patterns in the failure rates of ship parts, for example,
is key to creating
and maintaining a supply line capable of ensuring that the fleet will
not be disabled or
impaired because parts are not available. Within the Veterans
Administration,
compensation and pension data are regularly mined to detect
patterns that are
indicative of abuse or fraud, making the allocation of benefits fairer
for all.
All data mining strategies that use information on human systems are
potentially
abusive, both by having individual information disclosed without
consent and by linking
records in databases that separately are not a threat to privacy but
together give
organizations the capacity to identify specific persons. It follows that
the more complex
the systems of linked databases the more serious are the threats to
privacy and the
more numerous the ethical dilemmas. It follows further that, as
organizations pursue
plans to become more efficient in their delivery of goods, services,
and messages that
support their cause, the inherent ethical dangers become
increasingly inevitable. Most
important, resolving the dilemmas involved is made even more
difficult by the fact that
laws regarding breaches of confidentiality have not done a good job
of setting limits on
the database development being used to create predictive models.
5. What is the scope of data mining?
Automated prediction of trends and behaviours: data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data. Targeted marketing is a typical example of predictive data mining, where data on past promotional mailings is used to identify the targets most likely to respond to future mailings.
Automated discovery of previously unknown patterns: data mining tools sweep through databases and identify previously hidden patterns in one step. A very good example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.
6. What are the advantages of data mining?
Banks and financial institutions use data mining to find probable defaulters, based on past transactions, user behaviour and data patterns.
It also helps advertisers push the right advertisements to internet surfers on web pages, based on machine learning algorithms. In this way data mining benefits both the possible buyers and the sellers of the various products.
Retail malls and grocery stores use it to arrange and keep the most sellable items in the most prominent positions.
7. What is required for technological drivers in data mining?
Database size: to maintain and process the huge amount of data, we need powerful systems.
Query complexity: to analyze a large number of complex queries, we need an even more powerful system.
8. A brief introduction to data mining knowledge discovery?
Most people do not differentiate data mining from knowledge discovery, while others view data mining as an essential step in the larger process of knowledge discovery in databases.
9. What are issues in data mining?
A number of issues need to be addressed by any serious data mining package:
Uncertainty handling
Dealing with missing values
Dealing with noisy data
Efficiency of algorithms
Constraining the knowledge discovered to only what is useful
Incorporating domain knowledge
Size and complexity of data
Data selection
Understandability of discovered knowledge: consistency between the data and the discovered knowledge
10.What are the major elements of data mining, explain?
The major elements of data mining are components that:
Extract, transform and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.

Lecture 2:
1. What is a data warehouse?
A data warehouse is a collection of data primarily used for reporting
and analysis. It’s a
way to provide business analysts and other users with a centralized
repository of
accurate, enterprise data from which to glean timely analytic insights
that guide
everyday business decisions.
Data typically flows into the data warehouse from transaction systems, databases, and
other internal and external data sources. Data warehouses usually
include historical
data derived from transactions, but can also include real-time
streaming data as well as
other sources.
2. What is a data mart?
A data mart is a subset of data (typically a subset of a data
warehouse) that is focused
on a specific area of the business. When users need only data about
one subject area
or department, a data mart is sometimes the answer. Think of it as providing a boutique
providing a boutique
store versus a giant warehouse for their data needs. A data mart
versus a data
warehouse can make it easier for business users and analysts to
locate and discover
the answers they need more quickly.
3. What is big data?
Big data is the term used to describe extremely large or complex data
sets that require
advanced analytical tools to analyze. Unlike traditional data, big data
typically exhibits
five characteristics that set it apart from other data sets: volume,
variety, velocity,
variability and veracity.
It’s these characteristics that make big data so valuable for innovation
and insight. For
instance, it can feed machine-learning algorithms to make artificial
intelligence-driven
processes smarter. Big data can also be integrated with smaller
enterprise data sets for
more comprehensive, granular insights.
4. What is metadata?
Metadata is data about data. It describes the structure of the data
warehouse, data
vault or data mart. Metadata captures all the information necessary
to move and
transform data from source systems into your data infrastructure and
helps users
understand and interpret the data in the data infrastructure. It is also
the foundation for
aspects such as documentation and lineage.

5. What is the difference between ETL and ELT?
ETL (extraction, transformation, and loading) is a technique for
extracting and moving
information from upstream source systems to downstream targets.
Traditionally, ETL
was used to move data from source systems to a data warehouse,
with transformation
(reformatting) happening outside of the data warehouse.
ELT, on the other hand, approaches data movement differently by
using the target data
platform to transform the data. The data is copied to the target and
transformed in
place. Because ELT doesn’t need a transformation engine, it is a more
flexible and agile
approach that is less complex and time-consuming than ETL to
develop and maintain.
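To make the contrast concrete, here is a minimal, hedged sketch (not taken from any particular product) using Python's built-in sqlite3 module; the orders table and its columns are made-up assumptions. The ETL path transforms rows in application code before loading, while the ELT path copies the raw rows into the target first and then transforms them in place with SQL.

```python
# Minimal sketch contrasting ETL and ELT with Python's sqlite3; the "orders"
# table and its columns are made-up assumptions, not from the text.
import sqlite3

source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1250), (2, 9900)])
rows = source.execute("SELECT id, amount_cents FROM orders").fetchall()

# ETL: transform outside the target (here, in Python), then load finished rows.
target.execute("CREATE TABLE orders_clean_etl (id INTEGER, amount_dollars REAL)")
transformed = [(oid, cents / 100.0) for oid, cents in rows]
target.executemany("INSERT INTO orders_clean_etl VALUES (?, ?)", transformed)

# ELT: copy the raw data into the target first, then transform in place with SQL.
target.execute("CREATE TABLE orders_raw (id INTEGER, amount_cents INTEGER)")
target.executemany("INSERT INTO orders_raw VALUES (?, ?)", rows)
target.execute("""CREATE TABLE orders_clean_elt AS
                  SELECT id, amount_cents / 100.0 AS amount_dollars
                  FROM orders_raw""")
target.commit()

print(target.execute("SELECT * FROM orders_clean_etl").fetchall())
print(target.execute("SELECT * FROM orders_clean_elt").fetchall())
```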

Lecture 3:
1. What is OLAP?
OLAP means On Line Analytical Processing, and is a way of analyzing
large amounts of
multi-dimensional data in real time by performing various operations
on data dimensions.
2. Are there any 3-D cubes or hyper-cubes on the screen?
Not ‘per se’. An OLAP cube is merely a conceptual notion for organizing and displaying
and displaying
data that can be ‘viewed by something’: by time-period, by
geography, by type, etc. The
word “by” is an indication that the data is dimensional.
3. What types of data can be stored?
Kinetica organizes data in a manner similar to a standard relational
database. Each
database consists of tables, each defined by a schema. Data is
strongly typed for each
field in the schema and can be double, float, int, long, string, or
bytes.
If you are using our native API, your interface to the system is that of
an object-based
datastore, with each object corresponding to a row in the table.
4. Where is the data stored?
Kinetica stores data in-memory. It is able to utilize both system RAM
and vRAM. (the
memory available on the GPU card itself).
The benefit to storing data in vRAM is that the transfer time is very,
very fast. The
downside is that vRAM is expensive and limited in capacity–currently
24GB on an
NVIDIA K80.
For larger datasets, system RAM allows the database to scale to much
larger volumes
and scale out across many nodes of standard hardware. Data stored
in main memory
can be efficiently fed to the GPU, and this process is even more
efficient with the
NVLink architecture.

Lecture 4:
1. Where is the big data trend going?
Eventually the big data hype will wear off, but studies show that big
data adoption will
continue to grow. With a projected $16.9B market by 2015, it is clear
that big data is
here to stay. However, the big data talent pool is lagging behind and
will need to catch
up to the pace of the market. McKinsey & Company estimated in May
2011 that by
2018, the US alone could face a shortage of 140,000 to 190,000
people with deep
analytical skills as well as 1.5 million managers and analysts with the
know-how to use
the analysis of big data to make effective decisions.
The emergence of big data analytics has permanently altered many
businesses’ way of
looking at data. Big data can take companies down a long road of
staff, technology, and
data storage augmentation, but the payoff – rapid insight into never-
before-examined
data – can be huge. As more use cases come to light over the coming
years and
technologies mature, big data will undoubtedly reach critical mass
and will no longer be
labeled a trend. Soon it will simply be another mechanism in the BI
ecosystem.
2. Who are some of the BIG DATA users?
From cloud companies like Amazon to healthcare companies to
financial firms, it seems
as if everyone is developing a strategy to use big data. For example,
every mobile
phone user has a monthly bill which catalogs every call and every
text; processing the
sheer volume of that data can be challenging. Software logs, remote
sensing
technologies, information-sensing mobile devices all pose a challenge
in terms of the
volumes of data created. The size of Big Data can be relative to the
size of the
enterprise. For some, hundreds of gigabytes may be enough to prompt consideration; for others, it takes tens or hundreds of terabytes.
3. What is Hadoop?
The Apache Hadoop software library allows for the distributed
processing of large data
sets across clusters of computers using a simple programming model.
The software
library is designed to scale from single servers to thousands of machines, each server using local computation and storage. Instead of relying on hardware to deliver high availability, the library itself handles failures at the application layer.
As a result, the
impact of failures is minimized by delivering a highly-available service
on top of a cluster
of computers. Hadoop is a distributed computing platform written in
Java. It incorporates
features similar to those of the Google File System and of
MapReduce.
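The following pure-Python word count is only an illustration of the map, shuffle and reduce phases that the MapReduce model is built around; it is not the Hadoop API, and the two sample documents are made up.

```python
# A pure-Python word-count sketch of the MapReduce model that Hadoop implements.
# This only illustrates the map -> shuffle/group -> reduce idea, not Hadoop itself.
from collections import defaultdict

documents = ["big data needs hadoop", "hadoop scales to big clusters"]

# Map phase: emit (key, value) pairs from each input record.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'hadoop': 2, ...}
```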
4. Who supports and funds Hadoop?
Hadoop is one of the projects of the Apache Software Foundation.
The main Hadoop
project is contributed to by a global network of developers. Sub-
projects of Hadoop are
supported by the world’s largest Web companies, including Facebook
and Yahoo.

5. Why is Hadoop popular?
Hadoop’s popularity is partly due to the fact that it is used by some
of the world’s largest
Internet businesses to analyze unstructured data. Hadoop enables
distributed
applications to handle data volumes in the order of thousands of
exabytes.
6. Where does Hadoop find applicability in business?
Hadoop, as a scalable system for parallel data processing, is useful for
analyzing large
data sets. Examples are search algorithms, market risk analysis, data
mining on online
retail data, and analytics on user behavior data.
Hadoop’s scalability makes it attractive to businesses because of the
exponentially
increasing nature of the data they handle. Another core strength of
Hadoop is that it can
handle structured as well as unstructured data, from a variable
number of sources.
7. How has Hadoop evolved over the years?
Hadoop originally derives from Google’s implementation of a
programming model called
MapReduce. Google’s MapReduce framework could break down a
program into many
parallel computations, and run them on very large data sets, across a
large number of
computing nodes. An example use for such a framework is search
algorithms running
on Web data.
Hadoop, initially associated only with web indexing, evolved rapidly
to become a leading
platform for analyzing big data. Cloudera, an enterprise software
company, began
providing Hadoop-based software and services in 2008.
In 2012, GoGrid, a cloud infrastructure company, partnered with
Cloudera to accelerate
the adoption of Hadoop-based business applications. Also in 2012,
Dataguise, a data
security company, launched a data protection and risk assessment
tool for Hadoop.
WEEK 2:
Lecture 1:

1. What is Datawarehousing?
A data warehouse is a repository of data used for management decision support systems. A data warehouse consists of a wide variety of data that presents a coherent picture of business conditions at a single point in time.
In a single sentence, it is a repository of integrated information which is available for queries and analysis.
2. What is Data Mining in Healthcare?
The healthcare industry collects a dazzling array of data, most of it in the form of electronic health
records (EHRs) collected by HIPAA covered health care facilities.
According to a survey
published by PubMed, data mining is becoming increasingly popular
in healthcare, if not
increasingly essential. The huge amounts of data generated by
healthcare EDI
transactions cannot be processed and analyzed using traditional
methods because of
the complexity and volume of the data.
Data mining provides the methodology and technology for
healthcare organizations to:
evaluate treatment effectiveness
save lives of patients using predictive medicine
manage healthcare at different levels
3. What is Fraud detection?
There are many definitions of fraud, depending on the point of view considered. According to The American Heritage Dictionary, Second College Edition, fraud is defined as ‘a deception deliberately practiced in order to secure unfair or unlawful gain’. Davia et al. (2000) paraphrase this as a number of items that must be identified when articulating a case of fraud:
a victim
details of the deceptive act thought to be fraudulent
4. What does fraud detection software do?
Fraud detection models are oriented towards detecting fraudulent transactions in a financial data stream (e.g., one related to credit cards).

Lecture 2:
1. Explain about data warehouse architecture?
The data warehouse architecture is based on a relational database
management
system. In the data warehouse architecture, operational data and
processing are
completely separate from data warehouse processing. This central
information
repository is surrounded by a number of key components designed
to make the entire
environment functional, manageable and accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.
2. What is ETL?
ETL is abbreviated as Extract, Transform and Load. ETL software reads the data from a specified data source and extracts a desired subset of data. Next, it transforms the data using rules and lookup tables and converts it to the desired state. Finally, the load function is used to load the resulting data into the target database.
3. What is Datamart?
A data mart is a specialized version of a data warehouse. It contains a snapshot of operational data that helps business people make decisions based on the analysis of past trends and experiences. A data mart emphasizes easy access to relevant information.
4. What is called data cleaning?
As the name implies, it is a largely self-explanatory term: the cleaning up of orphan records, data breaching business rules, inconsistent data and missing information in a database.
5. What is Metadata?
Metadata is defined as data about the data. The metadata contains
information such as the number of columns used, fixed width or limited width, the ordering of fields and the data types of the fields.
6. What are the languages used in Data cleansing?
R – Programming language, SQL– Structure Query Language, Advance
Excel Macros.

Lecture 3:
1. What are the tools available for ETL?
Following are the ETL tools available:
Informatica
Data Stage
Oracle Warehouse Builder
Ab Initio
Data Junction
2. What are the fundamental skills of a Data Architect?
The fundamental skills of a Data Architect are as follows:
1. The individual should possess knowledge about data modeling in
detail
2. Physical data modeling concepts
3. Should be familiar with ETL process
4. Should be familiar with Data warehousing concepts
5. Hands-on experience with data warehouse tools and different
software
6. Should have experience in terms of developing data strategies
7. Build data policies and plans for executions
3. How to become a data architect?
The following are the prerequisites for an individual to start his
career in Data Architect.
1) A bachelor's degree is essential and preferably in computer science
background
2) No predefined certifications are necessary, but it is always good to have a few certifications related to the field, because some companies might expect them. It is advisable to go through the CDMP (Certified Data Management Professional) certification.
3) Should have at least 3-8 years of IT experience.
4) Should be creative, innovative and good at problem-solving.
5) Has good programming knowledge and data modeling concepts
6) Should be well versed with the tools like SOA, ETL, ERP, XML etc
4. Differentiate between dimension and attribute?
In short, dimensions represent qualitative data. For example, data such as plan, product and class are all considered dimensions.
The attribute is nothing but a subset of a dimension. Within a
dimension table, we will
have attributes. The attributes can be textual or descriptive. For
example, product name
and product category are nothing but an attribute of product
dimensions.
5. Explain the different data models that are available in detail?
There are three different kinds of data models that are available and
they are as follows:
1. Conceptual
2. Logical
3. Physical
Conceptual data model:
As the name itself implies that this data model depicts the high-level
design of the
available physical data.
Logical data model:
Within the logical model, the entity names, entity relationships,
attributes, primary keys
and foreign keys will show up.
Physical data model:
This data model gives out more information and showcases how the model is implemented in the database. All the primary keys, foreign keys, table names and column names will show up.

Lecture 4:
1. Explain about data warehouse architecture?
The data warehouse architecture is based on a relational database
management
system. In the data warehouse architecture, operational data and
processing are
completely separate from data warehouse processing. This central
information
repository is surrounded by a number of key components designed
to make the entire
environment functional, manageable and accessible both by the operational systems that source data into the warehouse and by end-user query and analysis tools.
2. How can the Banking sector get benefited from data warehouse
technologies?
The banking industry benefits greatly from data warehouse systems for processing share and investment reports. These financial reports can be collected from different sources and stored in a data warehouse, giving investors a view of share performance and financial growth. Data warehouse technology is essential for storing and tracking reports for risk management and fraud management, and for deciding which loans and credit cards to provide, generating more interest income in support of the banking sector and industry.
3. How can a political leader get benefited from data warehouse
technologies?
Political organizations benefit from data warehouse systems for processing voter reports. These voter reports can be collected from different sources and stored in a data warehouse. For members, data warehouse technology is essential for tracking performance and economic improvement, and for storing and tracking reports for risk management, fraud management and the facilities to be provided all over the country.
WEEK 3
Lecture 1:
1. What is Business Intelligence?
The term ‘Business Intelligence’ (BI) refers to providing the user with the data and tools needed to answer the decision-making questions that are important to an organization, whether they relate to running the whole business or a part of it. In short, business intelligence is used for reporting the specified data of a business which is very important, and from which the higher management of an organization will take decisions for the growth of their business.
Typically, an organization can make the following kinds of decisions with a Business Intelligence tool:
BI is used to determine whether a business is running as per plan.
BI is used to identify which things are actually going wrong.
BI is used to take and monitor corrective actions.
BI is used to identify the current trends of their business.
2. What are different stages and benefits of Business Intelligence?
There are following five stages of Business Intelligence:
Data Source: extracting data from multiple data sources.
Data Analysis: providing a proper analysis report based on useful knowledge drawn from a collection of data.
Decision-Making Support: using the information in the proper way; the aim is to provide proper graphs on important events such as takeovers, market changes and poor staff performance.
Situation Awareness: filtering out irrelevant information and setting the remaining information in the context of the business and its environment.
Risk Management: discovering what corrective actions might be taken, or decisions made, at different times.
Following are different benefits of Business Intelligence:
Improved decision making.
Faster decision making.
Optimized internal business processes.
Increased operational efficiency.
Helping to drive new revenues.
Gaining an advantage over close competitors in competitive markets.
3. What are different Business Intelligence tools available in the
market?
There are a lot of business intelligence tools available in the market; the most popular among them are:
Oracle Business Intelligence Enterprise Edition (OBIEE)
Cognos
Microstrategy
SAS Business Intelligence
Business Objects
Tableau
Microsoft Business Intelligence Tool
Oracle Hyperion System
4. Explain Fact and Dimension table with an example.
A fact table is the central table in the star schema of a data warehouse. It holds quantitative information for analysis and is most of the time de-normalized.
A dimension table is one of the important tables in the star schema of a data warehouse; it stores the attributes, or dimensions, that describe the objects in a fact table.
A fact table mainly holds two types of columns: the foreign key columns allow joins with dimension tables, and the measure columns contain the data that is being analyzed.
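As a hedged illustration of the fact/dimension split, the sketch below builds a tiny, made-up sales fact table and product dimension table with pandas, joins them on the foreign key, and aggregates a measure by a dimension attribute.

```python
# Toy star-schema sketch with pandas: a sales fact table joined to a product
# dimension table on its foreign key. Table contents are made-up examples.
import pandas as pd

dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Soap", "Shampoo"],
    "category": ["Personal Care", "Personal Care"],
})

fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2],     # foreign key into the dimension table
    "quantity": [3, 5, 2],        # measure columns being analyzed
    "amount": [6.0, 10.0, 9.0],
})

# Join facts to the dimension, then summarize a measure by a dimension attribute.
report = (fact_sales.merge(dim_product, on="product_key")
                    .groupby("product_name")["amount"].sum())
print(report)
```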
5. Why do we need Business Intelligence Architecture?
Much before an organization starts adopting a business intelligence
architecture, there
are series of indicators which accelerate the case for building a BI
system. There are
many important factors, but the key ones include:
• Backlog of business requests: IT department is under a lot of
pressure to fulfil the
report requests from various business users.
• Need for self-service BI: Business users are stuck as they need to
depend on IT
for even minor pieces of information. This hinders their decision-
making process
and forms a bottleneck for smooth operation.
• Messed up IT system: Silos of data, different data formats, disparate
data and
applications – these will form a complex IT system, building a justified
case for a
stronger BI infrastructure.
• Cost: The cost of maintaining information silos and committing a huge number of IT resources for even small sets of data is detrimental to an organization.
These factors push the organizations to build a business intelligence
architecture that
will seek to help them make better decisions. A solid architecture will
help in structuring
the process of improving business intelligence and helps implement
the Business
Intelligence strategy in a very cost effective way.
BI architecture, among other elements, often includes both
structured and unstructured
data. This data comes from both internal and external sources and
is transformed
from raw transaction data into logical information.

6. What are the components of a Business Intelligence architecture?
One mistake that the top leaders of many organizations make is to think of their BI system as equivalent to the front-end BI tools being used. Then there is another set of technical experts who discuss business intelligence architecture in terms of fancy jargon without giving due importance to what exactly comprises a BI architecture.
The key elements of a business intelligence architecture are:
• Source systems
• ETL process
• Data modelling
• Data warehouse
• Enterprise information management (EIM)
• Appliance systems
• Tools and technologies

Lecture 2:

1. What are the differences between an RDBMS schema and a data warehouse (DWH) schema?
RDBMS Schema
Used for OLTP systems
Highly Normalized
Difficult to understand and navigate
Difficult to extract and solve complex problems
DWH Schema
Used for OLAP systems
De-normalized
Easy to understand and navigate
Relatively easier in extracting the data and solving complex
problems
2. How are dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some instances they are further normalized to 3NF.
3. What is a fact table?
A fact table contains the measurements, metrics or facts of a business process.
4. What is a dimension table?
Dimension tables contain attributes that describe fact records in the
fact
table. Dimension tables contain hierarchies of attributes that aid in
summarization. Example: Time Dimension, Product Dimension
5. What are the different types of schema used in data warehousing?
A schema defines the type of structure used by the database to hold data that can be related or different.
There are three types of schema used in data warehousing:
- BUS schema: composed of a master suite of conformed dimensions and standardized definitions of the facts.
- Star schema: defines how the tables are organized, and is used to retrieve results from the database quickly in a controlled environment.
- Snowflake schema: shows a primary dimension table joined to one or more further dimension tables; only the primary dimension table joins directly to the fact table.
6. What are the steps involved in designing a fact table
The fact table allows the data to be represented in a detailed manner. It is associated with the dimension tables.
The steps required to design the fact table include:
- Identify the business process for analysis so that all the process can
be
defined and used with the complete details provided by the business
process.
- Identify the measures, constraints, facts and keys before designing the fact table. Questions must be asked during the business process about the table to be created and its purpose.
- Identify the dimensions for facts like product dimensions, location,
time,
etc. This phase also includes the analysis of the components that are required to be involved.
- List the columns that describe each dimension, and determine the lowest level of detail to be used in the fact table.

Lecture 3:
1. What are the business benefits? Have you figured out what you
can do with this?
Can you quantify and measure the benefits? Have you really worked
out what the
actual business problem is you are trying to solve?
This is an iterative set of questions and to my mind fits well with the
CRISP DM
methodology laid out back in 1996 but still regularly used in real life
projects.
2. What technical know-how do I need? How technical do my users
need to be? Can
business users make sense of this without an in-depth background in
statistics? Is
the interface easy to use without programming skills?
These questions were being asked before the data scientist job title
had been invented
and I find that they're still common concerns when I talk to clients
today. I am convinced
that intelligent, data-literate individuals in your business can learn to
be excellent
analysts with a little expert hand holding.
3. How clear will the results be?
The results of your analysis should be presented to business users in
plain English,
accompanied with graphs using terms that business users can easily
comprehend and
use. You may substitute 'management' for 'business users' here too...
4. How effective is the data handling? How much data can the system
deal with? Can
we mine against the database directly?
The debate probably still rages on in the face of big data, but these
are very real
considerations. How do organisations deal with massive volumes of
data that they're
collecting and make sure they are using it as effectively as possible?
But what
happened to the sampling methods of classical statistics? Have we
moved to a whole
new way of thinking about data analysis?
5. What support will be available? How will I manage once the
system is installed? Will
I need specific maintenance staff? Another database administrator?
This is much less of a concern these days than it was thirteen years
ago. Analytics
today is folded into day-to-day system management and shouldn’t
really need a
dedicated support infrastructure.
6. What about the business users?
The original article was making a point here about having numerous
people doing
analysis and the importance of ensuring that lots of different groups
with different
perspectives could mine the data. I’m not sure this has worked out as
expected. These
days I tend to see organisations with separated groups of insight
analysts and data scientists who liaise with the business users but mine the data on
their behalf. It will be
interesting to see how this changes as the predictive analytics and big
data fields further
mature.

Lecture 4:
1. What is the difference between classification and prediction?
The decision tree is a classification model, applied to existing data. If
you apply it to new
data, for which the class is unknown, you also get a prediction of the
class. The
assumption is that the new data comes from a similar distribution
as the data you
used to build your decision tree. In many cases this is a correct
assumption and that is
why you can use the decision tree for building a predictive model.
2. What is Data Mining?
Data Mining is the process of finding new and potentially useful
knowledge from data.
A simpler definition is:
Data Mining is the art and science of finding interesting and useful
patterns in data.
See a basic introductory article "From Data Mining to Knowledge
Discovery in
Databases", U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, AI
Magazine 1996.
3. What are advantages of data mining?
Banks and financial institutions use data mining to find probable defaulters, based on past transactions, user behaviour and data patterns.
It also helps advertisers push the right advertisements to internet surfers on web pages, based on machine learning algorithms. In this way data mining benefits both the possible buyers and the sellers of the various products. Retail malls and grocery stores use it to arrange and keep the most sellable items in the most prominent positions.
4. What are the cons of data mining?
Security: users spend a great deal of time online for various purposes, yet many services do not have security systems in place to protect their data. Usability: some data mining analytics software is difficult to operate and requires users to have knowledge-based training. Accuracy: the techniques of data mining are not 100% accurate, which may cause serious consequences in certain conditions.
5. What is data mining process?
Data mining is the process of discovering hidden, valuable knowledge by analyzing large amounts of data stored in different databases. Because it is such an important process, it has become an advantage for various industries.
WEEK 4:
Lecture 1:
1. What are issues in data mining?
A number of issues need to be addressed by any serious data mining package:
Uncertainty handling
Dealing with missing values
Dealing with noisy data
Efficiency of algorithms
Constraining the knowledge discovered to only what is useful
Incorporating domain knowledge
Size and complexity of data
Data selection
Understandability of discovered knowledge: consistency between the data and the discovered knowledge
2. What are major elements of data mining, explain?
The major elements of data mining are components that:
Extract, transform and load transaction data onto the data warehouse system.
Store and manage the data in a multidimensional database system.
Provide data access to business analysts and information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a graph or table.
3. What are some major data mining methods and algorithms?
Data mining is the process of extracting useful data, trends and
patterns from a large
amount of unstructured data. Some of the top data mining methods
are as follows:
classification, association of data (association rules), clustering and regression.

Lecture 2:
1. Why does loosely coupled architecture help to scale some types of
systems?
A loosely coupled architecture is generally helpful in scaling many
kinds of hardware
and software systems. This is one of the primary benefits of this type
of build. First,
loosely coupled systems are systems in which different components
or elements have
relatively little knowledge or interactive dependency on other parts
of the system. That
means they don't need as much close coordination – they may not
need to operate by
the same protocols, or be controlled by the same languages or
operating systems. All of
this can make for easier scaling or other changes where companies
need to make
alterations to the overall build of the system. For example,
companies may source
hardware parts in different ways, instead of having to order
everything from one
branded manufacturer.
Loosely coupled architectures can also allow for more independent
scaling. For
example, in a loosely coupled network, engineers could work on
improving the capacity
or performance of one node with less effect on the other nodes in
the system. The
rough idea is that these parts all work toward the same goals and
coordinate workflows,
but because they are less dependent, they can be scaled or adjusted
individually. Some
professionals refer to this as “horizontal scaling” or scaling at a
particular granular level.
This kind of functionality and versatility is important in modern
systems because
scalability is so much of a concern over time. Companies generally
start small and
grow. Their data needs grow as well. Whether they are utilizing cloud
providers or
working on scaling up a virtualized network system, they need to
understand how to
manage the growing pains that will inevitably occur. Even in a
modern hyperconverged
system where storage, computer and network elements are all
bundled together, similar
philosophies may still guide corporate planners in promoting better
scalability and a
more flexible hardware/software infrastructure.
2. What does Loose Coupling mean?
Loose coupling describes a coupling technique in which two or more
hardware and
software components are attached or linked together so that the services they provide are not dependent on one another. This term is used to describe the
degree and intent of
interconnected but non-dependent components within an
information system.
Loose coupling is also known as low or weak coupling.
3. What does Tight Coupling mean?
Tight coupling is a coupling technique in which hardware and
software components are
highly dependent on each other. It is used to refer to the state/intent
of interconnectivity
between two or more computing instances in an integrated system.
Tight coupling is
also known as high coupling and strong coupling.

4. What is OLAP?
OLAP (Online Analytical Processing):
In the multidimensional model, data is organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies. OLAP provides a user-friendly environment for interactive data analysis.
5. Explain tiers in the tight-coupling data mining architecture?
The data layer can be defined as the database layer; it is an interface for all data sources.
The data mining application layer retrieves data from the database; some transformation routines are performed here.
The front-end layer provides an intuitive and friendly user interface for the end user.

Lecture 3:

1. What is data cleansing?
Information quality is the key consideration in determining the value
of the information.
The developer of the data warehouse is not usually in a position to
change the quality of
its underlying historic data, though a data warehousing project can
put spotlight on the
data quality issues and lead to improvements for the future. It is,
therefore, usually
necessary to go through the data entered into the data warehouse
and make it as error
free as possible.
This process is known as Data Cleansing.
Data Cleansing must deal with many types of possible errors. These
include missing
data and incorrect data at one source; inconsistent data and
conflicting data when two or more sources are involved. There are several algorithms followed to
clean the data,
which will be discussed in the coming lecture notes.
2. What are the types of concept hierarchies?
A concept hierarchy defines a sequence of mappings from a set of
low-level concepts to
higher-level, more general concepts. Concept hierarchies allow
specialization, or drilling
down, whereby concept values are replaced by lower-level concepts.
3. How concept hierarchies are useful in data mining?
A concept hierarchy for a given numerical attribute defines a
discretization of the
attribute. Concept hierarchies can be used to reduce the data by
collecting and
replacing low-level concepts (such as numerical values for the
attribute age) with
higher-level concepts (such as youth, middle-aged, or senior).
Although detail is lost by
such data generalization, the generalized data may be more
meaningful and easier to
interpret.
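A minimal sketch of such a generalization, assuming pandas and illustrative bin boundaries for the age attribute:

```python
# Sketch of climbing a concept hierarchy for the numerical attribute "age"
# using pandas; the bin boundaries here are illustrative assumptions.
import pandas as pd

ages = pd.Series([19, 25, 42, 58, 67, 71])
age_group = pd.cut(ages,
                   bins=[0, 29, 59, 120],
                   labels=["youth", "middle-aged", "senior"])
print(age_group.value_counts())
```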
4. How do you clean the data? (Nov/Dec 2011)
Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
For missing values (a short sketch of strategies 3-5 follows after this answer):
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
For noisy data:
1. Binning: binning methods smooth a sorted data value by consulting the values around it.
2. Regression: data can be smoothed by fitting the data to a function, such as with regression.
3. Clustering: outliers may be detected by clustering, where similar values are organized into groups, or clusters.
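A short pandas sketch of missing-value strategies 3-5 above; the column names and values are made-up examples:

```python
# Hedged sketch of three of the missing-value strategies, using pandas.
import pandas as pd

df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "income": [50_000, None, 30_000, None],
})

# 3. Use a global constant to fill in the missing value.
filled_const = df["income"].fillna(-1)

# 4. Use the attribute mean to fill in the missing value.
filled_mean = df["income"].fillna(df["income"].mean())

# 5. Use the attribute mean for all samples belonging to the same class.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))

print(filled_const.tolist(), filled_mean.tolist(), filled_class_mean.tolist())
```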

Lecture 4:
1. What is Data reduction ?
Data reduction consists of the following three tasks (a short sketch follows below):
Dimensionality reduction - attribute subset selection
Numerosity reduction - tuple subset selection
Discretization - reducing the cardinality of the active domain
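A hedged sketch of all three tasks using pandas and scikit-learn; the toy data, the number of components and the number of bins are assumptions for illustration only:

```python
# Minimal sketch of the three data-reduction tasks listed above.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

# Dimensionality reduction: project 4 attributes down to 2 components.
reduced = PCA(n_components=2).fit_transform(df)

# Numerosity reduction: keep a 10% random sample of the tuples.
sample = df.sample(frac=0.1, random_state=0)

# Discretization: reduce the cardinality of attribute "a" to 3 bins.
bins = pd.cut(df["a"], bins=3, labels=["low", "medium", "high"])

print(reduced.shape, sample.shape, bins.value_counts().to_dict())
```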
WEEK 5
Lecture 1:
1. What is Visualization?
Visualization is the depiction of data, used to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives and data representation schemas.
Some of the application areas of data mining are:
DNA analysis
Market analysis
Financial data analysis
Banking industry
Retail Industry
Health care analysis
Telecommunication industry
2. What is Concept hierarchy?
A concept hierarchy reduces the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
3. What is data mining query language?
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. It is based on the Structured Query Language (SQL). These query languages are designed to support ad hoc and interactive data mining. It also
provides commands for
specifying primitives. We can use DMQL to work with databases and
data warehouses
as well. We can also use it to define data mining tasks. Particularly we
examine how to
define data warehouses and data marts in DMQL.
4. Explain useful data mining queries?
Useful data mining queries can:
Apply the model to new data, to make single or multiple predictions; input values can be provided as parameters or in a batch.
Get a statistical summary of the data used for training.
Extract patterns and rules, or the typical case representing a pattern in the model.
Extract regression formulas and other calculations that explain patterns.
Get the cases that fit a particular pattern.
Retrieve details about individual cases used in the model, including data not used in the analysis.
Retrain a model by adding new data, or perform cross-prediction.

Lecture 2:
1. What is Association rule?
Association rule finds interesting association or correlation
relationships among a large
set of data items which is used for decision-making processes.
Association rule analysis, for example, examines buying patterns: items that are frequently associated or purchased together.
2. What are the Applications of Association rule mining?
Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, clustering,
classification, etc.
3. What is support and confidence in Association rule mining.
Support s is the percentage of transactions in D that contain A ∪ B. Confidence c is the percentage of transactions in D containing A that also contain B.
support(A => B) = P(A ∪ B)
confidence(A => B) = P(B | A)
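These definitions can be computed directly; the transaction set D and the itemsets A and B below are made-up examples:

```python
# Computing support(A => B) and confidence(A => B) from the definitions above.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
A, B = {"bread"}, {"milk"}

n_AB = sum(1 for t in D if (A | B) <= t)   # transactions containing A U B
n_A = sum(1 for t in D if A <= t)          # transactions containing A

support = n_AB / len(D)        # P(A U B) = 2/4 = 0.5
confidence = n_AB / n_A        # P(B | A) = 2/3
print(support, confidence)
```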
4. What is the purpose of Apriori algorithm?
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties to find frequent itemsets.
5. What are the two steps of Apriori algorithm?
Join step
Prune step
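A minimal sketch of one join-and-prune iteration; the frequent 2-itemsets are a made-up example, and counting support against the database (the expensive part) is omitted:

```python
# Sketch of one Apriori iteration: the join step builds candidate (k+1)-itemsets
# from frequent k-itemsets, and the prune step drops any candidate that has an
# infrequent k-subset.
from itertools import combinations

L2 = [frozenset(s) for s in [{"bread", "milk"},
                             {"bread", "butter"},
                             {"milk", "butter"},
                             {"milk", "eggs"}]]

# Join step: union pairs of frequent 2-itemsets that differ by one item.
candidates = {a | b for a, b in combinations(L2, 2) if len(a | b) == 3}

# Prune step: every 2-subset of a candidate 3-itemset must itself be frequent.
pruned = {c for c in candidates
          if all(frozenset(sub) in L2 for sub in combinations(c, 2))}

print(candidates)   # includes e.g. {bread, milk, butter} and {bread, milk, eggs}
print(pruned)       # only {bread, milk, butter} survives pruning
```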

Lecture 3:
1. Apriori Algorithm – pros and cons
Pros:
Easy to understand and implement
Can be used on large itemsets
Cons:
At times, a large number of candidate rules are generated, which can become computationally expensive.
It is also expensive to calculate support, because the calculation has to go through the entire database.
2. How to Improve the Efficiency of the Apriori Algorithm?
Use the following methods to improve the efficiency of the apriori
algorithm.
Transaction Reduction – A transaction not containing any frequent
k-itemset
becomes useless in subsequent scans.
Hash-based Itemset Counting – a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent and can be excluded.
There are other methods as well such as partitioning, sampling, and
dynamic itemset
counting.
WEEK 6
Lecture 1:
1. What is SQL?
SQL stands for Structured Query Language, and it is
used to communicate with the
Database. This is a standard language used to perform
tasks such as the retrieval, updating, insertion and deletion of data from a database.
2. What is a Database?
Database is nothing but an organized form of data for
easy access, storing, retrieval
and managing of data. This is also known as structured
form of data which can be
accessed in many ways.
Example: School Management Database, Bank
Management Database.
3. What is the acronym of ACID?
ACID stands for Atomicity, Consistency, Isolation, and
Durability − commonly known as
ACID properties − in order to ensure accuracy,
completeness, and data integrity.
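A short, hedged sketch of these ideas using Python's built-in sqlite3 module; the students table is a made-up example. It shows retrieval, insertion, update and deletion, plus atomicity via a rolled-back transaction:

```python
# Basic SQL tasks and atomicity demonstrated with Python's built-in sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)")
cur.execute("INSERT INTO students (name, marks) VALUES (?, ?)", ("Asha", 82))  # insertion
cur.execute("UPDATE students SET marks = 85 WHERE name = ?", ("Asha",))        # update
print(cur.execute("SELECT name, marks FROM students").fetchall())              # retrieval
con.commit()

# Atomicity: either the whole transaction is applied, or none of it is.
try:
    cur.execute("INSERT INTO students (name, marks) VALUES ('Ravi', 90)")
    raise RuntimeError("something failed mid-transaction")
except RuntimeError:
    con.rollback()   # the pending insert of 'Ravi' is discarded

print(cur.execute("SELECT COUNT(*) FROM students").fetchone())   # still (1,)
cur.execute("DELETE FROM students WHERE name = 'Asha'")           # deletion
```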

1. What is pre-pruning?
A tree is pruned by halting its construction early. Upon
halting, the node becomes a leaf.
The leaf may hold the most frequent class among the
subset samples.
2. What is a Dbscan?
Density-Based Spatial Clustering of Applications with Noise is called DBSCAN. DBSCAN is a density-based clustering method that converts high-density regions of objects into clusters with arbitrary shapes and sizes. DBSCAN defines a cluster as a maximal set of density-connected points.
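A hedged DBSCAN sketch using scikit-learn; the points, eps and min_samples values are assumptions chosen so that two dense regions become clusters and one isolated point is labeled noise (-1):

```python
# DBSCAN sketch with scikit-learn: two dense blobs become two clusters and a
# far-away point is labeled noise (-1).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],      # dense region -> cluster 0
    [8.0, 8.0], [8.1, 8.1], [7.9, 8.2],      # dense region -> cluster 1
    [50.0, 50.0],                            # isolated point -> noise
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1 -1]
```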

1. What is the general objective of Association Rules mining?
The objective of association rules mining is to discover
interesting relations between
objects in large databases.

1. Define Classification?
Classification is a function that classifies items in a
collection to target categories or
classes. The main goal of classification is to accurately
predict the target class labels for
each case in the data. For example, a classification
model could be used to identify the
loan applicants as low, medium, or high credit risks.
2. What is a classifier?
A classifier is a supervised function where the learned
attribute is categorical. It is used
to classify new records (data) by giving them the best
target attribute (prediction) after
the learning process. The target attribute can take one of k class values.
3. Different types of classifiers?
Classifiers are of the following types
Perceptron.
Naive Bayes.
Decision Tree.
Logistic Regression.
K-Nearest Neighbor.
Artificial Neural Networks/Deep Learning.
Support Vector Machine.
4. Why do we need classification in datamining?
Classification is a data mining function that assigns the items in a collection to target categories or classes. The main aim of classification is to accurately predict the target class for each case in the data. Continuous, floating-point values indicate a numerical target rather than a categorical one.
5. What is prediction?
Prediction identifies data values purely from the description of other, related data. It is not necessarily related to future events; rather, the variables being predicted are unknown. In data mining this is known as numeric prediction, and regression analysis is used for it.
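A hedged scikit-learn sketch contrasting classification (categorical target) with numeric prediction (regression); the tiny credit-risk and spend datasets are made up:

```python
# Classification (categorical target) versus numeric prediction (regression).
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a class label ("low"/"high" credit risk) from features.
X_cls = [[25, 20_000], [30, 90_000], [45, 30_000], [50, 120_000]]  # [age, income]
y_cls = ["high", "low", "high", "low"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[40, 100_000]]))        # -> a class label, e.g. ['low']

# Numeric prediction: predict a continuous value (spend) via regression.
X_reg = [[1], [2], [3], [4]]               # e.g. years as a customer
y_reg = [100.0, 180.0, 260.0, 340.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))                  # -> a number, approximately [420.]
```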

1. What is classification and prediction?
The decision tree is a classification model, applied to existing data. If you apply it to new data, for which the class is unknown, you also get a prediction of the class. If you use a classification model to predict, say, the treatment outcome for a new patient, that is a prediction.
2. What is the difference between classification and
regression?
Regression and classification are categorized under the same umbrella of supervised machine learning. The main difference between them is that the output variable in regression is numerical (or continuous), while that for classification is categorical (or discrete).
3. What is difference between correlation and
regression?
Regression describes how an independent variable is numerically related to the dependent variable. Correlation is used to represent the linear relationship between two variables. In correlation there is no distinction between dependent and independent variables, i.e. the correlation between x and y is the same as between y and x.
4. What is DMDW classification?
Classification is a data mining function that assigns
items in a collection to target
categories or classes. The goal of classification is to
accurately predict the target class
for each case in the data. For example, a classification
model could be used to identify
loan applicants as low, medium, or high credit risks.
5. Can a prediction be scientific?
A scientific theory which is contradicted by observations and evidence will be rejected. Notions that make no testable predictions are usually considered not to be part of science (protoscience or nescience) until testable predictions can be made.

1. What do you mean by a decision tree?
A decision tree is a graph that uses a branching method
in which every possible
outcome of a decision is illustrated. In general, they can
be used to assign time or other
values to possible outcomes so that decisions can be
automated.
2. How does decision tree work?
It breaks down a dataset into smaller and smaller
subsets while at the same time a
decision tree is developed correspondingly. The final
output will be a tree with decision
nodes and leaf nodes. A decision node has two or more
branches. Leaf node denotes a
classification or decision.
3. How do you create a decision tree?
Seven steps for creating a decision tree:
1. Start the tree: draw a rectangle near the left edge of the page to represent the first node.
2. Add branches.
3. Add leaves.
4. Add more branches.
5. Complete the decision tree.
6. Terminate a branch.
7. Verify accuracy.
4. Why do we use decision tree?
Decision trees are frequently used in decision analysis to help identify the strategy with the highest likelihood of achieving a goal. They are an effective technique because they are easy to use and understand: they are easy to create and visually simple to follow.
5. How is a decision tree used for classification?
The decision tree is a common method used in data mining. The main goal is to create a model that predicts the value of a target attribute based on several input attributes. A decision tree, or classification tree, is a tree in which each internal node is labeled with an input feature.
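A small, hedged scikit-learn sketch of a classification tree; the weather-style dataset is made up, and export_text prints the internal nodes (tests on input features) and the leaf class labels:

```python
# A small decision-tree classification sketch on a made-up weather-style dataset.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [temperature_celsius, humidity_percent]; target: play tennis or not.
X = [[30, 85], [27, 90], [21, 70], [18, 65], [24, 80], [15, 60]]
y = ["no", "no", "yes", "yes", "no", "yes"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["temperature", "humidity"]))
print(tree.predict([[20, 68]]))   # expected: ['yes'] for a cool, drier day
```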
6. What is a Decision Tree?
It’s a way of visualizing the different treatment options
and decision points that come up
during the course of illness; from diagnosis to the
completion of treatment.
