
Unit-I

2 marks questions
1) Define data mining.
=Data mining refers to extracting or mining knowledge from large amounts of data.
2) What is knowledge discovery in databases?
=Knowledge Discovery in Databases (KDD) is the process of identifying valid, potentially useful, and ultimately understandable structures in data. The process involves selecting or sampling data from a data warehouse, cleaning and preprocessing it, transforming and reducing it, applying a data mining component to produce a structure, and then evaluating the derived structure.
3) List any four stages of KDD.
= Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation.
4) What is Loosely Coupled DBMS and Tightly Coupled DBMS?
= Tightly Coupled DBMS: In a tightly coupled DBMS, the components or subsystems are highly
integrated and interact closely with each other.
Loosely Coupled DBMS: In a loosely coupled DBMS, the components or subsystems within the
system operate relatively independently and communicate with each other through standardized
interfaces or protocols.
5) What is prediction and description in Data Mining?
= In data mining, a description refers to a summary or characterization of patterns, trends, or
relationships discovered in a dataset.
Predictive data mining is used to make predictions about future events.
6) List any four discovery driven tasks.
=Clustering Analysis, Association rule mining, Anomaly detection, sequential pattern mining.
7) Define Support and Confidence.
= In data mining, support and confidence are key metrics used in association rule mining.
Support measures the frequency or occurrence of a particular itemset in a dataset.
Confidence measures the reliability or strength of an association rule between two itemsets.
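For illustration, here is a minimal Python sketch (the transactions and the rule are invented for the example) that computes support and confidence for the rule {milk} -> {bread}:

```python
# Minimal sketch: support and confidence for the rule {milk} -> {bread}
# over a small, hypothetical list of transactions.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "eggs"},
]

antecedent = {"milk"}
consequent = {"bread"}
both = antecedent | consequent

# Support: fraction of transactions containing the whole itemset.
support = sum(both <= t for t in transactions) / len(transactions)

# Confidence: support of the full itemset divided by support of the antecedent.
confidence = support / (sum(antecedent <= t for t in transactions) / len(transactions))

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# support = 0.50, confidence = 0.67
```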
8) Write the objectives of clustering.
= Identifying Natural Groupings
Understanding Data Distribution
Data Reduction
Anomaly Detection
9) Define Rough Set.
= In data mining, rough set theory is a mathematical framework used for data analysis and
knowledge discovery, particularly in dealing with uncertainty and vagueness in data.
10) What is sequence mining and spatial data mining
= Sequential data mining involves the extraction of patterns or knowledge from sequential data,
where the order of data points is crucial. It's commonly used in fields like natural language
processing
Spatial data mining involves the extraction of patterns or knowledge from spatial data, which
typically includes geographical or spatial information. This can involve analyzing data such as
maps, satellite imagery, GPS data, and geographical databases to discover trends, relationships,
or anomalies within spatially distributed data.
11) Write the subtasks of web mining.
= Web Content Mining
Web Structure Mining
Web Usage Mining
Web Data Integration
Web Information Retrieval
12) Expand KDT and IE.
=KDT: Knowledge Discovery in Text (knowledge discovery from textual databases)
IE: Information Extraction
13) What is web mining and text mining?
= Web mining is the process of extracting useful information or patterns from the World Wide
Web. It involves techniques from data mining, machine learning, and statistics to analyze web
data, including web content, structure, and usage logs. Web mining can be used for various
purposes such as improving search engine performance, understanding user behavior, and
extracting market intelligence.
Text mining, also known as text analytics, is the process of deriving high-quality information
from textual data. It involves analyzing unstructured text data to discover patterns, trends, and
insights. Text mining techniques typically include natural language processing (NLP), machine
learning, and statistical analysis to extract valuable information from large volumes of text.
14) List any two issues and challenges in data mining.
= Data Quality: Data mining heavily relies on the quality of the data being analyzed. Poor-
quality data, such as missing values, inconsistencies, or inaccuracies, can significantly impact the
results and conclusions drawn from the analysis.
Overfitting: Overfitting occurs when a data mining model captures noise or random fluctuations
in the training data rather than the underlying patterns.
15) List the two application areas of data mining.
= Data Mining in Education
Data Mining in Healthcare
Data Mining in Fraud Detection
Data Mining in Lie Detection
Data Mining in Market Basket Analysis

16) Explain the use of Data Visualization?


= Data visualization is the graphical representation of data and information, making it quick
and easy for users to understand patterns, trends, and outliers.
17) Define Machine Learning?
= Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data.
18) Define Supervised and Unsupervised Learning?
= Supervised learning is a machine learning method in which models are trained using labeled
data. In supervised learning, models need to find the mapping function to map the input variable
(X) with the output variable (Y).
Unsupervised learning is another machine learning method in which patterns are inferred from
unlabeled input data. The goal of unsupervised learning is to find structure and patterns in
the input data. Unsupervised learning does not need any supervision.
19) Explain the two fundamental goals of Data Mining?
= Prediction involves using some variables or fields in the database to predict unknown or future
values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.
20) Define Discovery Model? Give an example?
= Data discovery is the process of identifying patterns, trends, and insights within a
dataset. It includes collecting data from various sources and then applying advanced data
analytical techniques to identify the patterns and themes within the collected data.
An example of a discovery model in data mining is the recommendation system used by
streaming platforms like Netflix or Spotify.
21) What is Frequent Episodes? List its types?
= In data mining, frequent episodes refer to sequences of events that occur frequently within a
dataset. These sequences could represent patterns of behavior, occurrences in a time series, or
sequences of actions.
Types of frequent episodes include:
1. Sequential pattern mining
2. Episode rule mining
22) What is Deviation Detection? Give example?
= Deviation detection in data mining, also known as anomaly detection, is the process of
identifying patterns or instances that deviate from the norm or expected behavior within a
dataset. These deviations, or anomalies, can represent unusual events, errors, or potential fraud.
An example of deviation detection in data mining is fraud detection in financial transactions.

Unit 2
2 marks questions
1) What is data warehouse?
= A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema and that usually resides at a single site.
2) How are organizations using the information from data warehouses?
= A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its
analytical capabilities allow organizations to derive valuable business insights from their data to
improve decision-making.

3) Expand OLTP and OLAP.


= OLTP: Online Transaction Processing
OLAP: Online Analytical Processing
4) What is data cube? How it is defined?
= A data cube is formed when data are grouped or combined into multidimensional matrices. The
data cube method has a few alternative names or variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."
5) Define dimensions and facts.
= Dimensions: Dimensions represent the categorical attributes or descriptors by which data is
analyzed, categorized, or viewed. They provide context and descriptive information about the
data.
Facts: Facts represent the numerical, measurable data or metrics that are being analyzed or
reported on. They are typically numeric values that can be aggregated, summarized, or analyzed.
6) Define base cuboid and apex cuboid.
= In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest
level of summarization is called a base cuboid.
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid.
7) Define measure and mention any two categories.
= In data mining, measures refer to the numerical values or metrics that are being analyzed or
predicted. Measures are typically the target variables or outcomes of interest in a data mining
task.
Data mining measures can be categorized or arranged into three categories: holistic,
distributive, and algebraic.
8) Define concept hierarchy. Give an example.
= In data mining, a concept hierarchy refers to the organization of data into a tree-like
structure, where each level of the hierarchy represents a concept that is more general than
the level below it. For example, for the dimension location, a concept hierarchy might be
street < city < state/province < country.
9) Write any two differences between OLAP system and Statistical database.
=Purpose and Usage:
OLAP systems are designed for interactive analysis of multidimensional data, allowing users to
explore and analyze data from different perspectives, drill down into details, and perform
complex analytical queries.
Statistical databases, on the other hand, are optimized for storing and managing large volumes of
structured statistical data, often collected from surveys, experiments, or observational studies.
Data Structure and Representation:
OLAP systems use multidimensional data models to represent data in the form of cubes or
hypercubes, with dimensions representing different aspects of the data (e.g., time, product,
geography) and measures representing numerical metrics or KPIs.
Statistical databases typically store data in tabular or relational formats, where each row
represents an observation or case, and each column represents a variable or attribute.
10) What is Starnet Model? Give an example through diagram.
=The starnet query model consists of radial lines emanating from a central point, where each
line represents a concept hierarchy for a dimension. Each abstraction level in a hierarchy is
called a footprint. For example, a time dimension line may have the footprints day, month,
quarter, and year.

11) List any two benefits of Business analyst by having data warehouse.
= Quickly analyze data for various business applications.
Improve decision-making speed and efficiency.
Maintain the accuracy and reliability of data.
Reduce costs related to data storage and management.

12) List any two data warehouse models from the architecture point of view.
=1) Enterprise warehouse
2) Data mart
3) Virtual warehouse
13) List any four back-end tools and utilities included in data warehouse.
= Data extraction, data cleaning, data transformation, load, and refresh.
14) Define data cleaning and data integration.
=Data cleaning is the process of correcting or removing inaccurate, corrupted, improperly
formatted, duplicated, or incomplete data from a dataset.
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
15) Define data transformation and data reduction.
=Data transformation is a technique used to convert raw data into a format suitable for data
mining and the retrieval of strategic information. It may incorporate data cleaning and data
reduction techniques to convert the data into the appropriate form.
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information.
16) What is Dimensionality Reduction? Mention any one method used for Dimensionality
Reduction.
=Dimensionality reduction is a way of converting a higher-dimensional dataset into a
lower-dimensional dataset while ensuring that it provides similar information. Methods include:
Principal component analysis (PCA)
Missing value ratio
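A minimal PCA sketch using only NumPy, on an invented 6x3 data matrix (in practice a library such as scikit-learn would usually be used):

```python
import numpy as np

# Hypothetical data: 6 samples, 3 features.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.2],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1],
              [2.3, 2.7, 0.6]])

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 3x3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort components by variance explained
components = eigvecs[:, order[:2]]      # keep the top 2 principal components

X_reduced = Xc @ components             # project 3-D data down to 2-D
print(X_reduced.shape)                  # (6, 2)
```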
17) List any two methods for data discretization.
= There are two forms of data discretization first is supervised discretization, and the second is
unsupervised discretization.
18) Write the difference between Operational database and Data Warehouse?
=
Operational Database:
o Designed to support high-volume transaction processing (OLTP).
o Usually concerned with current data.
o Data are updated frequently according to need.
o Designed for real-time business operations and processes.
o Optimized to perform fast inserts and updates of relatively small volumes of data.
o "Data in": generally application-oriented.

Data Warehouse:
o Designed to support high-volume analytical processing (OLAP).
o Usually concerned with historical data.
o Non-volatile: new data may be added regularly, but once added it is rarely changed.
o Designed for analysis of business measures by subject area, category, and attribute.
o Optimized to perform fast retrievals of relatively large volumes of data.
o "Data out": generally subject-oriented.

19) Define ROLAP and MOLAP?


=ROLAP stands for Relational Online Analytical Processing; ROLAP servers store and manage
warehouse data using a relational or extended-relational DBMS.
MOLAP stands for Multidimensional Online Analytical Processing; MOLAP servers store data in
array-based multidimensional structures.
20) What is Data mart? List its types?
=A data mart is a subset of a data warehouse, generally oriented to a specific business line,
department, or subject area, built to serve particular business needs.

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts


o Independent Data Marts

21) Explain Virtual Warehouse?


=A virtual warehouse is a set of views over operational databases. For efficient query
processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.
22) What type of information a Metadata Repository should include?
= A metadata repository should include a description of the data warehouse structure (schema,
views, dimensions, hierarchies, and data mart locations); operational metadata such as data
lineage and data currency; the algorithms used for summarization; mappings from the
operational environment to the data warehouse; data related to system performance; and
business metadata such as business terms, definitions, and data ownership information.
23) What is Metadata Repository?
=A metadata repository is a database created to store metadata. Metadata is information about
the structures that contain the actual data; it is often described as "data about data."
24) Explain the approaches to data cleaning as a process?

=1. Remove duplicate or irrelevant observations

2. Fix structural errors

3. Filter unwanted outliers


4. Handle missing data

5. Validate and QA
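A minimal pandas sketch of these steps on invented data (the column names, values, and the age threshold are illustrative assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicate, a structural error, and a missing value.
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "carol "],
    "age":  [23, 23, np.nan, 31],
})

df = df.drop_duplicates()                         # 1. remove duplicate observations
df["name"] = df["name"].str.strip().str.title()   # 2. fix structural errors
df = df[(df["age"].isna()) | (df["age"] < 120)]   # 3. filter impossible outliers
df["age"] = df["age"].fillna(df["age"].median())  # 4. handle missing data
assert df["age"].notna().all()                    # 5. validate and QA
print(df)
```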

25) Define Binning?


=Data binning, also called discrete binning or bucketing, is a data pre-processing technique used
to reduce the effects of minor observation errors. It is a form of quantization. The original data
values are divided into small intervals known as bins, and then they are replaced by a general
value calculated for that bin.
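A small NumPy sketch of equal-width binning with smoothing by bin means, on invented values:

```python
import numpy as np

values = np.array([3, 7, 8, 12, 15, 18, 22, 25])   # hypothetical raw values

# Equal-width binning into 3 bins, then smoothing by bin means.
edges = np.linspace(values.min(), values.max(), 4)  # 3 bins need 4 edges
bin_ids = np.clip(np.digitize(values, edges) - 1, 0, 2)

smoothed = np.array([values[bin_ids == b].mean() for b in bin_ids])
print(bin_ids)   # which bin each value falls into
print(smoothed)  # each value replaced by its bin mean
```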
26) What are the use of Data scrubbing tool and Data Auditing tool?
=Data Scrubbing Tools:
Data scrubbing (also known as data cleansing or data cleaning) tools use domain knowledge to
identify and correct errors, inconsistencies, and inaccuracies in the data.
Data Auditing Tools:
Data auditing tools find discrepancies by analyzing the data to discover rules and
relationships, and by monitoring and tracking changes to the data over time to ensure data
quality, compliance, and security.
27) What is Entity Identification Problem?
=The entity identification problem occurs during data integration. When data from multiple
sources are integrated, the same real-world entity may be named or represented differently in
different sources (for example, customer_id in one database and cust_number in another);
matching these correctly is the challenge, and unmatched duplicates become redundant.
28) List the processes involved in data transformation?
=1. Data Smoothing
2. Attribute Construction
3. Data Aggregation
4. Data Normalization
5. Data Discretization
6. Data Generalization
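As a small illustration of step 4 (data normalization), the sketch below applies min-max and z-score normalization to an invented attribute:

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization to [0, 1]: (x - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / standard deviation
zscore = (x - x.mean()) / x.std()

print(minmax)  # [0.    0.125 0.25  0.5   1.   ]
print(zscore)
```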
29) What is the use of Attribute Subset Selection?
=Attribute subset selection is a technique used for data reduction in the data mining process.
It reduces the dataset size by removing irrelevant or redundant attributes, so that the data
can be used for analysis more efficiently.
30) What is Histograms? Mention the use of Multidimensional histograms
=A histogram is a graphical representation of the frequency distribution of continuous series
using rectangles. The x-axis of the graph represents the class interval, and the y-axis shows the
various frequencies corresponding to different class intervals.

Multidimensional histograms are used in data science for analyzing and visualizing the
distribution of multidimensional data.
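A short NumPy sketch (with randomly generated data) contrasting a 1-D histogram with a 2-D, i.e. multidimensional, histogram:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 1000)   # hypothetical continuous data
y = rng.normal(0, 1, 1000)

# 1-D histogram: frequency counts per class interval (bin).
counts, edges = np.histogram(x, bins=10)

# 2-D (multidimensional) histogram: joint frequencies over two variables.
counts2d, xedges, yedges = np.histogram2d(x, y, bins=10)

print(counts)          # frequencies per interval of x
print(counts2d.shape)  # (10, 10) grid of joint frequencies
```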
Unit IV
2 marks questions
1)What is Classification?
=Classification is a process of categorizing data or objects into predefined classes or categories
based on their features or attributes.
2) What is prediction?
=In data science, prediction refers to the process of using data and statistical algorithms to make
informed guesses about future outcomes or trends based on historical data.
3) What is regression Analysis?
=Regression analysis is a statistical method for modeling the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
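A minimal NumPy sketch fitting a simple linear regression to invented data:

```python
import numpy as np

# Hypothetical data: predictor x, target y with a roughly linear relationship.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")

# Predict the target for a new value of the predictor.
print(slope * 6 + intercept)
```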
4) Write steps involved in data classification?
=1. Perform a risk assessment for sensitive data
2. Establish a data classification policy
3. Categorize the types of data
4. Identify data locations
5. Identify and classify data
6. Use results to improve security and compliance
5) What is supervised learning?
=Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled
data means some input data is already tagged with the correct output.
6) What is unsupervised learning?
=Unsupervised learning is a type of machine learning that learns from unlabeled data. This
means that the data does not have any pre-existing labels or categories. The goal of unsupervised
learning is to discover patterns and relationships in the data without any explicit guidance.
7) What is correlation Analysis?
=Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Correlation analysis calculates
the level of change in one variable due to the change in the other.
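A minimal NumPy sketch computing the Pearson correlation coefficient for two invented variables:

```python
import numpy as np

# Hypothetical paired observations of two variables.
hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 68, 70, 79])

# Pearson correlation coefficient r, in [-1, 1].
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```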
8) What is Decision tree induction?
=Decision tree induction is the learning of a decision tree from class-labeled training data.
The goal is to build a model that can accurately predict the outcome of a given event based on
the values of the attributes in the dataset.
9) What is Decision tree?
=Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
10) What is tree pruning?
=Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
11) What is the use of tree selection measures?
=Attribute (tree) selection measures provide a heuristic for selecting the splitting criterion
that best separates a given data partition into individual classes; the attribute that scores
best under the measure is chosen as the splitting attribute at each node. Common measures are
information gain, gain ratio, and the Gini index.
12) Expand CART, ID3
=CART: Classification And Regression Trees
ID3: Iterative Dichotomiser 3
13) What is splitting criterion? List distinct splitting criteria’s values.
=A splitting criterion indicates which attribute to test at a node and how to partition the
tuples. The three distinct cases are: (1) A is discrete-valued, with one branch created for
each known value of A; (2) A is continuous-valued, with two outcomes, A <= split_point and
A > split_point; (3) A is discrete-valued and a binary tree is required, with the test
A in S_A, where S_A is the splitting subset.
14) What is attribute selection measure? List popular attribute measures.
=An attribute selection measure (ASM) is a criterion used in decision tree algorithms to
evaluate the usefulness of different attributes for splitting a dataset. Popular measures
include information gain, gain ratio, and the Gini index.
15) What is information Gain? Write the formula to find expected information needed to
classify a tuple.
= Information gain is the attribute selection measure used to select the best splitting
attribute in a dataset (for example, to choose the root node). The attribute with the highest
information gain is selected for the splitting criterion.
Gain(A) = Info(D) − Info_A(D)
where Gain(A) is the information gain of attribute A;
Info(D) = −Σ (i = 1 to m) p_i log2(p_i) is the entropy, i.e. the expected information needed
to classify a tuple in D; and
Info_A(D) = Σ (j = 1 to v) (|D_j| / |D|) × Info(D_j) is the expected information required
after partitioning D on A.
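A small Python sketch of these formulas; the class counts follow a classic textbook-style example (9 "yes" and 5 "no" tuples, split into three partitions):

```python
import math

def entropy(counts):
    """Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Hypothetical dataset D: 9 tuples of class "yes", 5 of class "no".
info_D = entropy([9, 5])

# Partitioning D on attribute A yields three subsets with these class counts.
partitions = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(p) for p in partitions)
info_A = sum((sum(p) / n) * entropy(p) for p in partitions)

gain_A = info_D - info_A
print(f"Info(D) = {info_D:.3f}, Info_A(D) = {info_A:.3f}, Gain(A) = {gain_A:.3f}")
# Gain(A) comes out to about 0.247 bits for these counts.
```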
16) What is entropy?
=Entropy is the measure of the degree of randomness or uncertainty in the dataset. In the case of
classifications, It measures the randomness based on the distribution of class labels in the dataset.
17) Define gain ratio. Write formula.
= Gain Ratio is a measure that takes into account both the information gain and the number of
outcomes of a feature to determine the best feature to split on.
Gain Ratio = Information Gain / Split Info
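A minimal Python sketch of the gain ratio, with an invented gain value and partition sizes:

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(info_gain, partition_sizes):
    """Gain ratio = information gain / split info, where split info is the
    entropy of the partition sizes themselves (penalizes many-valued splits)."""
    split_info = entropy(partition_sizes)
    return info_gain / split_info

# Hypothetical split: gain of 0.247 bits over partitions of sizes 5, 4, and 5.
print(f"{gain_ratio(0.247, [5, 4, 5]):.3f}")
```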
18) Define Gini index. Write formula.
=The Gini index is a measure of impurity or inequality in statistical and economic settings.
In machine learning, it is used as an impurity measure in decision tree algorithms (e.g.,
CART) for classification tasks. It measures the probability of a randomly chosen sample being
misclassified, and its value ranges from 0 (perfectly pure) to 1 (maximally impure).
Gini(D) = 1 − Σ (i = 1 to m) p_i^2, where p_i is the proportion of tuples in D belonging to
class i.
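A minimal Python sketch of the Gini index formula:

```python
def gini(counts):
    """Gini(D) = 1 - sum(p_i^2) over the class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([9, 5]))   # mixed classes: about 0.459 (impure)
print(gini([14, 0]))  # single class: 0.0 (perfectly pure)
```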
19) List the use of MDL.
=The Minimum Description Length (MDL) principle is used in decision tree induction as an
attribute selection measure and as a basis for tree pruning: the tree requiring the fewest
bits to encode (the tree plus its exceptions) is preferred, which penalizes overly complex
trees and reduces overfitting.
20) Write two approaches to tree pruning.
= Pre-pruning Approach
Post-pruning Approach

21) What is pessimistic pruning?


= The Pessimistic Error Pruning algorithm is a top-down pruning algorithm. This means we start
at the root and then, for each decision node, we check from top to bottom to determine whether it
is relevant for the final result or should be pruned.
22) Expand and briefly explain BOAT.
=BOAT (Bootstrapped Optimistic Algorithm for Tree construction) is a decision tree algorithm
that uses bootstrapping to create several smaller samples of the training data, each of which
fits in memory. A tree is constructed from each sample, and the trees are combined to produce
the final tree. BOAT requires only two scans of the database and can also handle incremental
updates.
23) What is eager learners? Give examples.
= Eager learners construct a classification model from the training data before receiving new
data to classify; predictions are fast, but training may require more computation and memory.
Examples: decision trees, support vector machines (SVM), neural networks.
24) What is Instance based Learners?
= Instance-based learners are systems that learn the training examples by heart and then
generalize to new instances based on some similarity measure. They are called instance-based
because the hypotheses are built from the training instances themselves.
25) What is clustering? Write the advantages.
= Clustering is a way of grouping data points into clusters consisting of similar data points;
objects with possible similarities remain in a group that has little or no similarity with
other groups.
Advantages: it works on unlabeled data, reveals natural groupings and the data distribution,
supports data reduction, and helps detect anomalies.
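A minimal k-means sketch in pure NumPy on invented 2-D data (a toy illustration of partitioning-based clustering, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 2-D points drawn around two centers, (0, 0) and (5, 5).
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),
               rng.normal([5, 5], 0.5, (20, 2))])

k = 2
centers = X[[0, 20]].copy()  # initialize with one point from each blob (for the sketch)
for _ in range(10):          # a few Lloyd iterations
    # Assign each point to its nearest centroid.
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(axis=2), axis=1)
    # Move each centroid to the mean of its assigned points.
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers)  # should land near (0, 0) and (5, 5)
```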
26) What is the purpose of a Partitioning Algorithm?
=Partitioning algorithms (such as k-means and k-medoids) divide a set of n objects into k
clusters so that objects within a cluster are similar to one another and dissimilar to objects
in other clusters, typically by optimizing an objective such as the sum of squared distances
to the cluster centers.
27) Expand CLARA and ROCK.
= CLARA: Clustering LARge Applications
ROCK: RObust Clustering using linKs
28) Expand BIRCH and DBSCAN.
= BIRCH (balanced iterative reducing and clustering using hierarchies)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

29) Expand OPTICS and DENCLUE.


= OPTICS stands for Ordering Points To Identify the Clustering Structure.
DENCLUE stands for DENsity-based CLUstEring.
30) What is STING clustering.
= STING (Statistical Information Grid) is a grid-based clustering technique. It uses a
multidimensional grid data structure that quantizes space into a finite number of cells.
31) What is grid based clustering? Give examples.
= Grid-based clustering quantizes the object space into a finite number of cells that form a
grid structure, and performs clustering on the grid rather than on the individual data points;
it focuses on the value space surrounding the data points. Examples: STING, WaveCluster,
CLIQUE.
