Data Science
2 marks questions
1) Define data mining.
=Data Mining refers to extracting or mining knowledge from large amounts of data.
2) What is knowledge discovery in databases?
=Knowledge Discovery in Databases is the process of identifying a valid, potentially useful and
ultimately understandable structure in data. This process involves selecting or sampling data from
a data warehouse, cleaning and preprocessing it, transforming and reducing it, applying a data
mining component to produce a structure, and then evaluating the derived structure.
3) List any four stages of KDD.
= Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern
evaluation.
4) What is Loosely Coupled DBMS and Tightly Coupled DBMS?
= Tightly Coupled DBMS: In a tightly coupled DBMS, the components or subsystems are highly
integrated and interact closely with each other.
Loosely Coupled DBMS: In a loosely coupled DBMS, the components or subsystems within the
system operate relatively independently and communicate with each other through standardized
interfaces or protocols.
5) What is prediction and description in Data Mining?
= In data mining, description refers to a summary or characterization of patterns, trends, or
relationships discovered in a dataset, while predictive data mining is used to make predictions
about future events based on historical data.
6) List any four discovery driven tasks.
=Clustering Analysis, Association rule mining, Anomaly detection, sequential pattern mining.
7) Define Support and Confidence.
= In data mining, support and confidence are key metrics used in association rule mining
Support :measures the frequency or occurrence of a particular itemset in a dataset.
Confidence :measures the reliability or strength of an association rule between two itemsets.
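The two metrics can be illustrated with a small sketch; the transactions and the rule {milk} -> {bread} below are made-up toy data, not from the notes.

```python
# Toy transactions (each a set of items bought together).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Strength of the rule A -> B: support(A union B) / support(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # ≈ 0.667
```

Milk and bread co-occur in 2 of 4 transactions (support 0.5); milk appears in 3, of which 2 also contain bread (confidence 2/3).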
8) Write the objectives of clustering.
= Identifying Natural Groupings
Understanding Data Distribution
Data Reduction
Anomaly Detection
9) Define Rough Set.
= In data mining, rough set theory is a mathematical framework used for data analysis and
knowledge discovery, particularly in dealing with uncertainty and vagueness in data.
10) What is sequence mining and spatial data mining?
= Sequential data mining involves the extraction of patterns or knowledge from sequential data,
where the order of data points is crucial. It's commonly used in fields like natural language
processing
Spatial data mining involves the extraction of patterns or knowledge from spatial data, which
typically includes geographical or spatial information. This can involve analyzing data such as
maps, satellite imagery, GPS data, and geographical databases to discover trends, relationships,
or anomalies within spatially distributed data.
11) Write the subtasks of web mining.
= Web Content Mining:
Web Structure Mining:
Web Usage Mining:
Web Data Integration:
Web Information Retrieval:
12) Expand KDT and IE.
=KDT - Knowledge Discovery in Text (mining knowledge from textual databases)
IE - Information Extraction
13) What is web mining and text mining?
= Web mining is the process of extracting useful information or patterns from the World Wide
Web. It involves techniques from data mining, machine learning, and statistics to analyze web
data, including web content, structure, and usage logs. Web mining can be used for various
purposes such as improving search engine performance, understanding user behavior, and
extracting market intelligence.
Text mining, also known as text analytics, is the process of deriving high-quality information
from textual data. It involves analyzing unstructured text data to discover patterns, trends, and
insights. Text mining techniques typically include natural language processing (NLP), machine
learning, and statistical analysis to extract valuable information from large volumes of text.
14) List any two issues and challenges in data mining.
= Data Quality: Data mining heavily relies on the quality of the data being analyzed. Poor-
quality data, such as missing values, inconsistencies, or inaccuracies, can significantly impact the
results and conclusions drawn from the analysis.
Overfitting: Overfitting occurs when a data mining model captures noise or random fluctuations
in the training data rather than the underlying patterns.
15) List the two application areas of data mining.
= Data mining in Education: ...
Data Mining in Healthcare: ...
Data Mining in Fraud Detection. ...
Data Mining in Lie Detection. ...
Data Mining in Market Basket Analysis.
Unit 2
2 marks questions
1) What is data warehouse?
= A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema and that usually resides at a single site.
2) How are organizations using the information from data warehouses?
= A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its
analytical capabilities allow organizations to derive valuable business insights from their data to
improve decision-making.
11) List any two benefits of Business analyst by having data warehouse.
=Quickly analyze data for various business applications.
Improve decision-making speed and efficiency.
Maintain the accuracy and reliability of data.
Reduce costs related to data storage and management.
12) List any two data warehouse models from the architecture point of view.
=1) Enterprise warehouse
2) Data mart
3) Virtual warehouse
13) List any four back-end tools and utilities included in data warehouse.
=Data extraction, data cleaning, data transformation, and load and refresh utilities.
14) Define data cleaning and data integration.
=Data cleaning is the process of correcting or deleting inaccurate, corrupted, improperly
formatted, duplicated, or incomplete data from a dataset.
Data Integration is a data preprocessing technique that combines data from multiple
heterogeneous data sources into a coherent data store and provides a unified view of the data.
15) Define data transformation and data reduction.
=Data transformation is a technique used to convert the raw data into a suitable format that
efficiently eases data mining and retrieves strategic information. Data transformation includes
data cleaning techniques and a data reduction technique to convert the data into the appropriate
form.
Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information.
16) What is Dimensionality Reduction? Mention any one method used for Dimensionality
Reduction.
=Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar information."
Principal component analysis (PCA)
Missing value ratio
17) List any two methods for data discretization.
= There are two forms of data discretization first is supervised discretization, and the second is
unsupervised discretization.
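A common unsupervised method is equal-width binning; this sketch (toy age values, an arbitrary choice of k=3 bins) is illustrative only.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k equal-width intervals.

    Assumes the values are not all identical (width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [18, 22, 25, 30, 41, 45, 60, 64]
print(equal_width_bins(ages, 3))  # [0, 0, 0, 0, 1, 1, 2, 2]
```

With range 18..64 the bin width is about 15.3, so ages up to ~33 land in bin 0, up to ~48 in bin 1, and the rest in bin 2.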
18) Write the difference between Operational database and Data Warehouse?
=
Operational database systems are mostly concerned with current data; data warehousing systems
are typically concerned with historical data.
Relational (operational) databases are built for On-Line Transaction Processing (OLTP); a data
warehouse is designed for On-Line Analytical Processing (OLAP).
Operational database systems are generally application-oriented, while data warehouses are
generally subject-oriented.
There are mainly two approaches to designing data marts: the dependent approach (data marts
built from an existing data warehouse) and the independent approach (stand-alone data marts
built directly from source systems).
Multidimensional histograms are used in data science for analyzing and visualizing the
distribution of multidimensional data.
Unit IV
2 marks questions
1)What is Classification?
=Classification is a process of categorizing data or objects into predefined classes or categories
based on their features or attributes.
2) What is prediction?
=In data science, prediction refers to the process of using data and statistical algorithms to make
informed guesses about future outcomes or trends based on historical data.
3) What is regression Analysis?
=Regression analysis is a statistical method for modelling the relationship between a dependent
(target) variable and one or more independent (predictor) variables.
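For the simplest case (one predictor), the line y = a + b*x can be fitted by ordinary least squares; the data points below are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx  # intercept passes the line through the means
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]  # exactly y = 2x
a, b = fit_line(xs, ys)
print(a, b)  # 0.0 2.0
```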
4) Write steps involved in data classification?
=1. Perform A Risk Assessment For Sensitive Data
2. Establish A Data Classification Policy
3. Categorize The Types Of Data
4. Identify Data Locations
5. Identify And Classify Data
6. Use Results To Improve Security And Compliance
5) What is supervised learning?
=Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means the input data is already tagged with the correct output.
6) What is unsupervised learning?
=Unsupervised learning is a type of machine learning that learns from unlabeled data. This
means that the data does not have any pre-existing labels or categories. The goal of unsupervised
learning is to discover patterns and relationships in the data without any explicit guidance.
7) What is correlation Analysis?
=Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Correlation analysis calculates
the level of change in one variable due to the change in the other.
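The most common such measure is the Pearson correlation coefficient; this sketch uses made-up toy values (near +1 means strong positive linear association, near -1 strong negative, near 0 none).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfect positive)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (perfect negative)
```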
8) What is Decision tree induction?
=The goal of decision tree induction is to build a model that can accurately predict the outcome
of a given event, based on the values of the attributes in the dataset.
9) What is Decision tree?
=Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
10) What is tree pruning?
=Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
11) What is the use of tree selection measures?
=Attribute (tree) selection measures provide a heuristic for choosing the splitting criterion that
best separates a given data partition into individual classes; the attribute that scores best on the
measure is chosen as the test at that node.
12) Expand CART, ID3
=CART refers to Classification And Regression Trees.
ID3 refers to Iterative Dichotomizer 3.
13) What is splitting criterion? List distinct splitting criterion values.
=The splitting criterion indicates which attribute to test at a node and which branches to grow
from it, chosen so that the resulting partitions are as pure as possible. The distinct cases are:
(a) the attribute is discrete-valued, with one branch per known value; (b) the attribute is
continuous-valued, split as A <= split_point and A > split_point; (c) the attribute is
discrete-valued and a binary tree is required, split as A in S_A or A not in S_A.
14) What is attribute selection measure? List popular attribute measures.
=Attribute selection measure (ASM) is a criterion used in decision tree algorithms to evaluate the
usefulness of different attributes for splitting a dataset. Popular measures are Information Gain,
Gain Ratio, and Gini Index.
15) What is information Gain? Write the formula to find expected information needed to
classify a tuple.
= Information Gain is the attribute selection measure that is used to find/select the best attribute in
a dataset or used to find the root node. The attribute with the highest information gain value is
selected as a root node for splitting criteria.
Gain(A) = Info(D) − Info_A(D), where Gain(A) is the information gain of attribute A,
Info(D) = −Σ_{i=1}^{m} p_i log2(p_i) is the entropy (expected information needed to classify a
tuple in D), and Info_A(D) = Σ_{j=1}^{v} (|Dj|/|D|) × Info(Dj) is the expected information still
needed after partitioning D on A.
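The quantities Info(D), Info_A(D) and Gain(A) can be computed directly from class counts. The counts below (9 'yes' / 5 'no', and the three branch partitions) follow the classic play-tennis toy example and are illustrative only.

```python
from math import log2

def info(counts):
    """Entropy Info(D) = -sum p_i * log2(p_i) over class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_attr(partitions):
    """Info_A(D): entropy after splitting on A, weighted by partition size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)

D = [9, 5]                        # 9 'yes' tuples, 5 'no' tuples
parts = [[2, 3], [4, 0], [3, 2]]  # class counts in each branch of attribute A
gain = info(D) - info_attr(parts)
print(round(info(D), 3), round(gain, 3))  # ≈ 0.94 and ≈ 0.247
```

The pure branch [4, 0] contributes zero entropy, which is what drives the gain of about 0.247 bits for this attribute.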
16) What is entropy?
=Entropy is the measure of the degree of randomness or uncertainty in the dataset. In the case of
classifications, It measures the randomness based on the distribution of class labels in the dataset.
17) Define gain ratio. Write formula.
= Gain Ratio is a measure that takes into account both the information gain and the number of
outcomes of a feature to determine the best feature to split on.
GainRatio(A) = Gain(A) / SplitInfo_A(D), where SplitInfo_A(D) = −Σ_{j=1}^{v} (|Dj|/|D|) log2(|Dj|/|D|).
18) Define Gini index. Write formula.
=The Gini Index is a measure of impurity or inequality used in statistical and economic settings.
In machine learning, it is used as an impurity measure in decision tree algorithms (e.g. CART) for
classification tasks. It measures the probability of a randomly chosen tuple being misclassified,
and its value ranges from 0 (perfectly pure) to 1 (perfectly impure).
Gini(D) = 1 − Σ_{i=1}^{m} p_i², where p_i is the probability that a tuple in D belongs to class C_i.
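The formula is a one-liner over class counts; the counts below are made-up examples (for two classes the maximum impurity is 0.5).

```python
def gini(counts):
    """Gini index Gini(D) = 1 - sum p_i^2 over class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([10, 0]))  # 0.0   -> perfectly pure partition
print(gini([5, 5]))   # 0.5   -> maximally impure for two classes
print(gini([9, 5]))   # ≈ 0.459
```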
19) List the use of MDL.
=The Minimum Description Length (MDL) principle is used in decision tree pruning: the best
pruned tree is the one requiring the fewest bits to encode both the tree and its exceptions, which
favours simpler trees and reduces overfitting.
20) Write two approaches to tree pruning.
= Pre-pruning Approach
Post-pruning Approach