CC Unit - 4 Imp Questions
Unit – 4
Important questions
1. What is data preprocessing? Explain the steps involved in data preprocessing.
Data preprocessing is the process of transforming raw data into an understandable format. The quality of the data should be checked before applying machine learning or data mining algorithms.
Steps Involved in Data Preprocessing:
• Data Cleaning: The data can have many irrelevant and missing parts. Cleaning handles missing values, noisy data, and inconsistencies.
• Data Transformation: This step transforms the data into forms appropriate for the mining process, for example through normalization, aggregation, or discretization.
• Data Reduction: Since data mining handles huge amounts of data, this step obtains a reduced representation of the dataset that is much smaller in volume yet produces the same (or nearly the same) analytical results.
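As a minimal sketch of these three steps, assuming a small pandas DataFrame (the column names and values here are invented for illustration):

```python
import pandas as pd

# Toy data with a missing value and an irrelevant column (hypothetical example)
df = pd.DataFrame({
    "amount": [100.0, 250.0, None, 80.0, 9000.0],
    "age": [25, 41, 33, 29, 52],
    "row_id": [1, 2, 3, 4, 5],          # irrelevant for mining
})

# 1. Data cleaning: drop the irrelevant column, impute the missing amount
df = df.drop(columns=["row_id"])
df["amount"] = df["amount"].fillna(df["amount"].median())

# 2. Data transformation: min-max normalization into [0, 1]
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (df["amount"].max() - df["amount"].min())

# 3. Data reduction: keep a representative random sample of the rows
reduced = df.sample(frac=0.6, random_state=42)
print(reduced)
```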
2. What is sampling? Explain the types of data elements
Sampling is the process of selecting a representative subset of historical data on which to build an analytical model, rather than processing the full dataset.
Big data encompasses a wide variety of data types, including the following:
• structured data, such as transactions and financial records;
• unstructured data, such as text, documents, and multimedia files; and
• semi-structured data, such as web server logs and streaming data from sensors.
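A minimal sketch of simple random sampling versus stratified sampling with pandas (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical transaction data with a class label
df = pd.DataFrame({
    "amount": [10, 200, 35, 4000, 60, 75, 120, 9000],
    "fraud":  [0,  0,   0,  1,    0,  0,  0,   1],
})

# Simple random sample: 50% of the rows, chosen uniformly
simple = df.sample(frac=0.5, random_state=0)

# Stratified sample: 50% within each class, preserving the fraud rate
stratified = df.groupby("fraud", group_keys=False).sample(frac=0.5, random_state=0)
print(stratified)
```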
3. Explain outlier detection and treatment
An outlier is a piece of data or observation that deviates drastically from the norm or average of the data set. Outlier detection, therefore, is the process of detecting, and subsequently treating or excluding, outliers in a given set of data.
Ways to deal with outliers in data:
• Set up a filter in your testing tool. Even though this has a small cost, filtering out outliers is worth it.
• Remove or change outliers during post-test analysis.
• Change the value of outliers, e.g., cap them at a chosen threshold.
• Consider the underlying distribution.
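As one concrete detection-and-treatment combination, here is a minimal sketch using the interquartile-range (IQR) rule with capping; the 1.5×IQR fences are a common convention, not taken from the source:

```python
import numpy as np

values = np.array([12.0, 15.0, 14.0, 13.0, 250.0, 16.0, 11.0])  # 250 looks extreme

# Detection: flag points outside the Tukey fences q1 - 1.5*IQR and q3 + 1.5*IQR
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Treatment: cap (truncate) outliers at the fence values instead of removing them
capped = np.clip(values, lower, upper)
print(values[is_outlier], capped)
```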
• Missing value - A missing value occurs when there is no data value for a variable in an observation. The phenomenon of missing values is universal in clinical research involving big data.
• Categorization - Categorization is the exercise of creating meaningful categories for a
particular variable/feature in the dataset, for better inference/understanding.
• Weight of evidence (WOE) coding - The weight of evidence approach means that you use a combination of information from several independent sources to give sufficient evidence to fulfil an information requirement. In analytics, WOE coding recodes each category of a variable as the log ratio of its share of good outcomes to its share of bad outcomes, measuring the category's predictive power (see the sketch after this list).
• Variable selection - Variable selection is the process of testing a collection of candidate model variables for significance during model training and retaining the significant ones. Candidate model variables are also known as independent variables, predictors, attributes, model factors, covariates, regressors, features, or characteristics.
• Segmentation - Segmentation refers to splitting data into groups according to your company's needs, in order to refine your analyses based on a defined context. In concrete terms, a segment enables you to filter your analyses based on certain elements (single or combined).
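A minimal sketch of weight of evidence coding for one categorical variable, assuming a binary good/bad target (the category counts below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical counts of good and bad outcomes per category
tbl = pd.DataFrame({
    "category": ["A", "B", "C"],
    "good":     [400, 300, 100],
    "bad":      [20,  50,  80],
})

# Distribution of goods and bads across the categories
dist_good = tbl["good"] / tbl["good"].sum()
dist_bad = tbl["bad"] / tbl["bad"].sum()

# WOE per category: ln(% of goods / % of bads)
tbl["woe"] = np.log(dist_good / dist_bad)

# Information value summarizes the variable's overall predictive power
iv = ((dist_good - dist_bad) * tbl["woe"]).sum()
print(tbl, f"\nIV = {iv:.3f}")
```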
6. Explain predictive analytics briefly
Predictive analytics is the use of data, statistical algorithms and machine learning techniques to
identify the likelihood of future outcomes based on historical data. The goal is to go beyond
knowing what has happened to providing a best assessment of what will happen in the future.
Though predictive analytics has been around for decades, it's a technology whose time has
come. More and more organizations are turning to predictive analytics to increase their bottom
line and competitive advantage. Why now?
• Growing volumes and types of data, and more interest in using data to produce valuable
insights.
• Faster, cheaper computers.
• Easier-to-use software.
• Tougher economic conditions and a need for competitive differentiation.
With interactive and easy-to-use software becoming more prevalent, predictive analytics is no
longer just the domain of mathematicians and statisticians. Business analysts and line-of-
business experts are using these technologies as well.
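A minimal sketch of the core idea, fitting a model on historical data and then scoring the likelihood of a future outcome; scikit-learn is assumed to be available, and the feature values are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Historical data: [transaction amount, hour of day] -> fraud yes/no (invented)
X_hist = np.array([[20, 14], [3500, 3], [45, 10], [5000, 2], [60, 16], [4200, 4]])
y_hist = np.array([0, 1, 0, 1, 0, 1])

# Learn the relationship between past inputs and past outcomes
model = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)

# Score a new, unseen transaction: estimated probability it will be fraudulent
new_tx = np.array([[3900, 1]])
print(model.predict_proba(new_tx)[0, 1])
```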
7. Explain descriptive analytics briefly
Descriptive analytics is the interpretation of historical data to better understand changes that
have occurred in a business. Descriptive analytics describes the use of a range of historic data to
draw comparisons. Most commonly reported financial metrics are a product of descriptive
analytics, for example, year-over-year pricing changes, month-over-month sales growth, the
number of users, or the total revenue per subscriber. These measures all describe what has
occurred in a business during a set period.
Key takeaways:
• Using a range of historical data and benchmarking, decision-makers obtain a holistic view of performance and trends on which to base business strategy.
• Descriptive analytics can help to identify areas of strength and weakness in an organization.
• Descriptive analytics is now being used in conjunction with newer analytics, such as predictive and prescriptive analytics.
In its simplest form, descriptive analytics answers the question, "What happened?"
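A minimal sketch of computing one such descriptive metric, month-over-month sales growth, with pandas (the sales figures are invented):

```python
import pandas as pd

# Hypothetical monthly sales figures
sales = pd.Series(
    [120_000, 126_000, 118_000, 131_000],
    index=pd.period_range("2023-01", periods=4, freq="M"),
)

# Month-over-month growth: percentage change versus the previous month
mom_growth = sales.pct_change() * 100
print(mom_growth.round(1))
```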
• Linear regression - Linear regression analysis is used to predict the value of a variable
based on the value of another variable. The variable you want to predict is called the
dependent variable. The variable you are using to predict the other variable's value is
called the independent variable.
• Decision tree - A Decision Tree is an algorithm used for supervised learning problems such as
classification or regression. A decision tree or a classification tree is a tree in which each internal
(nonleaf) node is labeled with an input feature.
• Neural networks - A neural network is a series of algorithms that endeavors to recognize
underlying relationships in a set of data through a process that mimics the way the human brain
operates. In this sense, neural networks refer to systems of neurons, either organic or artificial
in nature.
• Association rule - Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships (a sketch of support and confidence follows this list).
• Sequence rule - A sequence rule consists of a previous sequence in the rule body that
leads to a consecutive item set in the rule head. The consecutive item set occurs after a
particular period of time. Sequence rules, sequences, and item sets have various
characteristics.
• Social network learning - Social networking big data is a collection of extremely large data sets with great diversity in social networks. Social networking big data is also a core component of social influence analysis and security analysis.
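As a minimal sketch of the support and confidence criteria for one if-then rule, using invented market-basket transactions:

```python
# Hypothetical transactions (market-basket style)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

# Rule to evaluate: "if bread, then milk"
antecedent, consequent = {"bread"}, {"milk"}
n = len(transactions)

both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)

support = both / n        # how often the whole rule occurs in the data
confidence = both / ante  # how often the rule holds when its body occurs
print(f"support={support:.2f}, confidence={confidence:.2f}")
```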
10. Explain relational neighbor classification
A relational model is based on the idea that the behavior of connected nodes is correlated, meaning that connected nodes have a propensity to belong to the same class (homophily). The relational neighbor classifier, in particular, predicts a node's class based on its neighboring nodes and adjacent edges.
In the money-mule example, the transfers dataset consists of transactions between different accounts, and the account info data indicates which of these accounts are money mules.
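A minimal sketch of the relational neighbor classifier as a weighted majority vote over a node's labeled neighbors; the transfer graph and labels below are invented:

```python
from collections import defaultdict

# Hypothetical transfer graph: account -> list of (neighbor, edge weight)
graph = {
    "acct1": [("acct2", 3.0), ("acct3", 1.0)],
    "acct2": [("acct1", 3.0), ("acct4", 2.0)],
    "acct3": [("acct1", 1.0)],
    "acct4": [("acct2", 2.0)],
}

# Known labels from the account info data (1 = money mule, 0 = legitimate)
labels = {"acct2": 1, "acct3": 0, "acct4": 1}

def relational_neighbor(node):
    """Predict a node's class as the weight-dominant class among its labeled neighbors."""
    score = defaultdict(float)
    for neighbor, weight in graph[node]:
        if neighbor in labels:
            score[labels[neighbor]] += weight
    return max(score, key=score.get)

print(relational_neighbor("acct1"))  # acct2 (mule, weight 3) outweighs acct3 (legit, weight 1)
```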