
Data Mining Assignment Help

This document discusses various topics related to data mining and regression analysis including: - Training and test sets are used to train models on known data and test them on unknown data to assess real-world performance. 75% of data should be allocated to the training set. - Predictors should be removed if they provide no value, replicate other predictors, or have many missing values. - Stratified sampling better represents scenarios by taking random samples within pre-defined groups, unlike simple random sampling. - Model tuning through hyperparameter optimization is necessary to find the combination that minimizes loss and improves results. - The predictive model building process involves data splitting, resampling, model selection, parameter tuning,

For any Homework related queries, call us at- +1 678 648 4277

You can mail us at:- [email protected] or


reach us at- https://www.statisticshomeworksolver.com/

Data Mining Assignment Help


Data Mining and Regression

These questions cover a wide range of data mining and
regression sub-topics. They involve concepts such as:

• Training and test sets
• Data reduction
• Sampling
• Data splitting and re-sampling
• Regression

Training and Test Sets


 
What are the training set and the test set used for, respectively?
If a dataset is split by assigning 75% to one set and 25% to the
other, should the 75% or the 25% go to the training set?

Ans: The training set is used to fit the model on a known sample
so that the model can learn its parameters. The test set is used
to evaluate the model on out-of-sample examples that were not used
in training, in order to assess its real-world performance. The
75% share should go to the training set so that the parameters can
be estimated reliably.
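
The 75/25 split described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the seed, and the placeholder data are assumptions made for the example, not part of the original answer.

```python
import random

def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))            # placeholder observations
train, test = train_test_split(data)
print(len(train), len(test))       # sizes of the two sets
```

Shuffling before cutting matters: if the rows are ordered (for example by date or by class), taking the first 75% without shuffling would give a training set that does not resemble the test set.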
Data Reduction
 
Removing predictors is generally known as a data reduction
technique. Explain under what conditions we should consider
removing predictors.
Ans: Predictors can be removed under conditions such as:
a) The predictor adds no logical value to the problem, for
example a name or a serial number.
b) The predictor replicates information that is already covered
by another predictor.
c) The predictor has many missing values, which may lead to a
bad fit.
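
Two of these conditions, (b) and (c), can be checked mechanically, and a constant column is an easy special case of (a). The sketch below is one possible screening helper, not a standard routine: the name `flag_predictors`, the 50% missing-value threshold, and the toy columns are all assumptions. Identifier-like columns such as names still need human judgment.

```python
def flag_predictors(columns, max_missing=0.5):
    """Return {column name: reason} for predictors worth reviewing.

    `columns` maps a name to a list of values; None marks a missing value.
    """
    flagged = {}
    seen = {}  # tuple of values -> name of the first column holding them
    for name, values in columns.items():
        missing_share = sum(v is None for v in values) / len(values)
        observed = {v for v in values if v is not None}
        key = tuple(values)
        if missing_share > max_missing:
            flagged[name] = "mostly missing"
        elif len(observed) <= 1:
            flagged[name] = "constant"
        elif key in seen:
            flagged[name] = "duplicate of " + seen[key]
        else:
            seen[key] = name
    return flagged

columns = {
    "age":      [23, 35, 41, 52],
    "age_copy": [23, 35, 41, 52],
    "country":  ["US", "US", "US", "US"],
    "income":   [None, None, None, 40000],
}
print(flag_predictors(columns))
```

Exact duplication is the simplest form of redundancy; highly correlated predictors carry overlapping information too, but detecting them needs a correlation check that is left out of this sketch.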

Sampling
 
What are the differences between simple random sampling and
stratified random sampling?

Ans: Simple random sampling takes k out of n objects at random.
In this sampling scheme, every possible sample of size k has an
equal probability of being selected.
In stratified sampling, the population is divided into
well-defined groups (strata), simple random sampling is done
inside each stratum, and the results are combined into the
sample. This is, in most cases, a better way to represent the
actual scenario, especially in the case of class imbalance.
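
The contrast can be made concrete with a small imbalanced population. The sketch below assumes proportional allocation (the same sampling fraction in every stratum), which is only one of several allocation schemes; the function names and the 90/10 toy population are illustrative.

```python
import random
from collections import defaultdict

def simple_random_sample(items, k, rng):
    """Draw k items so that every subset of size k is equally likely."""
    return rng.sample(items, k)

def stratified_sample(items, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    groups = defaultdict(list)
    for item in items:
        groups[stratum_of(item)].append(item)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)
population = [("majority", i) for i in range(90)] + [("minority", i) for i in range(10)]

srs = simple_random_sample(population, 20, rng)
strat = stratified_sample(population, lambda item: item[0], 0.2, rng)

count = lambda sample: sum(1 for label, _ in sample if label == "minority")
print("minority in simple random sample:", count(srs))
print("minority in stratified sample:  ", count(strat))
```

The stratified sample always contains 2 minority items out of 20, matching the 10% share in the population, while the count in the simple random sample varies from draw to draw and can be zero.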
Why is model tuning necessary for predictive modelling?

Ans: Hyperparameters are crucial because they control the overall
behaviour of a machine learning model. The goal is to find the
combination of hyperparameters that minimizes a predefined loss
function and so gives better results. Model tuning is therefore
needed to obtain the best model for the problem at hand. Any
number of models can be fitted for a given task, but to get the
best out of each one, its hyperparameters must be tuned.
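
A grid search is the simplest way to carry out this tuning. The sketch below is an assumption-laden toy: it uses one-dimensional ridge regression without an intercept as the model, the penalty `lam` as the single hyperparameter, synthetic data, and mean squared error on a held-out validation set as the loss. None of these choices come from the original answer.

```python
import random

def fit_ridge_slope(xs, ys, lam):
    """1-D ridge regression without intercept.

    Minimizes sum((y - w*x)^2) + lam * w^2, whose solution is
    w = sum(x*y) / (sum(x^2) + lam).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(xs, ys, w):
    """Mean squared error of the prediction w*x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: y is roughly 2*x plus noise (illustrative only).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]
x_train, y_train = xs[:40], ys[:40]
x_val, y_val = xs[40:], ys[40:]

# Candidate hyperparameter values; each is scored on the validation set.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
val_loss = {lam: mse(x_val, y_val, fit_ridge_slope(x_train, y_train, lam))
            for lam in grid}
best_lam = min(val_loss, key=val_loss.get)
print("best lam:", best_lam, "validation MSE:", val_loss[best_lam])
```

The model parameter `w` is learned from the training data for each candidate, while the hyperparameter `lam` is chosen by comparing losses on data the fit never saw. That separation is what distinguishes tuning from ordinary fitting.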

Predictive Model Building


 
Use your own words to describe the process of building
predictive models considering data splitting and data
resampling (referring to the graph below).

Ans: The steps of model building are outlined below:

Step 1: Select/get the data
Step 2: Data cleaning/data pre-processing
Step 3: Data splitting into training and test sets
Step 4: Split the training set into training and validation sets
Step 5: Model selection and model development (training)
Step 6: Parameter tuning and optimization (on the validation set)
Step 7: Testing and model performance evaluation
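
The seven steps can be traced end to end on a toy problem. The sketch below makes several assumptions for the sake of a short, self-contained example: the data are synthetic, the two candidate models (a mean predictor and a least-squares line) have no hyperparameters, so Step 6 reduces to comparing the candidates on the validation set, and roughly a quarter of the rows is held out at each split.

```python
import math
import random

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Simple least-squares line y = intercept + slope * x."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def unzip(rows):
    return [r[0] for r in rows], [r[1] for r in rows]

# Step 1: get data (synthetic, y is roughly 3x + 2, two rows damaged on purpose)
rng = random.Random(1)
raw = [(float(x), 3.0 * x + 2.0 + rng.gauss(0, 1)) for x in range(100)]
raw[5] = (5.0, None)
raw[50] = (None, 1.0)

# Step 2: cleaning, here simply dropping rows with missing values
clean = [(x, y) for x, y in raw if x is not None and y is not None]

# Step 3: split into test and remaining data
rng.shuffle(clean)
n_test = len(clean) // 4
test, rest = clean[:n_test], clean[n_test:]

# Step 4: split the remaining data into validation and training sets
n_val = len(rest) // 4
val, train = rest[:n_val], rest[n_val:]
x_train, y_train = unzip(train)
x_val, y_val = unzip(val)
x_test, y_test = unzip(test)

# Step 5: train every candidate model on the training set
candidates = {"mean": fit_mean, "line": fit_line}
fitted = {name: fit(x_train, y_train) for name, fit in candidates.items()}

# Step 6: compare candidates on the validation set and keep the best
val_rmse = {name: rmse(y_val, [model(x) for x in x_val])
            for name, model in fitted.items()}
best = min(val_rmse, key=val_rmse.get)

# Step 7: report performance once, on the untouched test set
test_rmse = rmse(y_test, [fitted[best](x) for x in x_test])
print(best, round(test_rmse, 3))
```

The test set is used exactly once, after the model has been chosen. Reusing it to pick between candidates would make the final performance estimate optimistic.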
 
Linear Regression
 
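List three linear regression models we learned in class.
What metrics can be used to compare the linear model
predictive performance?
Ans: The regression models are ordinary least squares (OLS)
regression, kernel regression, k-NN regression, and the MARS
model. Of these, only OLS is linear in the strict sense; kernel
regression, k-NN regression, and MARS are nonlinear methods.
Predictive performance can be compared using metrics such as
the root mean squared error (RMSE), the mean absolute error
(MAE), and R-squared, computed on held-out data.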
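
For the second part of the question, three metrics commonly used to compare regression models are RMSE, MAE, and R-squared. The sketch below implements them from their textbook definitions; the function names and the four toy predictions are assumptions for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(round(rmse(y_true, y_pred), 4))       # 0.7071
print(round(mae(y_true, y_pred), 4))        # 0.5
print(round(r_squared(y_true, y_pred), 4))  # 0.9
```

Lower RMSE and MAE are better, while higher R-squared is better. For a fair comparison, all models should be scored on the same held-out observations.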
 
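What are the two tuning parameters associated with the
Multivariate Adaptive Regression Splines (MARS) model?
How are the optimal values of the tuning parameters
determined?

Ans: The two parameters are degree (the maximum degree of
interaction between terms) and nprune (the number of terms kept
in the model after pruning). Both are chosen by comparing model
performance on a validation set, typically over a grid of
candidate values or with cross-validation.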
 
Define the K-Nearest Neighbours (KNN) regression method
and indicate whether pre-processing predictors is needed
prior to performing KNN.

Ans: KNN regression is a non-parametric method that, in an
intuitive manner, approximates the association between the
independent variables and a continuous outcome by averaging
the observations in the same neighbourhood. The size of the
neighbourhood (k) is set by the analyst or chosen using
cross-validation to select the size that minimises the
mean squared error. Pre-processing is needed: the predictors
must be numeric and on comparable scales so that distances
between observations are meaningful. So we centre and scale
the data before fitting.
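
The role of centring and scaling is easiest to see with predictors on very different scales. The sketch below is a minimal pure-Python KNN regressor using Euclidean distance; the helper names and the toy age/income data are assumptions for illustration.

```python
import math

def standardize(rows):
    """Centre and scale each column; return (scaled_rows, means, sds)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(d)]
    scaled = [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]
    return scaled, means, sds

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k):
    """Average the outcomes of the k training points closest to `query`."""
    nearest = sorted((euclidean(x, query), y) for x, y in zip(train_x, train_y))
    return sum(y for _, y in nearest[:k]) / k

# Toy data: [age in years, income in dollars] -> outcome (illustrative only)
train_x = [[25, 30000], [30, 32000], [45, 80000], [50, 82000], [35, 50000]]
train_y = [10.0, 12.0, 30.0, 32.0, 20.0]

# Learn the scaling from the training data, then apply it to the query too.
scaled_x, means, sds = standardize(train_x)
query = [28, 31000]
scaled_query = [(q - m) / s for q, m, s in zip(query, means, sds)]

prediction = knn_predict(scaled_x, train_y, scaled_query, k=2)
print(prediction)  # 11.0, the mean outcome of the two closest neighbours
```

Without scaling, income would dominate the distance simply because its numbers are larger, and age would have almost no influence on which neighbours are chosen. The means and standard deviations come from the training data only and are reused for the query point.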
