
DATA MINING AND NEURAL NETWORK
Assignment

Task 1

Data completeness is defined as the extent to which all of the data expected in a data set is actually present, and it is tracked within the data quality system. The percentage of complete entries is the usual metric for data completeness.

For example, a column of 500 fields with 100 missing values has a completeness of 80%. Depending on your industry, missing 20% of entries could cost hundreds of thousands of dollars in lost prospects and leads.
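As a rough sketch of this calculation (the pandas DataFrame and the "phone" column below are made up for illustration), completeness is simply the share of non-missing entries in a column:

```python
import numpy as np
import pandas as pd

# Toy column: 100 of 500 entries are missing.
df = pd.DataFrame({"phone": [np.nan if i < 100 else "555-0100" for i in range(500)]})

# Completeness = share of non-missing entries.
completeness = df["phone"].notna().mean() * 100
print(f"Completeness: {completeness:.0f}%")  # 100 missing out of 500 -> 80%
```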

Data completeness, on the other hand, is not about having every field filled in. It is about figuring out which data is important and which is not: phone numbers, for example, may be required, while fax numbers are optional. It is all too tempting to overlook missing values and keep producing reports or carrying out activities even though data is missing. The result is tasks that produce bad outcomes (such as an email campaign that skipped missing last names and had to deal with duplicates), reports with false findings that affect legislation and important decisions, failed business strategies, and legal errors. Because of all this, the true objective of data completeness is not to have flawless, 100 percent complete data. It is to make sure that the information you need is valid, complete, reliable, and accessible. The technologies at your side, such as DME, will assist you in getting there.

There are five common ways to handle missing values:

 Delete the records which have missing values.
 Train a separate model to predict the missing values.
 Use statistical techniques to fill in the missing values.
 Transform the features.
 Apply a technique suited to time series (e.g. interpolation).

Delete the rows which have missing values and restructure the dataset.

In our example, the third record has three missing values, with only one field present. If we have a large dataset, it is probably better to delete this record, because otherwise we would have to estimate and fill in the missing values. This approach is only useful when you have enough data that no important information is lost. Most of the tools we looked at suggest that if more than 30–35 percent of a feature's values are missing, the feature itself should be removed.
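A minimal pandas sketch of this strategy (the column names, toy values, and the exact 35% cut-off below are illustrative assumptions, not taken from the original data):

```python
import numpy as np
import pandas as pd

# Toy dataset: 'feature_1' has too many missing values; the third row is mostly empty.
df = pd.DataFrame({
    "feature_1": [1.0, np.nan, np.nan, np.nan, 5.0],
    "feature_2": [10.0, 11.0, np.nan, 13.0, 14.0],
    "feature_3": [0.1, 0.2, np.nan, 0.4, 0.5],
})

# Drop whole features that are more than ~30-35% missing.
missing_share = df.isna().mean()
df = df.loc[:, missing_share <= 0.35]

# Drop the remaining rows that still contain missing values.
df = df.dropna().reset_index(drop=True)
print(df)
```
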
To predict the missing values, train a separate model.

Suppose we select some data from the time series above. There are two missing values in the first column (Feature 1). What if we train a separate model on the rows that have no missing values (Row 2 and Row 3), using the remaining columns as features and Feature 1 as the target? Is there anything wrong with that? This is essentially what the approach says: simply estimate the missing values from the data that is available.

Some packages, such as Random Forest implementations, handle missing values by dropping them or filling them with the median or mode. Decision tree algorithms (such as ID3) either ignore missing values or treat them as a separate category.
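A minimal sketch of this idea, assuming a toy DataFrame and using scikit-learn's RandomForestRegressor as the separate model (the column names and values are made up):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: feature_1 has missing entries we want to predict from feature_2 and feature_3.
df = pd.DataFrame({
    "feature_1": [2.0, np.nan, 4.1, np.nan, 6.0, 7.2],
    "feature_2": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
    "feature_3": [0.9, 1.4, 2.1, 2.4, 3.1, 3.6],
})

known = df["feature_1"].notna()
predictors = ["feature_2", "feature_3"]

# Fit a model on the rows where the target column is present...
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[known, predictors], df.loc[known, "feature_1"])

# ...and use it to fill in the missing entries.
df.loc[~known, "feature_1"] = model.predict(df.loc[~known, predictors])
print(df)
```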

Plot the results.

Z-score plotting: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the series.

After normalizing the data with the z-score, the plot shows the results.
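A small sketch of z-score normalization and plotting (the series below is synthetic; the original data is not reproduced here):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic series standing in for the original data.
x = np.random.default_rng(0).normal(loc=50, scale=10, size=200)

# Z-score normalization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

plt.plot(z)
plt.title("Z-score normalized series")
plt.xlabel("t")
plt.ylabel("z")
plt.show()
```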

Task 2

To make things easier, we assume we have already measured n samples of model errors e_i, i = 1, 2, ..., n. The uncertainty introduced by observation errors, as well as the approach used to compare the model and the observations, are not taken into account. We also take it for granted that the error sample is unbiased.
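A minimal sketch of collecting such an error sample and checking that it looks unbiased (the observations and predictions below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up observations and model predictions; the real data is not reproduced here.
observed = rng.normal(loc=20.0, scale=2.0, size=100)
predicted = observed + rng.normal(loc=0.0, scale=0.5, size=100)

# Error sample e_i = predicted_i - observed_i, i = 1, ..., n.
errors = predicted - observed

print("mean error (bias):", errors.mean())   # close to 0 if the sample is unbiased
print("error std:", errors.std(ddof=1))
```
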
The representation of each segment above shows that all attributes differ from one another in the time series analysis.

Task 3

The scatter diagram of g(t) versus g(t+1):

We choose the data from 2018 as a sample to calculate the predictors for g(t+1), the next day's values. This model's MSE comes out to be 5.917.
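A sketch of this lag-1 setup with a synthetic series standing in for the 2018 sample (so the MSE printed here will not match 5.917):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic daily series standing in for the 2018 data.
rng = np.random.default_rng(2)
g = np.cumsum(rng.normal(size=365)) + 50

# Lag-1 pairs: predict g(t+1) from g(t).
X = g[:-1].reshape(-1, 1)
y = g[1:]

model = LinearRegression().fit(X, y)
mse = mean_squared_error(y, model.predict(X))
print("MSE:", mse)

plt.scatter(X, y, s=10)
plt.xlabel("g(t)")
plt.ylabel("g(t+1)")
plt.title("Scatter diagram of g(t) vs g(t+1)")
plt.show()
```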

We try to fit linear models in so many difficult problem settings that we often have no reason to believe the true data-generating model is linear, or that the errors are Gaussian or homoscedastic. Hence a contemporary viewpoint: since the linear model is just a rough approximation, assess its prediction accuracy before deciding on its utility.

Focusing on prediction is a much more general idea than linear models. We'll come back to this later in the week, but for now, here's a quick recap: models are just approximations, and some methods don't even need an underlying model, so we assess prediction accuracy and use that to decide on a model's or method's utility.

Assume we have training data (X_i1, ..., X_ip, Y_i), i = 1, ..., n, which we use to estimate the regression coefficients β̂0, β̂1, ..., β̂p. When a new input X1, ..., Xp is presented, we must predict the corresponding Y with

Ŷ = β̂0 + β̂1 X1 + ... + β̂p Xp.

The test error, also known as prediction error, is defined as E(Y − Ŷ)², where the expectation is over everything that is random: the training data (X_i1, ..., X_ip, Y_i), i = 1, ..., n, and the test point (X1, ..., Xp, Y).

This was explained in the context of a linear model, but the concept of test error is the same for any method.

Often we want a precise estimate of our method's test error (e.g., for linear regression). Why? There are two primary goals:

Predictive assessment: get a firm grasp on the magnitude of the errors we can expect when making future predictions.

Model/method selection: choose among a variety of models or methods so as to minimize test error.

Suppose we use the observed training error, (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)², to estimate the test error of our method.

What's the problem with this? It is generally overly optimistic as an estimate of test error. After all, the parameters β̂0, β̂1, ..., β̂p were chosen precisely to make Ŷ_i close to Y_i, i = 1, ..., n, in the first place!

Moreover, the more complex and adaptive the method is, the more optimistic its training error is as an estimate of test error.
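A minimal sketch of estimating test error with a held-out set, using synthetic data and a deliberately misspecified linear model; the training error it prints is typically the more optimistic of the two numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy nonlinear truth that we approximate with a linear model.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.3 * X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))

print("training error:", train_error)        # usually the optimistic one
print("estimated test error:", test_error)
```
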
Task 4

The train, test, and sometimes tune sets must all be defined before the supervised learning process can begin. In the standard time-series prediction task the three sets are consecutive subsequences of the series (e.g. train: g_1, ..., g_f; tune: g_{f+1}, ..., g_{f+r}; test: g_{f+r+1}, ..., g_{f+r+pmax}, with f, r, pmax ∈ N). This is also known as a straightforward implementation of the walk-forward routine.

So that the data can be moved into the supervised setting as directly as possible, a prediction horizon of pm gives, for example, the following (input, target) pairs: train: (x_1, x_{1+pm}), ..., (x_{n−3pm}, x_{n−2pm}); tune: (x_{n−3pm+1}, x_{n−2pm+1}), ..., (x_{n−2pm}, x_{n−pm}); test: (x_{n−2pm+1}, x_{n−pm+1}), ..., (x_{n−pm}, x_n).
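A simple sketch of such a chronological train/tune/test split for a fixed prediction horizon (the helper function and split sizes below are illustrative, not the exact index scheme of the assignment):

```python
import numpy as np

def walk_forward_split(series, horizon, n_tune, n_test):
    """Chronologically split a series into train/tune/test (input, target) pairs
    for a fixed prediction horizon. Illustrative sketch only."""
    X = series[:-horizon]          # inputs x_t
    y = series[horizon:]           # targets x_{t+horizon}
    n_train = len(X) - n_tune - n_test
    train = (X[:n_train], y[:n_train])
    tune = (X[n_train:n_train + n_tune], y[n_train:n_train + n_tune])
    test = (X[n_train + n_tune:], y[n_train + n_tune:])
    return train, tune, test

g = np.arange(100, dtype=float)    # stand-in series
train, tune, test = walk_forward_split(g, horizon=5, n_tune=15, n_test=15)
print(len(train[0]), len(tune[0]), len(test[0]))  # 65 15 15
```
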
In linear regression, ordinary least squares residuals are often used to estimate the unknown true errors. Due to shrinkage and superimposed normality effects, these estimates can give a false impression of the true error distribution. RMOLS is a newer approach for improving moment estimation by appropriately rescaling the moment estimators obtained from least squares residuals. These RMOLS moments provide more accurate estimates of the skewness and kurtosis coefficients, as well as greater power for one form of normality measure. A Monte Carlo analysis using a number of random error distributions demonstrates these properties.

Calculate the mean squared error:

A regression line's mean squared error indicates how close the line is to a set of points. It does this by squaring the distances between the points and the regression line (these distances are the "errors"). Squaring removes any negative signs and gives larger deviations more weight. Because you are averaging a set of squared errors, it is called the mean squared error.

To determine the mean squared error for a set of X and Y values, follow these steps:

Find the regression line.
Plug the X values into the linear regression equation to find the new Y values (Y').
Subtract each new Y value from the original Y value to find the error.
Square the errors.
Add up all of the squared errors.
Calculate the mean.

Calculate the mean squared error for these points: (43,41), (44,45), (45,49), (46,47), (47,44).

Step 1: Find the regression line. Using an online calculator, the regression line is y = 9.2 + 0.8x.

Step 2: Find the new Y' values:
9.2 + 0.8(43) = 43.6
9.2 + 0.8(44) = 44.4
9.2 + 0.8(45) = 45.2
9.2 + 0.8(46) = 46
9.2 + 0.8(47) = 46.8

Step 3: Find the errors (Y – Y'):
41 – 43.6 = -2.6
45 – 44.4 = 0.6
49 – 45.2 = 3.8
47 – 46 = 1
44 – 46.8 = -2.8

Step 4: Square the errors:
(-2.6)² = 6.76
(0.6)² = 0.36
(3.8)² = 14.44
(1)² = 1
(-2.8)² = 7.84

Step 5: Add all of the squared errors up: 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4.

Step 6: Find the mean squared error: 30.4 / 5 = 6.08.
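A quick numpy sketch that reproduces the hand calculation above:

```python
import numpy as np

x = np.array([43, 44, 45, 46, 47], dtype=float)
y = np.array([41, 45, 49, 47, 44], dtype=float)

# Fit the least-squares regression line y = a + b*x.
b, a = np.polyfit(x, y, deg=1)          # slope, intercept
print(f"y = {a:.1f} + {b:.1f}x")        # y = 9.2 + 0.8x

# Mean squared error of the fitted line.
y_pred = a + b * x
mse = np.mean((y - y_pred) ** 2)
print("MSE:", mse)                      # 6.08
```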

Comments:

The smaller the mean squared error, the closer you are to finding the line of best fit. Depending on your data, it may be impossible to get a very small value for the mean squared error. For example, the data above is scattered wildly around the regression line, so 6.08 is as good as it gets (and the line found is, in fact, the line of best fit). Note that an online calculator was used to get the regression line; where the mean squared error really comes in handy is when you are finding an equation for the regression line by hand: you could try several equations, and the one that gave you the smallest mean squared error would be the line of best fit.

Sometimes a statistical model or estimator must be "tweaked" to get the best possible model or estimator. The MSE criterion is a tradeoff between (squared) bias and variance, MSE(T, θ) = E[(T − θ)²] = Var(T) + [Bias(T, θ)]², and the corresponding optimality notion is: "T is a minimum [MSE] estimator of θ if MSE(T, θ) ≤ MSE(T′, θ), where T′ is any alternative estimator of θ" (Panik).