Data Mining and Neural Network
Assignment
Task 1
Data completeness is defined as the extent to which all data in a data set is available in the data quality system. The percentage of incomplete entries is measured by their missing fields, and depending on your industry, missing 20% of entries could cost you dearly. Data completeness, on the other hand, isn't about having every field filled in. It's about figuring out what data is important and what isn't: phone numbers, for example, are required, but fax numbers are optional. It's all too common to carry out activities despite the fact that data is missing. The end result is tasks that produce bad results (such as an email campaign that skipped records with missing last names and had to deal with duplicates), reports that include false findings that affect legislation and important decisions, failed business strategies, and legal errors. As a result of all of this, the true objective of data completeness isn't to have flawless, 100 percent complete data. It's to make sure that the information you need is valid, complete, reliable, and accessible. The technologies you have at your side, such as DME, can help with that.
Transformation of features
The third record has three missing values; all of its fields except one are empty. If we have a large dataset, it is probably better to delete this record, because otherwise we would have to estimate and fill in most of its values. This approach is only useful when you have a large amount of data, so that deleting a few records loses little information. Most of the tools we looked at suggest that if more than 30–35 percent of a record's values are missing, the record should be dropped.
There are two missing values in the first column (Feature 1). Why not drop the incomplete rows (Row 2 and Row 3), and then train a model on the remaining rows, using the other columns as features and Feature 1 as the target class? Is there anything wrong with that? This is essentially what the model-based imputation approach does, instead of dropping the missing values or filling them with the median or mode; both strategies are sketched below.
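A minimal sketch of both strategies with pandas and scikit-learn, under assumed data; the DataFrame and the column names (feature_1, feature_2, feature_3) are hypothetical:

```python
# Two ways of handling missing values: drop/fill, and model-based imputation.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "feature_1": [7.0, None, None, 5.0, 6.0],
    "feature_2": [1.2, 3.4, None, 2.2, 1.9],
    "feature_3": [10.0, 12.0, None, 11.0, 13.0],
})

# Strategy 1: drop records that are mostly empty (> 30-35% missing),
# then fill the remaining gaps with a summary statistic such as the median.
mostly_missing = df.isna().mean(axis=1) > 0.35
cleaned = df.loc[~mostly_missing].fillna(df.median())

# Strategy 2: model-based imputation -- treat feature_1 as the target and
# train a regressor on the complete rows to predict its missing entries.
complete = df.dropna()
model = LinearRegression().fit(complete[["feature_2", "feature_3"]],
                               complete["feature_1"])
missing = df["feature_1"].isna() & df[["feature_2", "feature_3"]].notna().all(axis=1)
df.loc[missing, "feature_1"] = model.predict(df.loc[missing, ["feature_2", "feature_3"]])
```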
The z-score formula is z = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.
After normalizing the data with the z-score, the plot shows the results.
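A short sketch of z-score normalization under that formula; the example values are made up:

```python
# Z-score normalization: z = (x - mean) / std.
import numpy as np

x = np.array([43.0, 44.0, 45.0, 46.0, 47.0])  # a made-up numeric feature
z = (x - x.mean()) / x.std()                  # now has mean 0 and std 1
print(z)  # [-1.414 -0.707  0.     0.707  1.414]
```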
Task 2
To make it easier, we'll assume we've already measured n samples of model errors, e_i, i = 1, 2, ..., n. The uncertainty introduced by observation errors, as well as by the approach used to compare model and observations, is not taken into account. We also take it for granted that the error sample is unbiased.
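As a quick illustration, here is a minimal sketch that summarizes such an error sample; the values e_i are invented purely for illustration:

```python
# Summary statistics for a measured sample of model errors e_i, i = 1..n.
import numpy as np

e = np.array([0.3, -1.1, 0.8, -0.2, 0.4, -0.6])  # hypothetical errors e_i

bias = e.mean()                  # mean error; near 0 if the sample is unbiased
rmse = np.sqrt((e ** 2).mean())  # root mean squared error
print(f"bias = {bias:.3f}, RMSE = {rmse:.3f}")
```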
The upper representation of each segment shows that all of the attributes are different.
Task 3
We choose the 2018 data as a sample to calculate the predictors for g(t+1).
We try to fit linear models in so many difficult problem settings that we have no reason to believe the true data-generating model is linear, particularly when the setting is complex. Since the linear model is just a rough approximation, we assess prediction accuracy instead.
Focusing on prediction is a much more general concept than linear models. We'll come back to this later in the week, but for now, here's a quick recap: models are just approximations, and some methods don't even need an underlying model, so we assess prediction accuracy and use it to judge the utility of a model or method.
Assume we have training data (X_i1, ..., X_ip, Y_i), i = 1, ..., n, which we will use to estimate the regression coefficients and form predictions Ŷ. The test error, also known as the prediction error, is defined as E(Y − Ŷ)², where the expectation is over everything that is random: the training data (X_i1, ..., X_ip, Y_i), i = 1, ..., n, and an independent test point (X_1, ..., X_p, Y).
This was clarified in the context of a linear model, but the concept of test error applies to any prediction method. Often, we want a precise estimate of our method's test error (e.g., for linear regression). What is the reason for this? There are two primary goals:
Predictive assessment: get a firm grasp on the magnitude of the errors we can expect when predicting new observations.
Model/method selection: compare candidate models and choose the one whose estimated test error is smallest.
Assume we use the observed training error, (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)², to estimate the test error of our method.
What's the issue here? The training error is generally overly optimistic as an estimate of the test error. After all, the parameters β_0, β_1, ..., β_p were chosen precisely to make Ŷ_i close to Y_i, i = 1, ..., n, in the first place!
Also, the more complex and adaptive the method is, the more optimistic its training error is as an estimate of its test error; the simulation below illustrates this.
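A small simulation of this effect, under assumed data (a noisy nonlinear curve) and polynomial fits of increasing degree; none of this comes from the assignment itself:

```python
# Training error keeps shrinking as the model grows more flexible,
# while test error eventually gets worse: training error is optimistic.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 50)
y_train = np.sin(4 * x_train) + rng.normal(0, 0.3, 50)
x_test = rng.uniform(0, 1, 1000)
y_test = np.sin(4 * x_test) + rng.normal(0, 0.3, 1000)

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # fit on the training data
    train_err = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_err = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree}: train {train_err:.3f}  test {test_err:.3f}")
```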
The train, test, and sometimes tune sets must all be defined before the supervised learning process can begin. In the standard time-series prediction task, the three sets are consecutive subsequences of the series (e.g., train: g_1, ..., g_f; tune: g_{f+1}, ..., g_{f+r}; test: g_{f+r+1}, ..., g_{f+r+p_max}, with f, r, p_max ∈ N). This is also known as a straightforward split, on the basis that the evaluation window can be moved forward through the series as new data arrives.
For example, a prediction horizon of p_m results in the training set (g_1, ..., g_{1+p_m}), with the tune and test sets following consecutively.
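A sketch of that consecutive split on a toy series, with assumed lengths f, r, and p_max:

```python
# Split a series g_1, ..., g_N into consecutive train/tune/test segments.
import numpy as np

g = np.arange(1, 101)           # a toy series with N = 100 observations
f, r, p_max = 70, 15, 15        # assumed lengths, with f + r + p_max <= N

train = g[:f]                   # g_1, ..., g_f
tune = g[f:f + r]               # g_{f+1}, ..., g_{f+r}
test = g[f + r:f + r + p_max]   # g_{f+r+1}, ..., g_{f+r+p_max}
print(len(train), len(tune), len(test))  # 70 15 15
```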
In linear regression, ordinary least squares (OLS) residuals are often used as estimates of the unknown true errors. Due to shrinkage and superimposed normality effects, these estimates can offer a false impression of the true error distribution. RMOLS is an alternative that offers greater power for one form of normality measure, and a Monte Carlo study can be used to examine its properties.
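A minimal Monte Carlo sketch of the shrinkage effect, under an assumed Gaussian linear model (not the study referenced above): OLS residuals are systematically less spread out than the true errors they estimate:

```python
# Compare the variance of OLS residuals to the variance of the true errors.
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 20, 3, 2000
ratios = []
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
    eps = rng.normal(0.0, 1.0, n)                    # true errors
    y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + eps    # assumed coefficients
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat                         # OLS residuals
    ratios.append(resid.var() / eps.var())
print(np.mean(ratios))  # well below 1: residuals understate the error spread
```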
Calculate the mean squared error:
The mean squared error tells you how close a regression line is to a set of points. It accomplishes this by squaring the distances between the points and the regression line (these distances are the "errors"). Squaring is needed to eliminate any negative signs, and it also gives larger deviations more weight. Since you're averaging a set of squared errors, it's called the mean squared error.
To determine the mean squared error from a set of X and Y values, follow these steps:
Find the regression line.
To find the new Y values (Y'), plug the X values into the linear regression equation.
To find each error, subtract the new Y value from the original Y value.
Square the errors, add them all up, and divide by the number of values; the result is the mean squared error.
Calculate the mean squared error for these points: (43,41), (44,45), (45,49), (46,47), (47,44).
Step 1: Find the regression line. I used an online calculator and got the regression line Y' = 9.2 + 0.8X.
Step 2: Plug each X value into the regression equation to find the new Y values:
9.2 + 0.8(43) = 43.6
9.2 + 0.8(44) = 44.4
9.2 + 0.8(45) = 45.2
9.2 + 0.8(46) = 46
9.2 + 0.8(47) = 46.8
Step 3: Subtract each new Y value from the original Y value to find the errors:
41 - 43.6 = -2.6
45 - 44.4 = 0.6
49 - 45.2 = 3.8
47 - 46 = 1
44 - 46.8 = -2.8
Step 4: Square the errors:
(-2.6)² = 6.76
(0.6)² = 0.36
(3.8)² = 14.44
(1)² = 1
(-2.8)² = 7.84
Step 5: Add all of the squared errors up: 6.76 + 0.36 + 14.44 + 1 + 7.84 = 30.4.
Step 6: Divide the sum by the number of points to find the mean squared error: 30.4 / 5 = 6.08.
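The same computation can be checked in a few lines of Python; the points and the regression line Y' = 9.2 + 0.8X are taken from the steps above:

```python
# Verify the worked MSE example.
import numpy as np

x = np.array([43, 44, 45, 46, 47])
y = np.array([41, 45, 49, 47, 44])

y_pred = 9.2 + 0.8 * x       # Step 2: new Y values
errors = y - y_pred          # Step 3: errors
mse = np.mean(errors ** 2)   # Steps 4-6: square, add up, divide by n
print(mse)                   # 6.08
```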
Comments:
The smaller the mean squared error, the closer you are to finding the line of best fit. Depending on your data, it may be impossible to get a very small value for the mean squared error. For example, the above data is scattered wildly around the regression line, so 6.08 is as good as it gets (and that line is, in fact, the line of best fit). Note that I used an online calculator to get the regression line; where the mean squared error really comes in handy is when you are finding an equation for the regression line by hand: you could try several equations, and the one that gave you the smallest mean squared error would be the line of best fit. A sketch of that idea follows.
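A sketch of that try-several-equations idea; the candidate lines other than 9.2 + 0.8X are invented guesses:

```python
# Score hand-picked candidate lines by MSE and keep the best one.
import numpy as np

x = np.array([43, 44, 45, 46, 47])
y = np.array([41, 45, 49, 47, 44])

candidates = [(9.2, 0.8), (0.0, 1.0), (45.2, 0.0)]  # (intercept, slope) guesses
for b0, b1 in candidates:
    mse = np.mean((y - (b0 + b1 * x)) ** 2)
    print(f"Y' = {b0} + {b1}X  ->  MSE = {mse:.2f}")
# 9.2 + 0.8X gives the smallest MSE (6.08), so it is the best of these guesses.
```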