Data Screening (Sometimes Referred To As "Data Screaming") Is The Process of Ensuring Your Data Is

Data screening involves checking data for errors prior to analysis in order to ensure data quality and validity. This includes identifying out-of-range values, unusual cases, duplicate cases, and other anomalies through manual checks and statistical analyses. Common issues involve inaccurate, missing, or outlier values. Fixing errors typically involves deleting or replacing problematic values while retaining overall data integrity. Screening helps maximize the useful information in data and minimize noise that could distort results.

Uploaded by

Abdullah Afzal

Data screening (sometimes referred to as "data screaming") is the process of ensuring your data is
clean and ready to go before you conduct further statistical analyses. Data must be screened in order to
ensure the data is useable, reliable, and valid for testing causal theory.

Data screening should be conducted prior to data recoding and data analysis, to help ensure the integrity of the
data. It is only necessary to screen the variables and cases used for the analyses presented in the
lab report. Data screening means checking data for errors and then fixing or removing them. The goal is to
maximize "signal" and minimize "noise".

It is very easy to make mistakes when entering data, and some errors can mess up your analysis. So it is important
to spend time checking for mistakes at the start, rather than trying to repair the damage later. If possible, ask another
person to check your data as well.

Fixing or removing incorrect data

Out-of-range values:

1. Out-of-range values are either below the minimum or above the maximum possible value.

Unusual cases:

1. Unusual cases occur when a case's responses are very different from the pattern of responses by most
other respondents.

Duplicate cases:

1. Duplicate cases occur when two or more cases have identical or near-identical data.

Manual check for other anomalies:

1. Check carefully through the data file (case by case and variable by variable), looking for and addressing any
oddities.
2. Empty cases: cases with little or no data could be removed.
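
Two of the checks above, duplicate cases and empty cases, are easy to automate. The following is a minimal sketch in Python rather than SPSS; the records, the item names, and the helper functions `find_duplicates` and `find_empty` are all made up for illustration, and it only catches exact duplicates (near-identical cases still need a manual look).

```python
# Hypothetical survey records: one dict per case; None marks a missing answer.
cases = [
    {"id": 1, "q1": 4, "q2": 5, "q3": 3},
    {"id": 2, "q1": 2, "q2": 1, "q3": 2},
    {"id": 3, "q1": 4, "q2": 5, "q3": 3},           # same answers as case 1
    {"id": 4, "q1": None, "q2": None, "q3": None},  # empty case
]
ITEMS = ["q1", "q2", "q3"]


def find_duplicates(cases, items):
    """Return (first_id, duplicate_id) pairs for cases with identical answers."""
    seen = {}
    pairs = []
    for case in cases:
        key = tuple(case[item] for item in items)
        if all(value is None for value in key):
            continue  # empty cases are reported separately
        if key in seen:
            pairs.append((seen[key], case["id"]))
        else:
            seen[key] = case["id"]
    return pairs


def find_empty(cases, items, min_answered=1):
    """Return ids of cases with fewer than min_answered non-missing answers."""
    return [
        case["id"]
        for case in cases
        if sum(case[item] is not None for item in items) < min_answered
    ]


duplicate_pairs = find_duplicates(cases, ITEMS)
empty_ids = find_empty(cases, ITEMS)
print("Duplicates:", duplicate_pairs)  # Duplicates: [(1, 3)]
print("Empty cases:", empty_ids)       # Empty cases: [4]
```

Raising `min_answered` would also flag cases with only a little data, which is a judgment call for the analyst.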

Screen the data in this order:

1. Accuracy
2. Missing
3. Outlier
4. Assumptions:
 Additivity
 Normality
 Linearity
 Homogeneity / homoscedasticity

Accuracy:

Check for problems with the dataset. Generally, you are looking for values that are out of range:
check the minimum and maximum of each variable to see if they are within what you would expect. Fix a wrong value
or delete just that data point. Do not delete the whole person, just the wrong data point.
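
As an illustration of the accuracy check, here is a minimal Python sketch (not SPSS). The variable names, the valid ranges, and the helper `blank_out_of_range` are assumptions for the example; in practice the ranges would come from your own codebook. It blanks only the bad value and keeps the rest of the case.

```python
# Hypothetical valid ranges; in practice these come from the codebook.
VALID_RANGES = {"age": (18, 99), "satisfaction": (1, 5)}

records = [
    {"id": 1, "age": 25, "satisfaction": 4},
    {"id": 2, "age": 230, "satisfaction": 3},  # age is out of range
    {"id": 3, "age": 41, "satisfaction": 55},  # satisfaction is out of range
]


def blank_out_of_range(records, valid_ranges):
    """Set out-of-range values to None and return a log of what was changed."""
    flagged = []
    for rec in records:
        for var, (low, high) in valid_ranges.items():
            value = rec[var]
            if value is not None and not (low <= value <= high):
                flagged.append((rec["id"], var, value))
                rec[var] = None  # remove only the wrong data point, keep the case
    return flagged


flagged = blank_out_of_range(records, VALID_RANGES)
for case_id, var, value in flagged:
    print(f"case {case_id}: {var}={value} is out of range")
```

If the correct value can be recovered from the original questionnaire, fixing it is preferable to blanking it.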

Missing:
Missing a large share of your data can cause several problems, and with enough missing values some analyses
just won't run. To find out how many missing values each variable has, in SPSS go to Analyze, then
Descriptive Statistics, then Frequencies. Enter the variables in the variables list, then click OK. The table in the
output will show the number of missing values for each variable.
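
Outside SPSS, the same per-variable count can be produced in a few lines. This is a rough Python sketch; the rows, the variable names, and the `missing_counts` helper are invented for the example, and `None` is assumed to be the missing-value marker.

```python
# Hypothetical data: one dict per case; None marks a missing value.
rows = [
    {"age": 25, "income": 40000, "q1": 3},
    {"age": None, "income": 52000, "q1": None},
    {"age": 31, "income": None, "q1": 4},
    {"age": 47, "income": None, "q1": 2},
]


def missing_counts(rows):
    """Count missing (None) values for every variable in the dataset."""
    variables = rows[0].keys()
    return {var: sum(row[var] is None for row in rows) for var in variables}


counts = missing_counts(rows)
for var, n_missing in counts.items():
    print(f"{var}: {n_missing} missing of {len(rows)}")
```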

Two types of missing data are considered here:

1. MCAR: missing completely at random. It is probably caused by skipping a question or missing a trial.

For MCAR data, the affected values or cases can be excluded.

2. MNAR: missing not at random. It may be the question itself that is causing the problem.

For MNAR data, the missing values should be replaced using an imputation function.

To impute values in SPSS, go to Transform, then Replace Missing Values; select the variables that need imputing
and click OK. I use the mean replacement method, but there are other options, including median replacement. Typically
with Likert-type data you want to use median replacement, because means are less meaningful for ordinal responses.
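
The difference between the two replacement methods is easy to see on a small example. The sketch below uses Python's standard `statistics` module; the `impute` helper and the Likert responses are assumptions for illustration and are not how SPSS implements the feature.

```python
from statistics import mean, median


def impute(values, method="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if method == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical responses to one 1-5 Likert item; None marks a missing answer.
likert = [1, 2, 2, None, 5, 2, None]

mean_filled = impute(likert, "mean")
median_filled = impute(likert, "median")
print(mean_filled)
print(median_filled)
```

Here the mean fills the gaps with 2.4, a value no respondent could actually choose, while the median fills them with 2, which stays on the response scale.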

Outliers:
Outliers can influence your results. Outliers are cases with an extreme value on one variable or on a combination of variables.
1. Univariate outliers: cases that are extreme on a single variable.

2. Multivariate outliers: cases that are extreme across multiple variables taken together; the overall pattern of the data is unusual.
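
For a single variable, one common way to flag candidates is the boxplot-style rule: anything more than 1.5 interquartile ranges beyond the quartiles. The Python sketch below is only an illustration; the 1.5 multiplier, the sample values, and the `iqr_outliers` helper are assumptions, and the quartile method used by `statistics.quantiles` may differ slightly from the one your statistics package uses.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical reaction times in seconds; 95 looks like an entry error.
times = [10, 12, 11, 13, 12, 11, 14, 95]
outliers = iqr_outliers(times)
print(outliers)  # [95]
```

A flagged value is only a candidate: check it against the source before deciding to fix or remove it.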

 Outliers will appear at the extremes, and will be labeled, as in the figure below. If you have a really high sample size,
then you may want to remove the outliers. If you are working with a smaller dataset, you may want to be less liberal
about deleting records. However, this is a trade-off, because outliers will influence small datasets more than large
ones.
 Another type of outlier is an unengaged respondent. Sometimes respondents will enter '3, 3, 3, 3,...' for
every single survey item.
 See if the participant answered reverse-coded questions in the same direction as normal questions. For
example, if they responded strongly agree to both of these items, then they were not paying attention: "I am
very hungry", "I don't have much appetite right now".
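
Both checks for unengaged respondents can be sketched in code. The example below is a hedged Python illustration: the item names, the 1-5 scale, and the helpers `is_straight_lined` and `contradicts_reverse_item` are all hypothetical, and real surveys may need a looser rule (for example, very low variance rather than identical answers).

```python
SCALE_MAX = 5  # assumed 1-5 Likert scale where 5 means "strongly agree"


def is_straight_lined(answers):
    """True when every item received the same response, e.g. 3, 3, 3, 3."""
    return len(set(answers)) == 1


def contradicts_reverse_item(normal, reverse, scale_max=SCALE_MAX):
    """True when a respondent strongly agrees with an item and its reverse-coded twin."""
    return normal == scale_max and reverse == scale_max


# Hypothetical responses; "no_appetite" is the reverse-coded twin of "hungry".
respondents = {
    "r1": {"hungry": 5, "no_appetite": 1, "tired": 2, "thirsty": 4},
    "r2": {"hungry": 3, "no_appetite": 3, "tired": 3, "thirsty": 3},
    "r3": {"hungry": 5, "no_appetite": 5, "tired": 4, "thirsty": 2},
}

flags = {}
for rid, answers in respondents.items():
    reasons = []
    if is_straight_lined(list(answers.values())):
        reasons.append("straight-lined")
    if contradicts_reverse_item(answers["hungry"], answers["no_appetite"]):
        reasons.append("reverse-item conflict")
    flags[rid] = reasons

print(flags)
```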

Multivariate outliers:

Multivariate outliers are records that do not fit the standard sets of correlations exhibited by the other records in the dataset, with
regard to your causal model. So, if all but one person in the dataset reports that diet has a positive effect on weight
loss, but this one person reports that he gains weight when he diets, then his record would be considered a multivariate
outlier. To detect these influential multivariate outliers, you need to calculate the Mahalanobis d-squared for each record.
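
To make the calculation concrete, here is a minimal Python sketch of the Mahalanobis d-squared for the two-variable case, d² = (x − μ)ᵀ S⁻¹ (x − μ), where μ is the vector of means and S is the sample covariance matrix. The data and the `mahalanobis_d2` helper are invented for illustration; statistics packages handle any number of variables, and the cutoff for calling a record an outlier is left to the analyst.

```python
from statistics import mean


def mahalanobis_d2(points):
    """Return the squared Mahalanobis distance of each (x, y) point from the centroid."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    d2 = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T, with the 2x2 inverse written out by hand
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical (weeks on diet, weight lost) pairs; the last record breaks the pattern.
records = [(1, 1.0), (2, 2.1), (3, 2.9), (5, 5.1), (6, 5.8), (7, 7.0), (8, 8.2), (4, -2.0)]
d2 = mahalanobis_d2(records)
worst = d2.index(max(d2))
print("Most extreme record:", records[worst])
```

Note that the last record is not extreme on either variable alone (4 weeks is mid-range); it stands out only because it contradicts the relationship between the two variables.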
