
Unit-2

Test Construction: Item Construction, Item Analysis, develop test administration, norms,
scoring and Interpretation of the tests; Tester’s Bias and Extraneous Factors.

TEST CONSTRUCTION

Attention must be given to the following points while constructing an effective, well-designed and relevant questionnaire/schedule:

• The researcher must first define the problem that s/he wants to examine, as it will lay the
foundation of the questionnaire. There must be a complete clarity about the various facets of
the research problem that will be encountered as the research progresses.

• The correct formulation of questions depends on the kind of information the researcher
seeks, the objective of the analysis and the respondents of the schedule/questionnaire. The
researcher should decide whether to use open-ended or close-ended questions. Questions
should be uncomplicated and framed with a clear tabulation plan in mind, so that responses
can be objectively coded and tabulated.

• A researcher must prepare a rough draft of the schedule while giving ample thought to the
sequence in which s/he wants to place the questions. Previous examples of such
questionnaires can also be observed at this stage.

• A researcher should, as a matter of course, recheck the rough draft and make changes where
required to improve it. Technical discrepancies should be examined in detail and corrected
accordingly.

• There should be a pre-testing done through a pilot study and changes should be made to the
questionnaire if required.

• The questions should be easy to understand, and the directions for filling in the questionnaire
should be clearly stated; this should be done to avoid any confusion.

The primary objective of developing a tool is obtaining a set of data that is accurate,
trustworthy and authentic, so as to enable the researcher to gauge the current situation
correctly and reach conclusions that can provide actionable suggestions. However, no tool is
absolutely accurate and valid; it should therefore carry a declaration that clearly states its
reliability and validity.

Steps for Test Construction

Gregory (1992) described five steps in test construction:


(a) defining the test (e.g., purpose, content);
(b) selecting a scaling method (i.e., rules by which numbers or categories are assigned to
responses);
(c) constructing the items (e.g., developing a table of specifications that describes the specific
method employed to measure the test’s content areas);
(d) testing the items (i.e., administering the items and then conducting an item analysis); and
(e) revising the test (e.g., cross-validating it with another sample, because validity shrinkage
almost always occurs).

A researcher evaluating a new mathematics curriculum, for example, might (a) desire a test
that could show changes over time in mathematics skills, (b) assign a score of 1 to each math
item correctly scored, (c) create a table of specifications indicating what kind of skills would
be expected to be acquired, (d) run a study to determine which items were sensitive to
change, and (e) repeat the process with the selected items with a new group of students.

Standardization of Psychological Tests

Standardization refers to the consistency of processes and procedures that are used for
conducting and scoring of a test. To compare the scores of different individuals the
conditions should be the same. In case of a new step the first and major step in
standardization is formulating the directions. This also includes the type of materials to be
used, verbal instructions, time to be taken, the way to handle questions by test takers and all
other minute details of a testing environment. Establishing the norms is also a key step for
standardization. Norm refers to the average performance. To standardize a test, we administer
it to a big, representative sample of the kind of individuals it was designed for. The
aforementioned group sets the norms and is called the standardization sample. The norms for
personality tests are set in the same way as those set for aptitude tests. For both, the norm
would refer to the performance of average individuals. To construct and administer a test,
standardization is a very important. The test is administered on a large set number of the
people (the conditions and guidelines need to be the same for all). After which the scores are
modified using Percentile rank, Z-score, T-score and Stanine, etc. The standardization of a
test can be established from this modified score. Hence, “standardization is a process of
ensuring that a test is standardized, (Osadebe, 2001)”. There are lots of advantages when a
test is standardized. A standard test is usually produced by experts and it is better than teacher
made test. The standardized test is highly valid, reliable and normalized with Percentile rank,
Z-score, T-score among scores derived from others to produce age norm, sex norm, location
norm and school-type norm. Generally, a standardized test could be used to assess, and
compare students in the same norming group. The normal process for administering
standardization includes:

1) A calm, quiet and disturbance free setting


2) Accurate understanding of the written instructions, and
3) Provision of the required stimuli. This makes the normative data applicable to the
individuals being evaluated.
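The derived-score conversions mentioned above can be illustrated with a short computational sketch. This is only a hedged example: the raw scores are hypothetical, the percentile rank is computed simply as the percentage of the standardization sample scoring below each value, and the stanine uses the common rounding rule (mean 5, SD about 2, clipped to 1–9).

```python
import numpy as np

# Hypothetical raw scores from a standardization sample (illustrative only)
raw = np.array([42, 55, 61, 38, 47, 70, 52, 49, 58, 44])

mean, sd = raw.mean(), raw.std(ddof=1)

z_scores = (raw - mean) / sd        # Z-score: distance from the mean in SD units
t_scores = 50 + 10 * z_scores       # T-score: mean 50, SD 10

# Percentile rank: percentage of the sample scoring below each raw score
percentile_ranks = np.array([(raw < x).mean() * 100 for x in raw])

# Stanine: nine-point standard scale (mean 5, SD about 2), clipped to 1..9
stanines = np.clip(np.round(5 + 2 * z_scores), 1, 9).astype(int)

for r, z, t, p, s in zip(raw, z_scores, t_scores, percentile_ranks, stanines):
    print(f"raw={r:3d}  z={z:+.2f}  T={t:5.1f}  PR={p:5.1f}  stanine={s}")
```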

Classification of Standardized Testing:

Norm-referenced Testing: It is used to measure the result or performance in relation to all
other individuals being administered the same test. It can be used to compare an individual to
the others.

Criterion referenced Testing: It is used for measuring the real knowledge of a certain topic.
For example: Multiple choice questions in a geography quiz.
Steps for Constructing Standardized Tests:

A carefully constructed test where the scoring, administration and interpretation of result
follows a uniform process can be termed as a standardized test. Following are the steps that
can be followed to construct a standardised test:
Steps
1) Plan for the test.
2) Preparation of the test.
3) Trial run of the test.
4) Checking the Reliability and Validity of the test.
5) Prepare the norms for the test.
6) Prepare the manual of the test and reproducing the test.

These steps are elaborated below:

1) Planning – There needs to be systematic planning in order to formulate a standardized
test. Its objectives should be carefully defined. The type of content should be determined,
for example whether to use short/long/very short answers or multiple-choice questions, etc. A
blueprint must be prepared with instructions on the sampling method to be used and the
arrangements necessary for preliminary and final administration. The length of the test, the
time for completing it and the number of questions should be fixed. Detailed and precise
instructions should be given for administering the test and for its scoring.

2) Writing the items of the test –


This requires a lot of creativity and depends on the imagination, expertise and
knowledge of the item writer. Its requirements are:
• In-Depth knowledge of the subject
• Awareness about the aptitude and ability of the individuals to be tested.
• Large vocabulary to avoid confusion in writing. Words should be simple and descriptive
enough for everybody to understand.
• Assembly and arrangement of items in a test must be proper, generally done in ascending
order of difficulty.
• Detailed instructions of the objective, time limit and the steps of recording the answers must
be given.
• Help from experts should be taken to crosscheck for subject and language errors.

3) Preliminary Administration – After modifying the items as per the advice of the experts,
the test can be tried out on an experimental basis, which is done to prune out any inadequacy
or weakness of the items. It highlights ambiguous items, irrelevant choices in multiple-choice
questions, and items that are very difficult or very easy to answer. The time duration of the
test and the number of items to be kept in the final test can also be ascertained, which avoids
repetition and vagueness in the instructions. This is done in the following three stages:

a) Preliminary try-out – This is performed individually and helps in improving and
modifying the linguistic difficulty and vagueness of items. It is administered to around a
hundred people, and modifications are made after observing the workability of the items.

b) The proper try-out –


It is administered to approximately four hundred people, wherein the sample is kept the same
as the final intended participants of the test.
This test is done to remove the poor or less significant items and choose the good items and
includes two activities:
• Item analysis – The difficulty of the test should be moderate, with each item discriminating
between high and low achievers. Item analysis is the process of judging the quality of an
item.
• Post item analysis: The final test is framed by retaining good items that have a balanced
level of difficulty and satisfactory discrimination. The blue print is used to guide in selection
of number of items and then arranging them as per difficulty. Time limit is set.

c) Final try-out – It is administered to a large sample in order to estimate the reliability and
validity. It provides an indication of the effectiveness of the test when the intended sample is
subjected to it.

4) Reliability and Validity of the test – When the test is finally composed, it is again
administered to a fresh sample in order to compute the reliability coefficient. This time also,
the sample should not be less than 100. Reliability is calculated through the test-retest
method, the split-half method and the equivalent-form method; it shows the consistency of
test scores. Validity refers to what the test measures and how well it measures it. If a test
measures well the trait that it intends to measure, it can be said to be valid. Validity is
established by correlating the test with some outside independent criterion.
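To make the split-half idea concrete, the following is a minimal sketch assuming simulated 0/1 item responses (the data-generating model and all values are illustrative assumptions, not from the source). It correlates odd-item and even-item half scores and applies the Spearman-Brown correction; test-retest and equivalent-form estimates would instead correlate scores from two administrations or two parallel forms.

```python
import numpy as np

# Simulated (hypothetical) 0/1 responses: 100 examinees x 20 items,
# generated so that higher-ability examinees answer more items correctly.
rng = np.random.default_rng(0)
ability = rng.normal(size=(100, 1))
difficulty = rng.normal(size=(1, 20))
prob_correct = 1 / (1 + np.exp(-(ability - difficulty)))
responses = (rng.random((100, 20)) < prob_correct).astype(int)

odd_total = responses[:, 0::2].sum(axis=1)    # score on odd-numbered items
even_total = responses[:, 1::2].sum(axis=1)   # score on even-numbered items

r_half = np.corrcoef(odd_total, even_total)[0, 1]   # half-test correlation
r_full = (2 * r_half) / (1 + r_half)                # Spearman-Brown correction

print(f"Split-half correlation:                {r_half:.3f}")
print(f"Estimated full-test reliability (S-B): {r_full:.3f}")
```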

5) Norms of the final test – The test constructor also prepares norms for the test. Norms are
defined as average performance scores. They are prepared so that the scores obtained on the
test can be interpreted meaningfully. The obtained scores by themselves convey no meaning
regarding the ability or trait being measured, but when they are compared with norms, a
meaningful inference can be drawn immediately.
The norms may be age norms, grade norms, etc., as discussed earlier. The same norms cannot
be used for all tests.

6) Preparation of manual and reproduction of the test – The manual is prepared as the last
step, and the psychometric properties of the test, its norms and references are reported in it. It
describes in detail the process of administering the test, its duration and the scoring
technique. It also contains all instructions for the test.

Item Construction


DeVellis (1991) provided several simple guidelines for item writing.

 Define clearly what you want to measure: as specific as possible.


 Generate an item pool. Avoid exceptionally long items, which are rarely good.
 Keep the level of reading difficulty appropriate for those who will complete the
scale.
 Avoid “double-barreled” items that convey two or more ideas at the same time.
 For example, consider an item that asks the respondent to agree or disagree with
the statement, “I vote for this party because I support social programs.”
 Consider mixing positively and negatively worded items. Sometimes,
respondents develop the “acquiescence response set.”
Item Formats
 The dichotomous format.
 The polytomous format.
 The Likert format.
 The category format.
 Checklists and Q-sorts.

Test item construction is the process of designing and developing questions or prompts used
in psychological assessments to measure specific psychological constructs such as cognitive
abilities, personality traits, or emotional states. Properly constructed test items are crucial for
ensuring the validity, reliability, and fairness of psychological tests. The quality of these
items directly impacts the accuracy of the test results and the conclusions drawn from them.

How It Is Ensured That Items Are Appropriately Constructed:

1. Define the Construct:


o Clarify Objectives: Start by clearly defining what psychological construct the
test is intended to measure (e.g., intelligence, anxiety, self-esteem).
o Develop a Framework: Use theoretical frameworks and existing research to
inform the construction of items, ensuring they align with the intended
construct.

2. Create Item Types:


o Select Formats: Choose appropriate item formats based on the construct
being measured. Common formats include multiple-choice, true/false, short
answer, essay, and Likert scale items.
o Design Questions: Develop questions or prompts that accurately reflect the
construct and are suitable for the test’s format.

3. Ensure Relevance and Clarity:


o Content Relevance: Each item should be directly related to the construct and
objectives of the test. Avoid irrelevant or extraneous content.
o Clear Wording: Use clear, precise language to avoid ambiguity. Ensure that
items are straightforward and easily understandable by the target population.
4. Maintain Fairness and Bias Control:
o Avoid Bias: Ensure items are free from cultural, gender, or social biases.
Items should be equitable and accessible to all test-takers.
o Pilot Testing: Conduct preliminary tests to identify and address potential
biases or misunderstandings.

5. Assess Difficulty and Complexity:


o Appropriate Difficulty: Design items that are appropriately challenging for
the target population. Avoid items that are too easy or too difficult.
o Cognitive Load: Ensure that the complexity of items matches the cognitive
abilities of the test-takers.

6. Item Analysis and Refinement:


o Pilot Testing: Use pilot studies to test items with a sample from the target
population. Collect data on item performance and gather feedback.
o Statistical Analysis: Analyze item difficulty, discrimination, and reliability.
Techniques like item response theory (IRT) and classical test theory (CTT)
can be used to refine items based on statistical metrics.

7. Review and Revise:


o Expert Review: Have items reviewed by experts in the field to ensure they
meet theoretical and practical standards.
o Continuous Improvement: Regularly review and update items based on
feedback, performance data, and evolving research.

8. Ethical Considerations:
o Confidentiality: Ensure that responses are kept confidential and used only for
their intended purpose.
o Informed Consent: Obtain consent from participants, clearly explaining the
purpose of the test and how the data will be used.

Item analysis
Item analysis is a procedure by which we analyse the items to judge their suitability or
unsuitability for inclusion in the test. As we know, the quality or merit of a test depends upon
the individual items which constitute it, so only those items which suit our purpose are to be
retained. Item analysis is an integral part of establishing the reliability and validity of a test.
The worth of an item is judged from three main angles, viz.

a) Difficulty index of the item
b) Discriminating power of the item
c) Its internal consistency with the whole test.

a) Item difficulty
When an item is too easy, all the students would answer it. If it is too hard, nobody would
answer it. What is the use of having such items in a test? If all the students get equal scores,
the very purpose of the test (i.e. to assess the ability of students) is defeated. So it is clear that
too easy and too difficult items are to be totally discarded. It is desirable that items of a
medium difficulty level must be included in a test. Item difficulty is calculated by different
methods.

Method 1: Item difficulty (I.D.) is calculated using the formula I.D. = (R / N) × 100, where R = no.
of testees answering the item correctly, and N = total no. of testees. If, in a test administered to 50
pupils, an item is passed by (i.e. correctly marked by) 35 students, then I.D. = (35 / 50) × 100 = 70. Here,
we understand that the item is easy. In essence, the higher the I.D. value, the easier the item, and
the lower the I.D. value, the more difficult the item. Usually, items with I.D. values
between 16 and 84 (or 15 to 85) are retained.
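As a small illustration of the formula above, here is a sketch of the difficulty-index computation using the worked example of 35 correct responses out of 50 testees (the function name is simply illustrative).

```python
def item_difficulty(correct_responses: int, total_testees: int) -> float:
    """Difficulty index I.D. = (R / N) * 100: percentage answering correctly."""
    return (correct_responses / total_testees) * 100

print(item_difficulty(35, 50))  # 70.0 -> a relatively easy item
# Items with I.D. roughly between 16 and 84 (or 15 and 85) would be retained.
```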

(b) Discriminating index: To be considered good, an item must have discriminating power.
For example, if an item is too easy or too difficult to all the testees, it can’t discriminate
between individuals. Logically, it is expected that a majority of students of a better standard
and a few students of lower standard will answer an item correctly. Thus, an item must
discriminate between persons of the high group and the low group. In other words,

WL = Number of persons in the lower group (i.e. 27% of N) who have wrongly answered an
item or omitted it.

WH = Number of persons in the higher group who have wrongly answered an item or
omitted it.

It is expected that WL will always be more than WH, i.e., WL - WH will always be positive.
If WH is more than WL, the item is ambiguous and is to be totally rejected.

We need to calculate the WL - WH value for each item.


For example, we may find that for an N of 120, the minimum WL - WH value for an item with 4
options should be 16. So, all items whose WL - WH value is 16 or above are considered to
be sufficiently discriminating. If the WL - WH value of an item is less than 16, it is to be rejected.
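The WL - WH check described above can be sketched as follows. This is a hedged illustration with hypothetical data: the total scores and item responses are simulated, the upper and lower groups are formed from the top and bottom 27% of total scores, and the 16-point cut-off is simply the rule of thumb quoted in the text for N = 120 and a 4-option item.

```python
import numpy as np

rng = np.random.default_rng(1)
total_scores = rng.normal(50, 10, size=120)          # hypothetical total test scores
item_correct = (rng.random(120) < 0.6).astype(int)   # hypothetical 1/0 responses to one item

n = len(total_scores)
k = int(round(0.27 * n))                  # size of the 27% upper and lower groups
order = np.argsort(total_scores)

low_group = item_correct[order[:k]]       # responses of the bottom 27% of scorers
high_group = item_correct[order[-k:]]     # responses of the top 27% of scorers

WL = int((low_group == 0).sum())          # wrong or omitted in the lower group
WH = int((high_group == 0).sum())         # wrong or omitted in the higher group

print(f"WL - WH = {WL - WH}")
# For N = 120 and a 4-option item, the text's rule of thumb retains the item
# only if WL - WH is 16 or more.
```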

c) Internal consistency of items with the whole test

Statistical methods are used to determine the internal consistency of items. Biserial
correlation gives the correlation of an item with its sub-test scores and with total test-scores.
This is the process of establishing internal validity. There are also other methods of assessing
the internal consistency of items, but as they are beyond the scope of our present purpose,
they are not discussed here.

Item-total correlation is a critical concept in item analysis, particularly in classical test theory.
Definition: The item-total correlation is the correlation between a single test item’s score and
the total score of the test (excluding the item in question). It measures the relationship
between how a person scores on a specific item and their overall performance on the test.

Purpose: This statistic helps determine how well each item on a test contributes to the overall
measurement of the construct. High item-total correlations indicate that the item is a good
measure of the underlying construct being tested, as it is consistent with the total test score.

Calculating Item-Total Correlation:

1. Score Calculation:

- For each test-taker, calculate the total score of the test excluding the item of interest.

2. Correlation Computation:

- Compute the Pearson correlation coefficient between the item scores and the total scores
(excluding the item). This can be done using statistical software or formulas (a short sketch follows).
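The following is a minimal sketch of the corrected item-total correlation, assuming a hypothetical matrix of 0/1 responses (rows are test-takers, columns are items). Each item is correlated with the total score computed from the remaining items, and items falling below the conventional 0.30 guideline discussed in the next subsection are flagged.

```python
import numpy as np

# Hypothetical 0/1 response matrix: 200 test-takers x 10 items
rng = np.random.default_rng(2)
ability = rng.normal(size=(200, 1))
difficulty = rng.normal(size=(1, 10))
responses = (rng.random((200, 10)) < 1 / (1 + np.exp(-(ability - difficulty)))).astype(int)

for i in range(responses.shape[1]):
    item = responses[:, i]
    rest_total = responses.sum(axis=1) - item       # total score excluding this item
    r = np.corrcoef(item, rest_total)[0, 1]         # Pearson item-total correlation
    flag = "" if r >= 0.30 else "  <- candidate for revision or removal"
    print(f"item {i + 1:2d}: corrected item-total r = {r:+.2f}{flag}")
```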

Interpreting Item-Total Correlation:

- High Correlation: A high positive correlation (e.g., above 0.30) suggests that the item is
consistent with the overall test score, meaning it contributes well to measuring the construct
and is aligned with the other items on the test.

- Low or Negative Correlation: A low or negative correlation indicates that the item may not
be measuring the same construct as the other items or may be poorly worded or ambiguous.
This could signal a need for revision or removal of the item.

Uses in Test Development:

- Item Selection: Items with high item-total correlations are typically retained in the test as
they contribute effectively to the test's overall reliability.
- Item Revision: Items with low correlations might be revised or removed to improve the
test’s consistency and overall validity.

- Reliability Assessment: Item-total correlations help in assessing the internal consistency of


the test. If most items have high correlations with the total score, it suggests good internal
consistency.

Considerations:

- Contextual Factors: Ensure that the item-total correlation is interpreted in the context of the
test’s purpose and content. Sometimes, an item with a lower correlation might still be
valuable for assessing certain aspects of the construct.

- Avoiding Over-reliance: While item-total correlation is informative, it should not be the


sole criterion for item selection. Consider other factors like item difficulty, discrimination
indices, and content validity.

In summary, item-total correlation is a useful statistic in item analysis for evaluating how
well individual items contribute to the overall test construct. It provides insights into the
effectiveness of each item and helps guide decisions in test development and refinement.

Develop Test Administration

1. Principles of Test Administration

a. Standardization
Administering tests under uniform conditions to ensure fairness and comparability.
- Importance:
- Consistency: Reduces variability in test conditions.
- Reliability: Enhances test consistency by minimizing sources of error.
- Validity: Ensures the test measures what it is intended to measure under consistent
conditions.
- Implementation:
- Follow a standardized administration protocol.
- Train all personnel involved in the administration.
- Document all procedures and conditions.

b. Instructions
Clear, concise, and consistent guidance provided to test-takers.
- Importance:
- Clarity: Ensures test-takers understand what is required.
- Fairness: Provides all test-takers with an equal understanding of test expectations.
- Implementation:
- Use straightforward language.
- Deliver instructions consistently.
- Provide examples if needed.

c. Environment
- Definition: The physical and situational conditions where the test is administered.
- Importance:
- Minimize Distractions: Helps maintain focus and reduces variability.
- Comfort: Supports concentration and reduces test-taker stress.
- Implementation:
- Choose a quiet, controlled location.
- Ensure comfortable seating and functional equipment.
- Maintain appropriate lighting and temperature.

2. Practical Skills
a. Setting Up
Preparing materials and equipment for test administration.
Importance:
- Readiness: Ensures all materials are available and in working order.
- Efficiency: Facilitates smooth administration.
- Implementation:
- Use a checklist to verify all materials.
- Check and prepare any necessary equipment.
- Organize materials systematically.

b. Timing
Managing and adhering to time limits during the test.
- Importance:
- Fairness: Provides equal time for all test-takers.
- Accuracy: Measures performance under timed conditions.
- Implementation:
- Use reliable timing devices.
- Clearly communicate time limits.
- Monitor and manage time during the test.
- Handle any time-related issues consistently.

c. Handling Issues
- Definition: Addressing technical problems or unexpected situations during administration.
- Importance:
- Adaptability: Ensures test continuity despite issues.
- Fairness: Minimizes disruption and potential impact on performance.
- Implementation:
- Develop contingency plans for common problems.
- Document issues and resolutions.
- Provide immediate assistance to test-takers as needed.

3. Ethical Considerations
a. Confidentiality
Protecting test-taker privacy and data.
- Importance:
- Trust: Builds confidence in the testing process.
- Compliance: Meets legal and ethical standards for data protection.
- Implementation:
- Securely store test materials and results.
- Limit access to authorized personnel.
- Follow data protection laws and guidelines.
b. Informed Consent
Providing test-takers with information about the test and obtaining their consent.
- Importance:
- Transparency: Ensures test-takers understand the test and its use.
- Autonomy: Respects the right to make informed decisions.
- Implementation:
- Disclose the purpose, procedures, and potential risks of the test.
- Obtain written or electronic consent.
- Allow test-takers to ask questions and address concerns.

NORMS

Norm refers to the typical performance level for a certain group of individuals. Any
psychological test score is meaningless as a raw score alone until it is supplemented by
additional data to interpret it further. Therefore, a raw score on a psychological test is
generally interpreted by referring to the norms, which depict the scores of the standardization
sample. Norms are established empirically by determining how individuals from a specified
group actually perform on the test. To determine accurately a subject’s (individual’s) position
with respect to the standardization sample, the raw score is transformed into a relative measure.
Such derived scores serve two purposes: 1) they indicate the individual’s standing in relation
to the normative sample and help in evaluating the performance, and 2) they provide
measures that can be compared, allowing an individual’s performance on various tests to be
gauged.
Types of Norms

Fundamentally, norms are expressed in two ways: developmental norms and within-group norms.
1) Developmental Norms: These depict the normal developmental path of an individual’s
progression. They can be very useful for description but are not well suited for precise
statistical treatment. Developmental norms can be classified as mental age norms,
grade equivalent norms and ordinal scale norms.
2) Within-Group Norms: This type of norm is used to compare an individual’s performance
with the performance of the most closely related group. They carry a clear and well-defined
quantitative meaning which can be applied in most statistical analyses.

a) Percentiles (P(n) and PR): They refer to the percentage of people in a standardized sample
who fall below a given score. They depict an individual’s position with respect to the
sample. Here the counting begins from the bottom, so the higher the percentile, the better the
rank. For example, if a person scores at the 97th percentile in a competitive exam, it means 97% of
the participants have scored less than him/her. (A short computational sketch of these
within-group interpretations follows this list.)

b) Standard Score: It signifies the gap between the individual’s score and the mean, expressed
in standard deviation units of the distribution. It can be derived by linear or nonlinear
transformation of the original raw scores. Standard scores include T-scores and Z-scores.

c) Age Norms: To obtain these, we take the mean raw score of all individuals in a common
age group within the standardization sample. Hence, the 15-year norm would be the mean
raw score of students aged 15 years.

d) Grade Norms: They are calculated by finding the mean raw score earned by students in a
specific grade.
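As an illustration of within-group norms, the following hedged sketch locates a single hypothetical raw score within a hypothetical normative sample as a percentile rank and as Z- and T-scores; all values are assumptions for demonstration only.

```python
import numpy as np

# Hypothetical raw scores of the normative (standardization) sample
norm_sample = np.random.default_rng(3).normal(60, 12, size=500)
raw_score = 78                                 # one examinee's obtained raw score

percentile_rank = (norm_sample < raw_score).mean() * 100       # % scoring below
z = (raw_score - norm_sample.mean()) / norm_sample.std(ddof=1)  # Z-score
t = 50 + 10 * z                                                 # T-score

print(f"Percentile rank: {percentile_rank:.1f}")
print(f"Z = {z:+.2f}, T = {t:.1f}")
```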
Scoring:

o Objective Scoring: Involves standardized procedures where answers are marked based on a
clear, predefined key or rubric. For example, multiple-choice questions are scored by
counting the number of correct answers (see the short sketch below).
o Subjective Scoring: Involves evaluator judgment, such as in essays or open-ended
questions. Scorers use rubrics or criteria to maintain consistency but are still influenced by
their interpretation.
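A tiny sketch of objective scoring follows: responses are marked against a predefined answer key and the correct answers are counted. The key and responses are purely hypothetical.

```python
# Hypothetical answer key and one test-taker's responses
answer_key = ["B", "D", "A", "C", "B", "A", "D", "C"]
responses  = ["B", "D", "C", "C", "B", "A", "A", "C"]

score = sum(resp == key for resp, key in zip(responses, answer_key))
print(f"Objective score: {score} / {len(answer_key)}")  # 6 / 8
```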

Interpretation:

o Descriptive Interpretation: Involves understanding what the scores represent, such as
average scores, ranges, and standard deviations. It helps in summarizing the data.
o Diagnostic Interpretation: Looks at scores to diagnose specific strengths and
weaknesses or to make educational or psychological assessments.
o Comparative Interpretation: Compares scores against norms or benchmarks,
such as age-related norms or peer performance, to evaluate relative standing.

Tester’s Bias

1. Types of Bias:
o Cultural Bias: Test content may favor one cultural group over another. This
can affect the performance of individuals from different backgrounds.
o Confirmation Bias: Scorers might unconsciously look for evidence that
confirms their pre-existing beliefs or expectations about a test-taker.

o Halo Effect: An overall positive or negative impression of a test-taker can influence the
scoring of specific responses.
o Stereotyping: Assumptions based on a test-taker's demographics can impact
scoring and interpretation.

2. Mitigating Bias:
o Training: Provide thorough training for scorers on rubrics and scoring
procedures.
o Blind Scoring: Ensure that scorers do not know the identity or background of
the test-takers.
o Standardization: Maintain consistent test administration and scoring
practices.

Extraneous Factors

1. Environmental Factors:
o Testing Conditions: Noise, lighting, and seating arrangements can impact
performance. Ensuring a controlled and consistent environment is key.
o Health and Well-being: Test-takers’ physical or emotional state can affect
their performance. Stress, fatigue, or illness can skew results.

2. Test Administration:
o Consistency: Variations in how the test is administered (e.g., instructions
given, time limits) can affect results. Adherence to standardized procedures
helps mitigate this.
o Preparation: Test-taker familiarity with the test format and preparation level
can influence scores. Providing preparatory resources or information can help
level the playing field.

3. Test Design:
o Clarity and Fairness: Tests should be well-designed to measure what they
intend to without ambiguity or unfair advantage. Ensuring that the questions
are clear and relevant to the test objectives is crucial.
