
Item Analysis for Examination Test in the Postgraduate Student's Selection with Classical Test Theory and Rasch Measurement Model

Dedy Triono
Department of Technology Management
Institut Teknologi Sepuluh Nopember
Surabaya 60264, Indonesia
[email protected]

Riyanarto Sarno
Department of Informatics
Institut Teknologi Sepuluh Nopember
Surabaya 60111, Indonesia
[email protected]

Kelly R. Sungkono
Department of Informatics Engineering
Institut Teknologi Sepuluh Nopember
Surabaya, Indonesia
[email protected]

Abstract— University entrance exams are conducted to ensure applicants' qualifications are placed into the program of their choice. Test results have important and significant value in making the right decision about the suitability of the applicant, so the validity of the exam is significant to achieve the objectives set. The purpose of this study is to provide empirical evidence of the validity of the new construct in developing the Academic and English Test Exams using Classical Test Theory and the Rasch Measurement Model. The Admission Test for the postgraduate entrance examination consists of 120 multiple choice items with five answer/option choices (A-E); it has been developed and assessed by experts who are competent in their fields, and the questions were given to 409 postgraduate entrance exam participants. The software applications used for the CTT and Rasch analyses are ITEMAN version 3 and jMetrik version 4 for Windows, both free of license fees. The software automatically generates parameter estimates and recommendations for assessing the quality of test items. The CTT results identified 39 questionable items using the difficulty and discrimination indices. The Rasch results show that the person statistics (separation 2.55 > 2.00 and reliability 0.87 > 0.80) and item statistics (separation 9.4 > 3.0 and reliability 0.99 > 0.8) indicate excellent person and item reliability. Overall, the Rasch model identified 68 items as misfitting or irrelevant constructs and suggested their removal. While CTT provides limited information from two parameters, the Rasch results provide very detailed information about the quality of the items being tested. Thus the two models can be integrated to produce sufficient evidence of the validity and reliability of items in the development of standardized tests, and the two approaches identified 28 problem items in common. These results indicate that more items are recommended for removal by the Rasch model than by CTT, which can be linked to the procedures followed by the two frameworks in determining the quality of test items.

Keywords— CTT, Rasch Model, Item Analysis

I. INTRODUCTION

To get high-quality question instruments, in addition to theoretical analysis (item review), empirical analysis is also necessary. This empirical item analysis can be divided into two approaches, namely the classical test theory approach and item response theory (IRT) [1]. A test is a measurement technique designed as a systematic procedure for studying the behaviour of individuals or groups of individuals [2]. Accordingly, two analytical methods are generally used in developing tests: traditional or standard item analysis based on Classical Test Theory (CTT), and modern analysis based on item response theory (IRT). These processes generally follow the identification of the objectives of the test and the preparation of a pool of items in the test preparation process. To produce tests for educational measurement, the criteria and guidelines that have been established for the development of valid and reliable tests must be followed adequately. This provides accurate information in the use and construction of tests [1].

Analysis of test instruments in education can be done through two approaches. The first approach is the most common and is still widely applied in education, especially in research, namely classical test theory (CTT). This statement follows the report of [3], a study entitled "The accuracy of the results of item analysis according to classical test theory and item response theory in terms of sample size," which notes that classical test theory (CTT) is a popular analytical technique still in use in this century. The classical test theory developed by Charles Spearman in 1904 can be used to predict the results of an exam. In classical test theory, the aspects that largely determine the quality of the items are the level of difficulty and the discriminating power of the questions. However, the item characteristics produced by classical test theory are inconsistent (changing) depending on the ability of the test-takers. According to [4], measurement errors in classical test theory can only be estimated for groups, not individuals. The second approach is a modern approach using the Rasch model, coined by Dr Georg Rasch, a Danish mathematician. Rasch modelling exists to overcome the weaknesses of classical test theory. Rasch modelling provides a different approach to the use of exam scores or raw data in the context of educational assessment. The aim is to produce a measurement scale with equal intervals that can provide accurate information about the test-takers as well as the quality of the questions being worked on. In other words, Rasch model analysis produces information about the characteristics of items and students on the same metric [5]. In this study, a comparative analysis of the quality of the test instrument is carried out on the elements of validity, reliability, level of difficulty, and discriminating power of the questions through the two approaches described above, namely classical test theory and the Rasch model.
II. RELATED WORK

A. Classical Test Theory (CTT)
Classical test theory adopts a deterministic approach (certainty) wherein the main focus of the analysis is the total individual score (X). Each test has an error (E) that accompanies each measurement result when measuring human attributes. The pure score (T) and the error (E) are both latent variables, but the purpose of testing is to infer the individual's absolute score. The score of each item can also be ascertained as right or wrong; for example, a correct answer is given a score of 1 and an incorrect answer a score of 0. IRT, by contrast, focuses on the probability of answering each item: the assessment is not based on someone's total score but considers the person's response at the level of each question. Scoring is also not a matter of assigning 1 or 0, but of the probability of the person obtaining a score of 1 or a score of 0. The mathematical formula called the CTT model is represented in Eq. (1):

X = T + E (1)

This assumption states that the relationship between the observed score (X), the pure score (T), and the measurement error (E) is additive. The observed score (X) obtained by an individual is an accumulation of the absolute rating (T) and the measurement error (E).
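As an illustration of Eq. (1), the following minimal Python sketch (not part of the original study; the sample size and score scale are assumed for demonstration) simulates observed scores as the sum of a latent true score and random error, showing that a zero-mean error washes out at the group level, which is why CTT measurement error is a group-level notion:

```python
import numpy as np

rng = np.random.default_rng(0)

n_persons = 409                               # assumed to mirror the study's sample size
true_scores = rng.normal(68, 12, n_persons)   # latent T (hypothetical scale)
errors = rng.normal(0, 5, n_persons)          # measurement error E, mean zero

observed = true_scores + errors               # Eq. (1): X = T + E

# With E averaging zero, the mean of X estimates the mean of T.
print(f"mean T = {true_scores.mean():.2f}, mean X = {observed.mean():.2f}")
```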

B. Classical Test Item Analysis
In the test preparation process, items that have been qualitatively reviewed by experts in their fields can be declared valid in content. However, for an achievement test, additional analysis is necessary to obtain items with a high degree of measurement precision and discriminating power, so that the goal of the measure, distinguishing the ability of one test-taker from another, can be achieved. This procedure is often referred to as item analysis and selection, because its purpose is nothing other than knowing which items are feasible to be maintained, revised, or even discarded. The process of analyzing and selecting items based on classical test theory pays attention to three parameters, namely (1) item difficulty level, (2) item discrimination power, and (3) distractor effectiveness [6]. The analysis is carried out based on the subjects' answers to the items in the test. Even though the level of difficulty of the questions and their discrimination power are calculated separately, in the evaluation the item is seen as a unitary component that determines whether an item is considered good or not. The third parameter, the effectiveness of the distractor, only applies to questions in multiple-choice form.

C. Level of difficulty
The item difficulty index, as stated by [7], is the "proportion of examinees who get that item correct." This means that the level of difficulty of a test item is a number that shows the proportion of test participants who answer the question correctly, while the level of difficulty of the test set is a number that shows the average percentage of test participants who answer the whole test set correctly. The formula used to determine the level of difficulty is given in Eq. (2):

p = nB / n (2)

where:
p: level of difficulty of the test item
nB: number of subjects answering correctly
n: total number of subjects

The mathematical model states that the level of difficulty of an item (p) is the number of participants who answered the question correctly divided by the total number of participants.

As stated by Allen & Yen, a good item has a difficulty between 0.3 and 0.7. Items with difficulty levels below 0.3 are considered difficult, whereas items with an index above 0.7 are considered easy [7]. Thus the difficulty level (P) criteria can be written as in TABLE I.

TABLE I. INTERPRETATION OF ITEM DIFFICULTY INDEX

Difficulty Index | Interpretation
P ≤ 0.30 | Difficult
0.31 ≤ P ≤ 0.70 | Moderately difficult
P > 0.70 | Relatively easy
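A minimal Python sketch of Eq. (2) and the TABLE I cut-offs follows; the response matrix here is a toy example, not the study's data:

```python
import numpy as np

def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Eq. (2): proportion of examinees answering each item correctly.

    responses: 0/1 matrix of shape (n_subjects, n_items).
    """
    return responses.mean(axis=0)          # p = nB / n per item

def interpret_difficulty(p: float) -> str:
    """Classify p according to TABLE I."""
    if p <= 0.30:
        return "Difficult"
    if p <= 0.70:
        return "Moderately difficult"
    return "Relatively easy"

# Toy example: 5 subjects x 3 items
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 0, 1],
              [1, 1, 1],
              [1, 0, 1]])
for j, p in enumerate(item_difficulty(X), start=1):
    print(f"item {j}: p = {p:.2f} ({interpret_difficulty(p)})")
```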
accepted. If the "DB" is negative, the problem is terrible and where the reliability coefficient of the test is influenced
must be discarded. Information: by the number of items (k) multiplied by the results of the
distribution of the score variance of the question. i with
information: the total score variance.
nBA : number of subjects who answered correctly in the information:
upper group r1.1: the reliability coefficient of the test device
nBB : the number of participants in the lower group who k: many test items
answered correctly SDi2: score variance per item
nA : number of items in the top group SDt2: whole score variant
nB : the number of participants in the smaller group The reliability level of the instrument can be determined
So these statistics show the extent to which a test from the value of r can be seen in TABLE III.
successfully distinguishes between people with high ability TABLE III. INSTRUMENT RELIABILITY LEVEL (R)
and people with low knowledge. Different power groupings
according to [8], are presented in TABLE II. Item Reliability Level of item
r ≤ 0.20 Very Poor
0.20 < r ≤ 0.40 Poor
TABLE II. DIFFERENTIAL POWER CRITERIA (DB)
0.40 < r ≤ 0.60 Medium
Item Discrimination Quality of Item 0.60< r ≤ 0.80 High
D ≥ 0.40 Good question 0.80< r ≤ 1.00 Very High
0.30 ≤ D ≤ 0.39 Questions received and corrected
0.20 ≤ D ≤ 0.29 Problem corrected F. Effectiveness of the distractor
D ≤ 0.19 Problem rejected Each multiple-choice test has one question and several
answer choices. Among the choices of answers, only one is
Thus, item parameters such as the difficulty index and correct. Apart from the right answer, it is the wrong answer.
discrimination index are characteristics that depend on the The wrong answer is what is known as a distractor. Thus, the
sample group used to calculate it. If the test group has a high effectiveness of the distractor is how well the wrong choice
ability, the difficulty index of the test items will below. But on can fool test-takers who do not see the answer keys available.
the contrary, if the test group has a low skill, then the index of The more test takers choose the distractor, and the distractor
the difficulty of the test items will be high, likewise, with the can perform its function correctly. How to analyze the role of
characteristics of other test items. So the value of the the distractor can be done by analyzing the pattern of the
components of the questions will be influenced by the ability dissemination of answer items. The design of the distribution
of one group of test-takers. of answers, as said [11], is a pattern that can illustrate how the
test taker can determine the choice of solutions to the possible
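A small Python sketch of the Eq. (3) statistic, forming the upper and lower groups from total scores; the 27% split used here is a common convention and an assumption on our part, since the paper does not state the group size:

```python
import numpy as np

def item_discrimination(responses: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Eq. (3): D = nBA/nA - nBB/nB per item.

    responses: 0/1 matrix of shape (n_subjects, n_items).
    frac: fraction of subjects in each of the upper/lower groups
          (27% is a common convention; the paper does not specify).
    """
    totals = responses.sum(axis=1)
    order = np.argsort(totals)              # ascending by total score
    k = max(1, int(len(totals) * frac))
    lower = responses[order[:k]]            # lowest-scoring group
    upper = responses[order[-k:]]           # highest-scoring group
    return upper.mean(axis=0) - lower.mean(axis=0)

# Toy example with 6 subjects x 2 items
X = np.array([[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]])
print(item_discrimination(X, frac=0.5))    # D for each item
```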
E. Item Reliability (Test level)
Reliability comes from the word "reliable," which can be interpreted as something that can be trusted. In the same vein, Drost states that "reliability is a major concern when a psychological test is used to measure some attribute or behaviour" [9]. This definition says that reliability is trustworthiness, dependability, constancy, consistency, or stability. There are several types of reliability, namely (1) internal consistency, (2) stability, and (3) equivalence. The internal consistency reliability of a measuring instrument can be calculated using the Cronbach's alpha formula, the Kuder-Richardson formulas (KR20 or KR21), and the split-half technique. Suparwoto states that the Cronbach's alpha coefficient can be used for item analysis with dichotomous scores of 1 (true) and 0 (false), or with graded scores of 1, 2, 3, and so on; this method is an effort to determine the reliability coefficient of an instrument/test that refers to the concept of internal consistency [10]. The formula used to calculate the Cronbach's alpha coefficient is given in Eq. (4):

r1.1 = (k / (k − 1)) × (1 − ΣSDi² / SDt²) (4)

where the reliability coefficient of the test is determined by the number of items (k) multiplied by one minus the ratio of the summed item score variances to the total score variance, with:
r1.1: the reliability coefficient of the test device
k: number of test items
SDi²: score variance per item
SDt²: total score variance

The reliability level of the instrument, determined from the value of r, can be seen in TABLE III.

TABLE III. INSTRUMENT RELIABILITY LEVEL (R)

Item Reliability | Level of Item
r ≤ 0.20 | Very Poor
0.20 < r ≤ 0.40 | Poor
0.40 < r ≤ 0.60 | Medium
0.60 < r ≤ 0.80 | High
0.80 < r ≤ 1.00 | Very High
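A minimal Python implementation of Eq. (4) on a 0/1 response matrix (toy data for illustration; for dichotomous items this alpha coincides with KR-20):

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Eq. (4): r = k/(k-1) * (1 - sum(item variances) / total variance).

    responses: score matrix of shape (n_subjects, k_items).
    """
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)      # SDi^2 per item
    total_var = responses.sum(axis=1).var(ddof=1)  # SDt^2 of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: 5 subjects x 4 items
X = np.array([[1, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 0]])
print(f"alpha = {cronbach_alpha(X):.3f}")
```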
F. Effectiveness of the distractor
Each multiple-choice test item has one question (stem) and several answer choices. Among the answer choices, only one is correct; the remaining wrong answers are known as distractors. Thus, the effectiveness of a distractor is how well a wrong choice can fool test-takers who do not know the answer key. The more test-takers choose a distractor, the better the distractor performs its function. The role of a distractor can be analyzed from the pattern of the distribution of answers across the items. The distribution of answers, as stated by [11], is a pattern that illustrates how test-takers determine their choice among the possible answers paired with each item. According to Depdikbud (1993: 27), a distractor can be said to function correctly if it is chosen by at least 5% of examinees for four answer choices and 3% for five answer choices. Meanwhile, according to Fernandes (1984: 29), distractors are said to be good if chosen by at least 2% of all participants. Distractors that do not meet these criteria should be replaced with other distractors that may be more attractive for test-takers to select.

A distractor needs to be constructed in such a way as to attract the attention of test-takers who do not yet have a good grasp of the material being tested. [7] states that a good distractor has an index of at least 0.1 in the form of a point-biserial correlation coefficient, with a positive value for the answer key and a negative value for a distractor.
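The following Python sketch illustrates this distractor check on raw option choices (toy data; the 5% default threshold follows the Depdikbud rule for four-option items quoted above, and the function names are ours, not from any item-analysis package):

```python
import numpy as np

def distractor_proportions(choices: np.ndarray, options: str = "ABCDE") -> dict:
    """Proportion of examinees selecting each option of one item."""
    n = len(choices)
    return {opt: np.count_nonzero(choices == opt) / n for opt in options}

def check_distractors(choices: np.ndarray, key: str,
                      min_prop: float = 0.05, options: str = "ABCDE") -> None:
    """Flag distractors chosen by fewer than `min_prop` of examinees
    (0.05 per the Depdikbud rule for four options; use 0.03 or 0.02
    for the other conventions mentioned in the text)."""
    for opt, p in distractor_proportions(choices, options).items():
        role = "KEY" if opt == key else "distractor"
        flag = "" if opt == key or p >= min_prop else "  <- replace?"
        print(f"option {opt} ({role}): {p:.1%}{flag}")

# Toy data: answers of 20 examinees to one five-option item keyed 'B'
answers = np.array(list("BBABCBDBBBBBABBBCBBB"))
check_distractors(answers, key="B")        # option E is never chosen -> flagged
```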
III. METHODOLOGY

A. Design
The purpose of this study was to analyze the level of difficulty of items and the ability of persons using two measurement frameworks, namely Classical Test Theory (CTT) and the Rasch Measurement Model (RMM). A total of 409 students from various Departments of the Faculty of Education took the 2018 Postgraduate entrance examination, which included the Academic Potential Test (TPA) and the English Research Methodology test. The examination consists of 120 multiple choice items with five answer/option choices (A-E), with 150 minutes allowed for the TPA and English tests; the exam time and the number of participants support the stability of the Classical Test Theory results.

B. Data Analysis
Data analysis used the classical item analysis and Rasch model approaches, with the ITEMAN software application version 3 for the classical analysis and jMetrik version 4 for Windows for the Rasch analysis. The parameters used to assess item quality in CTT are item difficulty, discrimination, and reliability. In the Rasch analysis, three different stages of estimation are considered: (i) calibration of test-takers' abilities and item difficulties, (ii) fit estimation, and (iii) assessment of unidimensionality using Principal Component Analysis (PCA) of the Rasch residuals [12], for which a sketch is given below.
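The unidimensionality check in stage (iii) can be sketched as follows in Python; this is our own simplified illustration of PCA on standardized Rasch residuals, not the exact procedure jMetrik runs, and the person measures and item difficulties are simulated rather than estimated:

```python
import numpy as np

def residual_pca_eigenvalues(x: np.ndarray, theta: np.ndarray,
                             b: np.ndarray) -> np.ndarray:
    """Eigenvalues of the correlation matrix of standardized Rasch
    residuals; a leading eigenvalue well above ~2 suggests a secondary
    dimension. x: 0/1 matrix (persons x items); theta, b in logits."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    z = (x - p) / np.sqrt(p * (1 - p))      # standardized residuals
    corr = np.corrcoef(z, rowvar=False)     # item-by-item correlations
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

rng = np.random.default_rng(1)
theta = rng.normal(0, 1, 200)                # assumed person measures
b = rng.normal(0, 1, 10)                     # assumed item difficulties
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
x = (rng.random((200, 10)) < p).astype(int)  # data simulated FROM the model
print(np.round(residual_pca_eigenvalues(x, theta, b), 2))
# For model-fitting data the leading eigenvalue stays small (near 1-2).
```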
C. IRT Method
In the CTT method, the item difficulty level depends on the ability of the test-takers. If the test-takers' ability is high, the item difficulty level is low; conversely, if the test-takers' ability is low, the item difficulty level becomes high. The item discrimination level and reliability depend on the heterogeneity and distribution of the test-takers' abilities, and the ability of test-takers is interpreted in terms of the number-correct score. In IRT, the ability of participants is not affected by the characteristics of the items, and the characteristics of the questions are not affected by the ability of individuals. The essence of IRT is that the difficulty of items and the abilities of individuals are measured on the same scale, so a match is needed between the model and the data. IRT is a statistical theory that contains a mathematical model stating the probability of a specific response to an individual item as a function of one's ability and particular characteristics of the item [13]. Item response theory is also often referred to as latent trait theory, a very significant development in the fields of educational and psychological measurement.

Latent trait theory uses three primary concepts in developing measurement models, namely the dimensionality of the latent space, local independence, and item characteristic curves [13]. This theory states that a person's behaviour can be explained to a certain degree by the characteristics of that person. These characteristics vary, for example verbal ability, quantitative ability, and psychomotor skill; such a characteristic is also called a trait, and a person's position on a trait can be used to estimate the magnitude of that person's ability. This trait is often expressed as the person's ability dimension. The three-parameter logistic model (3PL) uses parameter a (discriminating power), parameter b (difficulty level), and parameter c (guessing), where a person's probability of a correct response to a particular item is expressed as a function of ability; this expression is referred to as the Item Characteristic Curve (ICC). The two-parameter logistic model (2PL) uses parameters a and b, with the assumption that a person of very low ability has no chance of success in answering the item (c = 0). The one-parameter logistic model (1PL), also known as the Rasch model, uses only parameter b: parameter a is assumed to be equal to 1, and parameter c is considered to be zero (c = 0). The estimation of a person's ability and of the item parameters of the chosen model is obtained from the data provided by the respondents (test-takers).
IV. GUIDELINES FOR ITEM ANALYSIS USING THE RASCH MODEL

A. One Parameter Logistic (Rasch Measurement Model)
There are three widely used IRT models, namely the One-Parameter Logistic Model (1-PL), the Two-Parameter Logistic Model (2-PL), and the Three-Parameter Logistic Model (3-PL), each with its own parameters. One key component that distinguishes these models is the Item Characteristic Curve (ICC), which graphically displays the information of each item produced by IRT. The One-Parameter Logistic Model (1-PL), also known as the Rasch Model, is the most basic model in IRT and estimates only one parameter, the difficulty parameter (b) [14].

In the 1-PL, the item discrimination level (a) and the guessing probability (c) are assumed to be constant [15]. In the 1-PL model, the ICC for each item is given by Eq. (5):

Pi(θ) = e^(θ − bi) / (1 + e^(θ − bi)) (5)

where Pi(θ) is the probability that a student with ability θ responds to item i correctly, and bi is the level of difficulty of item i. The bi value typically ranges from -2 to 2 but can take more extreme values. As noted in [14], b and θ are scaled using a normal distribution with a standard deviation of 1.0 and a mean of 0.0; hence [15] presents two summaries of this equation:
(i) The easier the item is, the higher the probability that students will answer it correctly.
(ii) Students with high ability are more likely to answer questions correctly than students with low ability.
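A minimal Python sketch of the 1-PL ICC in Eq. (5); the ability grid and difficulties are illustrative assumptions, not the study's estimates:

```python
import numpy as np

def rasch_icc(theta: np.ndarray, b: float) -> np.ndarray:
    """Eq. (5): P(theta) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.linspace(-3, 3, 7)          # ability grid in logits
for b in (-1.0, 0.0, 1.0):             # hypothetical item difficulties
    probs = rasch_icc(theta, b)
    print(f"b = {b:+.1f}:", np.round(probs, 2))
# At theta == b the probability is exactly 0.5; easier items (lower b)
# shift the whole curve toward higher probabilities.
```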
B. Item Compatibility Level
An item fits when it behaves consistently with what is expected by the model. If questions are found that do not fit, this is an indication that there is a misconception among students about the item. The fit indices provided in the Rasch analysis are Person Infit ZSTD, Person Outfit ZSTD, Person Infit MNSQ, Person Outfit MNSQ, Item Infit ZSTD, Item Outfit ZSTD, Item Infit MNSQ, and Item Outfit MNSQ [16].

MNSQ (mean-square) values are always positive and range from zero (0) to infinity (∞). The MNSQ value is used to monitor the suitability of the data to the model, and the expected mean-square value is 1 (one). A mean-square value for infit or outfit higher than one, say 1.3, indicates that the observed data has 30% more variation than predicted by the Rasch model; an infit or outfit value less than 1, say 0.78 (1 − 0.22 = 0.78), indicates that the observed data has 22% less variation than predicted by the Rasch model [12]. At the same time, the expected value of z is close to 0 (zero). When the observed data follow the model, the z values have a mean approaching 0 and a standard deviation of 1. A ZSTD value that is too large (z > +2) or too low (z < -2) indicates that an item does not match the expected model. Standardized z values (ZSTD) for infit and outfit can be either positive or negative. A negative ZSTD value indicates less variation than the model expects: the response strings approach the Guttman-style pattern in which all subjects with high ability answer correctly and all subjects with low ability answer incorrectly. Positive values indicate more variation in responses than the model expects: the responses are irregular and unpredictable [12].

According to [16], the criteria used to check whether an item fits are:
1. An acceptable Outfit Mean Square (MNSQ) value: 0.5 < MNSQ < 1.5
2. An acceptable Outfit Z-standard (ZSTD) value: -2.0 < ZSTD < +2.0

If an item fulfils neither of the two criteria, the item is not good and needs to be revised or replaced. Unlike the level of difficulty of items, which is consistent, the fit level of an item is strongly influenced by the sample size. Answer-key errors, careless responding by a large number of participants, and questions with low discrimination can all reduce the fit values of items. Another thing to note is that the ZSTD value is sensitive to the number of samples: if the sample used is large (> 500), the ZSTD value tends to exceed 3. Therefore, some experts recommend not using the ZSTD criterion if the sample is large enough [17].
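To make the MNSQ diagnostics concrete, here is a hedged Python sketch of the standard outfit and infit mean-square computations for a single item under the Rasch model; the person measures and item difficulty are assumed toy values, whereas software such as jMetrik estimates them from the data:

```python
import numpy as np

def rasch_prob(theta: np.ndarray, b: float) -> np.ndarray:
    """Rasch probability of a correct response, as in Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_fit(x: np.ndarray, theta: np.ndarray, b: float) -> tuple:
    """Outfit and infit mean-squares for one item (expected value: 1.0).

    x: 0/1 responses of all persons to the item.
    theta: person measures in logits (toy values here).
    b: item difficulty in logits.
    """
    p = rasch_prob(theta, b)
    w = p * (1 - p)                         # model variance per response
    z2 = (x - p) ** 2 / w                   # squared standardized residuals
    outfit = z2.mean()                      # unweighted mean-square
    infit = ((x - p) ** 2).sum() / w.sum()  # information-weighted mean-square
    return outfit, infit

theta = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])  # assumed person measures
x = np.array([0, 0, 1, 1, 1])                  # responses to one item
outfit, infit = item_fit(x, theta, b=0.0)
print(f"outfit MNSQ = {outfit:.2f}, infit MNSQ = {infit:.2f}")
# Values near 1.0 (within 0.5-1.5 per [16]) indicate acceptable fit.
```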
C. Rasch Discrimination Power (Point Measure Correlation)
The Rasch discrimination power, the correlation between item scores and Rasch measure scores (Pt Measure Corr), is in principle the same as the item discrimination power measured by the CTT approach. The difference is that in classical test theory the computation uses raw scores, whereas Pt Measure Corr uses measure scores. A Pt Measure Corr value of 1.0 indicates that all examinees with low ability answer the item incorrectly and all test participants with high ability answer the item correctly. A negative Pt Measure Corr value indicates a misleading item, because examinees with low ability answer the item correctly while test participants with high ability answer incorrectly. Items with negative correlation values must be checked to see whether the answer key is wrong, needs to be revised, or should be deleted from the test [1].

As with classical test theory, the ideal correlation between item scores and Rasch measures is positive and not close to zero. Some experts have given opinions about how large the Pt Measure Corr should be. Alagumalai, Curtis, & Hungi (2005) classify these values as very good (> 0.40), good (0.30-0.39), sufficient (0.20-0.29), unable to discriminate (0-0.19), and requiring examination of the item (< 0).
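A brief Python sketch of this statistic as a plain Pearson correlation between one item's 0/1 scores and the person measures (toy values; Rasch software reports a closely related point-measure correlation):

```python
import numpy as np

def point_measure_corr(item_scores: np.ndarray,
                       person_measures: np.ndarray) -> float:
    """Pearson correlation between 0/1 item scores and person measures
    in logits; the sign and magnitude are read like Pt Measure Corr."""
    return float(np.corrcoef(item_scores, person_measures)[0, 1])

measures = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # assumed person measures
good_item = np.array([0, 0, 1, 1, 1])              # follows the ability order
miskeyed  = np.array([1, 1, 0, 0, 0])              # reversed: likely bad key
print(f"good item: r = {point_measure_corr(good_item, measures):+.2f}")
print(f"miskeyed:  r = {point_measure_corr(miskeyed, measures):+.2f}")
```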
D. Item Difficulty Level (Item Measure)
The level of difficulty of an item in the IRT model is based on the same quantity as in CTT, namely the proportion of correct answers among the responses to the item. The difference is that this probability value is rescaled through the logarithmic function: the logarithmic estimate of the odds ratio is called the measure value. Whereas in classical test theory a high difficulty index value means that the item is easy, in Rasch a high logit value indicates that the item is difficult. Just as in classical test theory, there is no fixed standard for what level of difficulty is acceptable in a test; this depends on the purpose of the test itself.
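The following Python fragment illustrates this log-odds rescaling; it is a simple PROX-style illustration and our own simplification, not the estimation procedure jMetrik actually runs:

```python
import numpy as np

def item_logit(p_correct: np.ndarray) -> np.ndarray:
    """Log-odds of an incorrect response: ln((1 - p) / p).

    High p (easy item) gives a negative logit; low p (hard item)
    gives a positive logit, matching the Rasch convention that
    higher measures mean harder items.
    """
    return np.log((1 - p_correct) / p_correct)

p = np.array([0.90, 0.70, 0.50, 0.30, 0.10])   # classical difficulty indices
for pi, m in zip(p, item_logit(p)):
    print(f"p = {pi:.2f} -> measure = {m:+.2f} logits")
```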
V. RESULT AND DISCUSSION

A. CTT analysis result
The results of item analysis using CTT consider three (3) parameters in judging the quality of the items used to assess students' abilities: item difficulty (p), item discrimination (D), and reliability (r). The results are presented in TABLE IV and TABLE V.

The summary statistics presented in TABLE IV show that, for the total of 120 items answered by 409 examinees, the mean score was 68.03 and the standard deviation was 12.52. The mean item difficulty and biserial are 0.57 and 0.35, respectively. These statistics reveal that the test has a sufficient reliability index according to CTT, because the index of 0.87 is higher than the recommended value of 0.70 [1].

TABLE IV. SUMMARY ITEM STATISTICS

Parameter | Value
Total Test Questions | 120
Number of participants | 409
Alpha Reliability Coefficient | 0.870
Average Participant Score | 68.034
Standard Deviation | 12.519
Item Difficulty Level | 0.567
Biserial Average | 0.347

The average item difficulty of 0.567 is within the standard required for moderately difficult items, with an average discrimination index of 0.347, requiring no revision of the test as a whole [1]. The results presented in TABLE V below show that the CTT item analysis found that 81 items (67.5%) have satisfactory item statistics (D > 0.19). These items satisfy the minimum requirements for inclusion in the final version of a test with at most minor revision. However, 39 items (32.5%) are, based on the established criteria, recommended for elimination from the analysis because D ≤ 0.19. This means that these 39 defective items are not appropriate and may not be included in the final draft test. The internal consistency reliability of the test items was assessed and found to be acceptable, with a Cronbach's alpha value of 0.870 (TABLE IV).

TABLE V. CTT ITEM ANALYSIS CHART

Discrimination | High Difficulty (<0.30) | Moderate (0.31-0.70) | Easy (>0.70) | Total
Excellent (D ≥ 0.40) | 74, 93 | 25, 28, 57, 63, 64, 66, 71, 72, 78, 81, 84, 88, 92, 95, 96, 104, 114 | 54, 75, 80, 82 | 23
Good (0.30 ≤ D ≤ 0.39) | 49, 67, 97, 105, 107, 113 | 20, 27, 38, 46, 48, 53, 59, 60, 79, 89, 101, 109 | 1, 22, 23, 24, 30, 32, 33, 35, 40, 43, 50, 76, 86, 90, 108, 65, 87 | 35
Marginal (0.20 ≤ D ≤ 0.29) | 69, 73, 103, 116 | 17, 55, 61, 62, 77, 91, 98, 110, 111, 117 | 15, 16, 21, 26, 29, 31, 44, 56, 58 | 23
Poor (D ≤ 0.19) | 3, 4, 6, 42, 45, 47, 68, 83, 99, 106, 115, 118, 119, 120 | 2, 5, 7, 8, 9, 10, 34, 36, 41, 52, 70, 85, 94, 100, 102, 112 | 11, 12, 13, 18, 19, 37, 39, 51 | 39

B. Rasch measurement results (Point Measure Correlation)
The accepted Point Measure Correlation values are 0.4 < Pt Measure Corr < 0.85. Because the point measure correlation is in principle the same as the point-biserial correlation in classical test theory, [1] classify the Point Measure Correlation values as very good (> 0.40), good (0.30-0.39), sufficient (0.20-0.29), unable to discriminate (0-0.19), and requiring examination of the item (< 0).

TABLE VI. RASCH ITEM ANALYSIS CHART

Discrimination | High Difficulty (<0.30) | Moderate (0.31-0.70) | Easy (>0.70) | Total
Excellent (D ≥ 0.40) | 74 | 25, 57, 66, 84, 88, 104, 114 | 54, 75, 80, 90 | 12
Good (0.30 ≤ D ≤ 0.39) | 93, 105, 115 | 20, 27, 28, 38, 48, 63, 64, 72, 78, 89, 95, 96, 109 | 23, 24, 32, 33, 40, 43, 65, 82, 86, 108, 87 | 27
Marginal (0.20 ≤ D ≤ 0.29) | 49, 67, 107, 113, 118 | 17, 46, 53, 59, 60, 62, 77, 79, 101, 110 | 16, 21, 22, 26, 29, 30, 31, 50, 56, 58, 76 | 26
Poor (D ≤ 0.19) | 3, 4, 6, 42, 45, 47, 69, 73, 97, 99, 103, 116, 118, 119, 68, 83, 106, 120 | 52, 55, 2, 5, 7, 8, 9, 10, 34, 36, 41, 61, 85, 91, 98, 111, 112, 117, 70, 94, 100, 102 | 1, 11, 15, 18, 35, 39, 44, 51, 12, 13, 14, 19, 37 | 55

C. Item Fit Level
According to [16], the outfit mean-square, outfit z-standard, and point measure correlation values are the criteria used to see the level of conformity of items. If there are items that do not meet the criteria, the item should be repaired or replaced. The guidelines for assessing the item conformity criteria according to Boone et al. (2014) are as follows:
▪ Accepted Outfit Mean Square (MNSQ) value: 0.5 < MNSQ < 1.5
▪ Accepted Z-standard (ZSTD) outfit value: -2.0 < ZSTD < +2.0

TABLE VII. ITEM STATISTICS: OUTFIT MEAN SQUARE (MNSQ) VALUE

MNSQ value | Items | Total
0.5 < MNSQ < 1.5 (accepted) | All items except the failed items below | 92
Outside 0.5-1.5 (failed) | 2, 3, 4, 5, 6, 7, 8, 9, 10, 34, 36, 37, 41, 45, 52, 70, 83, 94, 99, 100, 102, 106, 111, 112, 115, 116, 118, 119 | 28

TABLE VIII. ITEM STATISTICS: Z-STANDARD (ZSTD) OUTFIT VALUE

ZSTD value | Items | Total
-2.0 < ZSTD < +2.0 (accepted) | All items except the failed items below | 85
Outside -2.0 to +2.0 (failed) | 2, 5, 6, 7, 8, 9, 10, 25, 27, 34, 36, 41, 45, 52, 54, 57, 66, 70, 71, 72, 74, 75, 78, 81, 84, 88, 92, 94, 95, 100, 102, 104, 111, 114, 115 | 35

D. Misfitting or problematic items by CTT and Rasch
The misfitting items, otherwise known as problematic or defective items, were identified using the two approaches. The problematic items identified by each framework, and the items identified in common, are presented in TABLE IX.

TABLE IX. PROBLEMATIC ITEMS DETECTED BY CTT AND RASCH

Model | Number Detected | Items Deleted
Rasch | 68 | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 18, 19, 25, 27, 34, 35, 36, 37, 39, 41, 42, 44, 45, 47, 51, 52, 55, 61, 68, 69, 70, 71, 72, 73, 74, 75, 78, 81, 83, 84, 85, 88, 91, 92, 94, 95, 97, 98, 99, 100, 102, 103, 104, 106, 111, 112, 114, 115, 116, 117, 118, 119, 120
CTT | 39 | 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 18, 19, 34, 36, 37, 39, 41, 42, 45, 47, 51, 52, 68, 70, 83, 85, 94, 99, 100, 102, 106, 112, 115, 118, 119, 120
Common items detected | 28 | 1, 14, 15, 25, 27, 35, 44, 55, 61, 69, 71, 72, 73, 74, 75, 78, 81, 84, 88, 91, 92, 95, 97, 98, 102, 103, 104, 111, 114, 116, 117

After analysis based on the two methods, the classical test theory model produced 39 items in the poor discrimination index category (D ≤ 0.19), while under the Rasch model 68 items did not meet the criteria; the two approaches also identified 28 problem items in common. These results indicate that more items are recommended for deletion by Rasch than by CTT, which can be linked to the procedures followed by the two frameworks in determining the quality of test items.

Whereas CTT relies on two parameters, namely item difficulty and discrimination, Rasch is not limited to these item parameters: discrimination power and item fit also contribute to the evaluation of inappropriate items. For example, item 93 was identified by both Rasch and CTT as a difficult item, yet CTT classifies it as an excellent item, because the discrimination index ignores the level of difficulty.

Looking at the results, some items that were not identified as inappropriate by the CTT were classified as misfitting by Rasch, which provides more detailed information based on the ability of the participants. While a participant's ability in CTT is determined based on the raw (total) score on the test, Rasch's interpretation of the participant's ability is based on the participant's responses to difficult and easy items.
In CTT, students with the same total score will be interpreted to have the same ability. In IRT, however, students with the same total score will be interpreted to have different abilities if one scores more on easier items and the other scores more on difficult questions: students who answer more difficult questions correctly will be interpreted to have higher abilities. Whereas the CTT difficulty score of an item indicates how difficult or easy the item is for the group of examinees, the Rasch measurement provides a better interpretation of the spread of item difficulty relative to the test participants' levels of ability. Rasch makes this feasible through its mapping facilities [18].

VI. CONCLUSION

The main objective of this study is to provide empirical evidence of the validity of the construct, as well as the reliability, of the Student Entrance Examination Test developed for State Universities, using traditional Classical Test Theory and the Rasch Measurement Model (RMM). More important is to identify the suitable or unsuitable, good or bad, items that will be maintained in or eliminated from the test when the two frameworks, CTT and RMM, are used, and then to identify the strengths and weaknesses of each of the two approaches in test development and validation.

The findings of this study indicate that more items were recommended for removal by Rasch than by CTT, which might be related to the techniques followed by the two approaches in determining the characteristics of test items. Whereas CTT depends on the two parameters of item difficulty and discrimination, Rasch is not limited to item parameters: besides the item parameters, the reliability of persons, item maps, fit statistics, and distractor behaviour all contribute to the assessment of item incompatibility. Based on these findings, the selection of psychometric procedures depends on many elements. Still, the interpretation using Rasch provides more detailed information about the structure of the items, which is needed for a valid assessment of the students' ability and of the items' suitability for measuring the desired outcomes.
ACKNOWLEDGMENT
The authors would like to sincerely thank Institut
Teknologi Sepuluh Nopember, the Directorate of Higher
Education, Indonesian Ministry of Education and Culture, and
LPDP through the RISPRO Invitation Program for funding the
research.

REFERENCES

1. Ado Abdu Bichi, R.T., Noor Azean Atan, Halijah Ibrahim, and Sanitah Mohd Yusof, Validation of a developed university placement test using classical test theory and Rasch measurement approach. International Journal of Advanced and Applied Sciences, 2019. 6(6): p. 22-29.
2. Ado Abdu Bichi, R.T., Rahimah Embong, Hasnah Binti Mohamed, Mohd Sani Ismail, and Abdallah Ibrahim, Rasch-Based Objective Standard Setting for University Placement Test. Eurasian Journal of Educational Research, 2019. 19(84): p. 1-14.
3. Hidayati, K., Keakuratan Hasil Analisis Butir Menurut Teori Tes Klasik dan Teori Respons Butir Ditinjau dari Ukuran Sampel [The accuracy of item analysis results according to classical test theory and item response theory in terms of sample size]. 2002.
4. Wahyuni, K.M.d.S., Analisis Kemampuan Peserta Didik Dengan Model Rasch [Analysis of students' abilities with the Rasch model], in Seminar Nasional Evaluasi Pendidikan. 2014, Semarang. p. 9.
5. Reise, S.P. and Waller, N.G., Item Response Theory and Clinical Measurement. Annual Review of Clinical Psychology, 2009: p. 27-48.
6. Petrillo, J., Cano, S.J., McLeod, L.D., and Coon, C.D., Using classical test theory, item response theory, and Rasch measurement theory to evaluate patient-reported outcome measures: a comparison of worked examples. Value in Health, 2015. 18(1): p. 25-34.
7. Afraa Musa, S.S., Abdelmoniem Elmardi, and Ammar Ahmed, Item difficulty & item discrimination as quality indicators of physiology MCQ examinations at the Faculty of Medicine, Khartoum University. Khartoum Medical Journal, 2018. 11: p. 1477-1486.
8. Risa Syukrinda, W.R., Scoring on Multiple-Choice Test and Achievement Motivation on Geography Learning Outcomes. American Journal of Educational Research, 2016. 4(15).
9. Drost, E., Validity and Reliability in Social Science Research. Education Research and Perspectives, 2011. 38: p. 105-124.
10. Setyawarno, D., Penggunaan Aplikasi Software Iteman (Item and Test Analysis) untuk Analisis Butir Soal Pilihan Ganda Berdasarkan Teori Tes Klasik [Using the Iteman (Item and Test Analysis) software application for multiple-choice item analysis based on classical test theory]. Ilmu Fisika dan Pembelajarannya, 2017. 1.
11. Jinnie Shin, Q. Guo, and M.J. Gierl, Multiple-Choice Item Distractor Development Using Topic Modeling Approaches. Frontiers in Psychology, 2019. 10.
12. Andri Syawaludin, Y.S., and W.R., RASCH Model Application for Validation of Measurement Instruments of Student Nationalism. International Conference on Education, 2019. 5(2): p. 26-42.
13. Mahmud, J., Item response theory: A basic concept. Academic Journals, 2017. 12(5): p. 258-266.
14. Tobore, M.A.-i. and Andrew Igho Joe, Development and Standardization of Adolescents' Social Anxiety Scale Using the One-Parameter Logistic Model of Item Response Theory. International Journal of Innovative Social & Science Education Research, 2018. 6: p. 70-76.
15. Magno, C., Demonstrating the Difference between Classical Test Theory and Item Response Theory Using Derived Test Data. The International Journal of Educational and Psychological Assessment, 2009. 1(1): p. 1-11.
16. Boone, W.J., Staver, J.R., and Yale, M.S., Rasch Analysis in the Human Sciences. 2014, Dordrecht: Springer.
17. Sumintono, B. and Widhiarso, W., Aplikasi Model Rasch untuk Penelitian Ilmu Sosial [Application of the Rasch model for social science research]. 2014.
18. Adibah Binti Abd Latif, N.F.M.A., Wilfredo Herrera Libunao, Ibnatul Jalilah Yusof, and Siti Sarah Yusri, Multiple-choice items analysis using classical test theory and Rasch measurement model. Man in India, 2016. 96: p. 173-181.
