Item Analysis For Examination Test in The Postgraduate Student's Selection With Classical Test Theory and Rasch Measurement Model
Kelly R. Sungkono
Department of Informatics Engineering
Institut Teknologi Sepuluh Nopember
Surabaya, Indonesia
[email protected]
Abstract— University entrance exams are conducted to ensure that qualified applicants are placed into the program of their choice. Test results carry significant weight in making the right decision about an applicant's suitability, so the validity of the exam is essential to achieving the objectives set. The purpose of this study is to provide empirical evidence for the construct validity of the newly developed Academic and English admission tests using Classical Test Theory and the Rasch Measurement Model. The postgraduate admission test consists of 120 multiple-choice items with five answer options (A-E); it was developed and assessed by experts who are competent in their fields, and the questions were administered to 409 postgraduate entrance-exam participants. The software applications used for CTT and the Rasch Model are ITEMAN version 3 and JMETRIK version 4 for Windows, both of which are license-free. The software automatically generates parameter estimates for assessing the quality of test items. The CTT results identified 39 questionable items based on the difficulty and discrimination indices. The Rasch results show that the person statistics (separation 2.55 > 2.00 and reliability 0.87 > 0.80) and the item statistics (separation 9.4 > 3.0 and reliability 0.99 > 0.80) indicate excellent person and item reliability. Overall, the Rasch model identified 68 misfitting or irrelevant items that are suggested for removal. While CTT provides limited information based on two parameters, the Rasch results provide very detailed information about the quality of the items being tested. Thus, the two models can be integrated to produce sufficient evidence of item validity and reliability in the development of standardized tests. The two approaches identified 28 problem items in common. These results indicate that more items are recommended for removal by the Rasch model than by CTT, which can be linked to the procedures followed by the two frameworks in determining the quality of test items.

Keywords— CTT, Rasch Model, Item Analysis

I. INTRODUCTION

To obtain high-quality question instruments, empirical analysis is necessary in addition to theoretical analysis (item review). This empirical item analysis can be divided into two approaches, namely classical test theory and item response theory (IRT) [1]. A test is a measurement technique designed as a systematic procedure for studying the behaviour of individuals or groups of individuals [2]. Two analytical methods are generally used in developing tests: traditional or standard item analysis based on Classical Test Theory (CTT), and modern interpretation based on item response theory (IRT). These processes generally follow the identification of the objectives of the test and the preparation of a pool of items. To produce tests for educational measurement, the criteria and guidelines established for the development of valid and reliable tests must be followed adequately; this provides accurate information for test construction and use [1].

Analysis of test instruments in education can be done through two approaches. The first approach is the most common and is still widely applied in education, especially in research, namely classical test theory (CTT). This statement follows the report [3], entitled "the accuracy of the results of item analysis according to classical test theory and item response theory in terms of sample size," which observes that classical test theory (CTT) remains a popular analytical technique. The conventional test theory developed by Charles Spearman in 1904 can be used to predict the results of an exam. In classical test theory, the aspects that largely determine the quality of the items are the level of difficulty and the discriminating power of the questions. However, the item characteristics produced by classical test theory are inconsistent (changing), depending on the ability of the test-takers. According to [4], measurement errors in classical test theory can only be estimated for groups, not individuals. The second approach is a modern one using the Rasch model, coined by Dr Georg Rasch, a Danish mathematician. Rasch modelling exists to overcome
weaknesses in classical test theory. Rasch modelling provides a different approach to the use of exam scores or raw data in the context of educational assessment. The aim is to produce a measurement scale with equal intervals that can provide accurate information about the test-takers as well as the quality of the questions being answered. In other words, Rasch analysis produces information about the characteristics of items and students placed on the same metric [5]. In this study, a comparative analysis of the quality of test instruments is carried out on the elements of validity, reliability, level of difficulty, and discriminating power of the questions through the two approaches described above, namely classical test theory and the Rasch model.

II. RELATED WORK

A. Classical Test Theory (CTT)
Classical test theory adopts a deterministic approach (certainty) in which the main focus of the analysis is the individual's total score (X). Every test carries an error (E) that accompanies each measurement result when measuring human attributes. True scores (T) and errors (E) are both latent variables, but the purpose of testing is to infer the individual's true score. The score of each item can also be determined as right or wrong: for example, a correct answer is given a score of 1 and an incorrect answer a score of 0. IRT, by contrast, focuses on the probability of answering each item; it does not assess a person by the total score but considers the person's response at the item level. Scoring is likewise not a matter of assigning 1 or 0, but of modelling the probability of the person obtaining a score of 1 or 0. The CTT model is represented mathematically in Eq. (1):

X = T + E (1)

This assumption states that the relationship between the observed score (X), the true score (T), and the measurement error (E) is additive: the observed score (X) obtained by an individual is the sum of the true score (T) and the measurement error (E).

Even though the level of difficulty of questions and the discriminating power of items are calculated separately, in the evaluation of both, the item is seen as a single component that determines whether it is considered good or not. The third parameter, the effectiveness of the distractors, applies only to multiple-choice questions.

C. Level of difficulty
The item difficulty index, as stated by [7], is the "proportion of examinees who get that item correct." In other words, the difficulty level of a test item is a number that shows the proportion of test participants who answer the question correctly, while the difficulty level of the test set is a number that shows the average percentage of participants who can answer the whole set. The formula used to determine the level of difficulty is given in Eq. (2):

p = nB / n (2)

where:
p: level of difficulty of the test item
nB: number of subjects answering correctly
n: total number of subjects

The model states that the difficulty of an item (p) is determined by the number of participants who answered the question correctly divided by the total number of participants. As stated by Allen & Yen, a good item has a difficulty between 0.3 and 0.7. Items with difficulty levels below 0.3 are considered difficult, whereas items with an index above 0.7 are deemed easy [7]. The difficulty level (P) criteria can therefore be written as in TABLE I.

TABLE I. INTERPRETATION OF ITEM DIFFICULTY INDEX
Difficulty Index      Interpretation
P ≤ 0.30              Difficult
0.31 ≤ P ≤ 0.70       Moderately difficult
P > 0.70              Relatively easy
Marginal (0.20 ≤ D ≤ 0.29): 49, 67, 107, 113, 118; 17, 46, 53, 59, 60, 62, 77, 79, 101, 110; 16, 21, 22, 26, 29, 30, 31, 50, 56, 58, 76 (26 items)
Poor (D ≤ 0.19): 3, 4, 6, 42, 45, 47, 69, 73, 97, 99, 103, 116, 118, 119, 68, 83, 106, 120; 52, 55, 2, 5, 7, 8, 9, 10, 34, 36, 41, 61, 85, 91, 98, 111, 112, 117, 70, 94, 100, 102; 1, 11, 15, 18, 35, 39, 44, 51, 12, 13, 14, 19, 37 (55 items)

After being analyzed by the two methods, classical test theory produced 39 items in the poor discrimination-index category (≤ 0.19), while the Rasch model identified 68 items that did not meet the criteria. The two approaches identified 28 problem items in common. These results indicate that more items are recommended for deletion by the Rasch model than by CTT, which can be linked to the procedure followed by the two frameworks in
determining the quality of test items.

C. Item Fit Level
According to [16], the outfit mean-square, outfit z-standard, and point-measure correlation values are the criteria used to judge item fit. If an item does not meet the criteria, it should be revised or replaced. The guidelines for assessing item-fit criteria according to Boone et al. [16] are as follows:
▪ Accepted outfit mean-square (MNSQ) value: 0.5 < MNSQ < 1.5
▪ Accepted outfit Z-standard (ZSTD) value: -2.0 < ZSTD < +2.0

Whereas CTT relies on two parameters, namely item difficulty and discrimination, Rasch is not limited to item parameters: discrimination power and item fit all contribute to the identification of inappropriate items. For example, item 93 was identified by both Rasch and CTT as a difficult item, but CTT classifies it as an excellent item because the discrimination index ignores the level of difficulty.

Looking at the results, some items that were identified as inappropriate by CTT were classified as acceptable by Rasch, which provides more detailed information based on the ability of the participants.
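The Boone et al. cut-offs above amount to a simple range check per item. The sketch below is illustrative only; the MNSQ/ZSTD values are invented, not actual output from JMETRIK.

```python
# Boone et al. (2014) item-fit criteria: 0.5 < MNSQ < 1.5 and -2.0 < ZSTD < +2.0
def fits(mnsq, zstd):
    """Return True if the item meets both outfit criteria."""
    return 0.5 < mnsq < 1.5 and -2.0 < zstd < 2.0

# Hypothetical (item number, outfit MNSQ, outfit ZSTD) triples
items = [(1, 0.95, 0.3), (2, 1.72, 2.8), (3, 0.41, -2.4)]
misfits = [i for i, mnsq, zstd in items if not fits(mnsq, zstd)]
print(misfits)  # prints: [2, 3] -- items to be revised or replaced
```

In practice these statistics come from the Rasch estimation software; the check itself is the same range test shown here.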
While a participant's ability in CTT is determined from the raw (total) score on the test, Rasch interprets the participant's ability from the participant's responses to difficult and easy items. In CTT, students with the same total score are interpreted as having the same ability. In IRT, however, students with the same total score can be interpreted as having different abilities if one scores on more difficult items and the other on easier ones: students who answer more difficult questions correctly are interpreted as having higher ability. Whereas the CTT difficulty score of an item indicates how difficult or easy the item is for the group of examinees as a whole, the Rasch measurement provides a better interpretation of the spread of item difficulty relative to the test participants' level of ability. Rasch makes this feasible through its mapping facilities [18].

VI. CONCLUSION

The main objective of this study is to provide empirical evidence of the construct validity as well as the reliability of the Student Entrance Examination Test developed for state universities, using traditional Classical Test Theory and the Rasch Measurement Model (RMM). More importantly, it aims to identify the suitable/unsuitable, or good or bad, items to be retained or eliminated from the test when the CTT and RMM frameworks are used, and then to identify the strengths and weaknesses of each of the two approaches in test development and validation.

The findings of this study indicate that the larger number of items recommended for removal by Rasch than by CTT might be related to the technique each approach follows in determining the characteristics of test items. Whereas CTT depends on two parameters, item difficulty and discrimination, Rasch is not limited to item parameters; besides item parameters, person reliability, item maps, fit statistics, and distractor analysis all contribute to the assessment of item misfit. Based on these findings, the selection of psychometric procedures depends on many factors. Still, the Rasch interpretation provides more detailed information about the structure of the items, which is needed for a valid assessment of students' ability and of the items' suitability for measuring the desired outcomes.

ACKNOWLEDGMENT

The authors would like to sincerely thank Institut Teknologi Sepuluh Nopember, the Directorate of Higher Education, Indonesian Ministry of Education and Culture, and LPDP through the RISPRO Invitation Program for funding the research.

REFERENCES
1. Ado Abdu Bichi, R.T., Noor Azean Atan, Halijah Ibrahim, Sanitah Mohd Yusof, Validation of a developed university placement test using classical test theory and Rasch measurement approach. International Journal of Advanced and Applied Sciences, 2019. 6(6): p. 22-29.
2. Ado Abdu Bichi, R.T., Rahimah Embong, Hasnah Binti Mohamed, Mohd Sani Ismail, Abdallah Ibrahim, Rasch-Based Objective Standard Setting for University Placement Test. Eurasian Journal of Educational Research, 2019. 19(84): p. 1-14.
3. Hidayati, K., Keakuratan Hasil Analisis Butir Menurut Teori Tes Klasik dan Teori Respons Butir Ditinjau dari Ukuran Sampel. 2002.
4. Wahyuni, K.M.d.S., Analisis Kemampuan Peserta Didik Dengan Model Rasch, in Seminar Nasional Evaluasi Pendidikan. 2014, Semarang. p. 9.
5. Reise, S.P. and Waller, N.G., Item Response Theory and Clinical Measurement. Annual Review of Clinical Psychology, 2009: p. 27-48.
6. Petrillo, J., Cano, S.J., McLeod, L.D., Coon, C.D., Using classical test theory, item response theory, and Rasch measurement theory to evaluate patient-reported outcome measures: a comparison of worked examples. Value in Health, 2015. 18(1): p. 25-34.
7. Afraa Musa, S.S., Abdelmoniem Elmardi, Ammar Ahmed, Item difficulty & item discrimination as quality indicators of physiology MCQ examinations at the Faculty of Medicine Khartoum University. Khartoum Medical Journal, 2018. 11: p. 1477-1486.
8. Risa Syukrinda, W.R., Scoring on Multiple-Choice Test and Achievement Motivation on Geography Learning Outcomes. American Journal of Educational Research, 2016. 4(15).
9. Drost, E., Validity and Reliability in Social Science Research. Education Research and Perspectives, 2011. 38: p. 105-124.
10. Setyawarno, D., Penggunaan Aplikasi Software Iteman (Item and Test Analysis) untuk Analisis Butir Soal Pilihan Ganda Berdasarkan Teori Tes Klasik. Ilmu Fisika dan Pembelajarannya, 2017. 1.
11. Jinnie Shin, Q.G.a.M.J.G., Multiple-Choice Item Distractor Development Using Topic Modeling Approaches. Frontiers in Psychology, 2019. 10.
12. Andri Syawaludin, Y.S.a.W.R., RASCH Model Application for Validation of Measurement Instruments of Student Nationalism. International Conference on Education, 2019. 5(2): p. 26-42.
13. Mahmud, J., Item response theory: A basic concept. Academic Journals, 2017. 12(5): p. 258-266.
14. Tobore, M.A.-i. and Joe, A.I., Development and Standardization of Adolescents' Social Anxiety Scale Using the One-Parameter Logistic Model of Item Response Theory. International Journal of Innovative Social & Science Education Research, 2018. 6: p. 70-76.
15. Magno, C., Demonstrating the Difference between Classical Test Theory and Item Response Theory Using Derived Test Data. The International Journal of Educational and Psychological Assessment, 2009. 1(1): p. 1-11.
16. Boone, W.J., Staver, J.R., Yale, M.S., Rasch Analysis in the Human Sciences. 2014, Dordrecht: Springer.
17. Widhiarso, B.S.W., Aplikasi Model Rasch Untuk Penelitian Ilmu Sosial. 2014.
18. Adibah Binti Abd Latif, N.F.M.A., Wilfredo Herrera Libunao, Ibnatul Jalilah Yusof and Siti Sarah Yusri, Multiple-choice items analysis using classical test theory and Rasch measurement model. Man in India, 2016. 96: p. 173-181.