NIM : 22178011
Subject : Language Teaching Evaluation
In measuring and evaluating student performance, there are three theories to choose from: Classical Test Theory (CTT), the first and most commonly used; Item Response Theory (IRT), the most prominent in recent years; and, last but not least, Generalizability Theory (GT). The CTT is typically used to analyze test results in language schools, and estimates of students' language proficiency are mostly based on those performance scores.
It should also be remembered that, in addition to the test item parameters, other important factors such as the learners' language level need to be considered. To examine students' language skills in a language teaching program with more than 3000 students enrolled at two Turkish state universities, the current research examined the dimensionality of scores obtained from two different tests designed for learners at two different language levels (low-intermediate and intermediate). These tests were administered in the 2018–19 academic year.
To determine whether the CTT and the IRT provide significantly different estimates of language test scores for language learners, the analyses pertaining to those estimates were examined. Without a thorough analysis that takes item parameters into account, it would be impossible to tell whether each item on the language tests used in the study carries the same weight in the scores, or whether the chosen measurement approach (CTT or IRT) would significantly affect how well students performed.
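To make the contrast concrete: under the CTT, a learner's total score is simply the unweighted sum of item scores, whereas a three-parameter IRT model ties the probability of answering an item correctly to the learner's ability and to item-level parameters. The notation below is a standard textbook formulation, not one taken from the study's own analyses.

```latex
% CTT: the total score X is the unweighted sum of n dichotomous item scores X_i (0 or 1)
X = \sum_{i=1}^{n} X_i
% IRT (three-parameter logistic model): probability that a learner with ability \theta
% answers item i correctly, given discrimination a_i, difficulty b_i, and guessing c_i
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```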
Literature Review
Studies comparing the CTT and the IRT have produced varying results. They are presented here from the oldest to the most recent findings. The scoring system
based on the IRT was shown to be more predictive for both males and females than the CTT when Young (1991) first compared it to the traditional cumulative score computation of the CTT for predicting the abilities of male and female students. Next, Gelbal (1994) contrasted the ability parameters derived from the IRT Rasch model with those from the CTT, using the results of achievement tests created for fifth-grade Turkish and mathematics classes in primary schools. The results showed a strong positive association between the ability estimates derived from the two theories, and he concluded that this relationship grew stronger as the students' ability levels increased. Fan (1998) found no statistically significant difference between the distinctive measures of the IRT and the CTT, and concluded that neither theory could be favoured over the other across various assessment and evaluation situations. Last but not least, Akyildiz and Sahin (2017) contrasted the CTT and the IRT on a faculty achievement exam used in open education, and reported that the results fit the multidimensional logistic model more closely.
In conclusion, the literature review has demonstrated that no definite conclusion can be drawn about the superiority of either the IRT or the CTT. While some studies revealed a strong link between the two, others revealed discrepancies between these two methods of assessment. It should be noted, nonetheless, that practically all researchers advised using the IRT for item selection, since it reflects achievement differences better than the CTT. Additionally, after seeing the findings of the current study, practitioners can adopt alternative test development approaches and contrast them with the ones they already employ. Finally, this study is valuable to scholars since it explored multidimensionality and provided an analysis of how to estimate language proficiency for large-scale testing.
Methodology
This study was designed as exploratory research to determine which measurement model best fits the test data obtained from language learners; exploratory studies investigate matters that are still unclear to researchers. The CTT was used to assess language learners' skills on two multiple-choice language proficiency tests, and the same analysis was repeated on the basis of the IRT to compare its results with those of the CTT. In other words, the researcher applies the conventional method and evaluates its adequacy by comparing it with an alternative method. The data consisted of the language achievement and proficiency examination results of 2032 low-intermediate and intermediate learners, collected in the Spring Term of the 2018-2019 school year from a Turkish language school that serves two different state universities.
The researcher stored the data on two separate memory cards. The students' test results were then coded and grouped with the SPSS 26 data analysis application. First, the data of 2097 students were pulled from the database and thoroughly reviewed. The test results of 65 students were removed from the data set because their score sets contained missing values. As a result, the test data of 2032 students were deemed adequate and divided into two independent score sets (end-of-course exam and proficiency test). The original form of each item was then re-coded into a 1-0 matrix using the relevant answer keys for each test's booklet codes; a similar re-coding of achievement scores was reported by Akyildiz and Sahin (2017). Finally, the correctness and dependability of the data were checked and reported by comparing each participant's total score with the total score stated in the original database score set.
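A minimal sketch of this preparation step is given below, assuming the raw responses and answer keys are available as hypothetical CSV files; the file names and column labels are illustrative, not those of the study's actual database.

```python
import pandas as pd

# Hypothetical input: one row per student, one column per item with the chosen option (A-D),
# plus a 'booklet' column identifying the test booklet code.
responses = pd.read_csv("raw_responses.csv")
# Hypothetical answer keys: one row per booklet, one column per item with the correct option.
keys = pd.read_csv("answer_keys.csv").set_index("booklet")

item_cols = [c for c in responses.columns if c.startswith("item_")]

# Drop students whose score sets contain missing values (65 cases in the study).
responses = responses.dropna(subset=item_cols)

# Re-code each item into a 1-0 matrix: 1 if the answer matches the booklet's key, else 0.
key_for_student = keys.loc[responses["booklet"], item_cols].to_numpy()
scored = (responses[item_cols].to_numpy() == key_for_student).astype(int)
scored = pd.DataFrame(scored, columns=item_cols, index=responses.index)

# Total scores can then be checked against the totals stored in the original database score set.
scored["total"] = scored[item_cols].sum(axis=1)
```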
Data Analysis
Students' raw test results from the proficiency tests and the end-of-course exams were examined to compare the CTT and the IRT models and reach a conclusion. The IRT model with the better model fit was then used to derive ability estimates with the IRTPRO program. The correlation coefficients between the two testing models were calculated to determine whether or not these ability estimates differed considerably. To answer the final research question, correlation coefficient tests relating to the test booklet type were used to examine the relationships between the item parameter estimates obtained under both the CTT and the IRT; in this way it was investigated whether or not the booklet type significantly affected the item parameters in either testing model. The item parameters, which are most emphasized in the IRT and were obtained from the data set in the following order, are:
1. Item discrimination,
2. Item difficulty,
3. Probability of answering the question correctly by chance.
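As an illustration of the CTT side of this comparison, the classical analogues of these parameters can be computed directly from the 1-0 matrix: item difficulty as the proportion of correct answers and item discrimination as the item-total (point-biserial) correlation. The sketch below assumes a scored matrix like the one prepared earlier and an array of IRT ability estimates exported from IRTPRO; the function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def ctt_item_parameters(scored: np.ndarray):
    """scored: a students x items matrix of 0/1 responses."""
    totals = scored.sum(axis=1)
    difficulty = scored.mean(axis=0)  # CTT item difficulty: proportion of correct answers
    discrimination = np.array([
        # CTT item discrimination: point-biserial correlation with the rest-of-test score
        stats.pointbiserialr(scored[:, i], totals - scored[:, i])[0]
        for i in range(scored.shape[1])
    ])
    return difficulty, discrimination

def compare_ability_estimates(ctt_totals: np.ndarray, irt_theta: np.ndarray):
    """Pearson correlation between CTT raw totals and IRT ability estimates (e.g. from IRTPRO)."""
    r, p = stats.pearsonr(ctt_totals, irt_theta)
    return r, p  # a strong, significant r indicates the two models rank learners similarly
```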
Findings
The DIMTEST determines whether, when the test items are separated into two groups, there is a substantial difference between the two groups of items in terms of test correlation. The first group examined in the study consisted of the items with a stronger correlation than the others; the remaining items formed the second group. Whether there is a substantial difference between these two item groups can then be determined with the DIMTEST. The outcome of the DIMTEST analysis, the T value, and its significance are reported. If the T value is significant, it is concluded that there is a substantial difference between the two sets of items and that the measuring instrument contains more than one dimension. In this Findings section, the researcher presents the data in the form of tables:
1. DIMTEST results
2. NOHARM Dimensionality Analysis
3. IRT Data-Model Fit Coefficients
4. IRT and CTT correlation coefficients according to language test scores
5. IRT and CTT correlation coefficients according to different test booklets
6. Correlation coefficients between the IRT and CTT according to language
levels
According to the findings of the DIMTEST analysis, the proficiency test results were multidimensional whereas the end-of-course test results were unidimensional. Thus, it was accepted that the IRT should be used to analyze the proficiency exam data set. Although the DIMTEST is presented as a more effective dimensionality analysis technique than traditional techniques such as Exploratory Factor Analysis (EFA), it essentially uses a linear approach to model the relationships between items and student abilities. For this reason, the researcher also used the NOHARM test, which allows an analysis more appropriate to the IRT's main objective. The dimensionality analyses were repeated in light of the possibility that the relationship between students' responses to the questions and their language abilities was curvilinear rather than linear.
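The linear versus curvilinear distinction drawn here can be illustrated with a small sketch that fits both a straight line and a logistic (S-shaped) curve to the relationship between ability and the probability of answering one item correctly. The simulated data and parameter values below are purely illustrative, not the study's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Simulate ability values and 0/1 responses to a single item via a 2PL-like process.
theta = rng.normal(size=500)
a, b = 1.2, 0.3                                    # illustrative discrimination and difficulty
p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
y = rng.binomial(1, p_correct)

X = theta.reshape(-1, 1)
linear = LinearRegression().fit(X, y)              # linear assumption (EFA/DIMTEST-style)
logistic = LogisticRegression().fit(X, y)          # curvilinear assumption (IRT-style)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(linear.predict(grid))                        # can fall outside the 0-1 range at the extremes
print(logistic.predict_proba(grid)[:, 1])          # stays within 0-1, S-shaped like an item curve
```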
The end-of-course test was found to be multidimensional, and the correlation coefficients calculated between the ability measures according to the CTT and the IRT, taking into account the participants' language levels, varied between 0.850 and 0.877 and were statistically significant (p < .01). As a result, it was determined that the ability measures of the intermediate students in the end-of-course exam exhibited only slight variations depending on the estimation technique. The proficiency test was likewise found to be multidimensional, with correlation coefficients varying between 0.831 and 0.853; all of these coefficients, calculated between the ability measures according to the CTT and the IRT and taking the participants' language levels into account, were statistically significant (p < .01). The low-intermediate language level showed the relatively larger correlation coefficients for both the first and the second dimension.
It was determined that the proficiency exam, which was taken by more than 2000 students learning English as a second language, was multidimensional in nature. This conclusion should be taken into account by language school testing teams, and IRT-based assessment and evaluation models should be developed to measure students' language abilities more effectively. The importance of proficiency assessments in determining students' real language competency is emphasized by Bachman (2004). Whether or not an inaccuracy leads a student to pass or fail, it might have a negative impact on their entire life and cause additional problems.
The analyses then showed a positive correlation between the total scores of the end-of-course and proficiency exams, calculated by allocating 1 point to each item in accordance with the CTT, and the language ability measures estimated with the IRT approach, which assigns a different value to each question by taking into account the student's ability, the item's difficulty, the item's discrimination quality, and the likelihood that the question would be answered correctly by chance. This result is consistent with the findings of MacDonald and Paunonen (2002), who, in light of this fact, recommended using the IRT particularly when creating item sets, because the IRT allows for the detection of dimensionality.
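A small sketch of this scoring difference is given below, using invented three-parameter item values rather than those estimated in the study: two response patterns with identical CTT totals can yield different IRT ability estimates because each item is weighted by its own parameters.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative 3PL parameters for five items: discrimination a, difficulty b, guessing c.
a = np.array([1.5, 0.8, 1.2, 2.0, 0.6])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.2, 0.2, 0.25, 0.2, 0.2])

def p_correct(theta):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def estimate_theta(responses):
    """Maximum-likelihood ability estimate for a 0/1 response pattern."""
    def neg_log_lik(theta):
        p = p_correct(theta)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Both students answer 3 of 5 items correctly (same CTT total),
# yet their IRT ability estimates differ because the items carry unequal weight.
print(estimate_theta(np.array([1, 1, 1, 0, 0])))
print(estimate_theta(np.array([0, 0, 1, 1, 1])))
```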
There was a positive, statistically significant, and high similarity between the scores obtained under the IRT and the CTT, as shown by strong correlation coefficients. This similarity was slightly greater for the second dimension of the language tests than for the first. This outcome showed that there is little variation in the measurement results between the IRT and the CTT. Although these associations were strong, they do not imply that a different theory could readily replace the way a specific institution assesses students' progress. Akyildiz and Sahin (2017) compared the results of the CTT and the IRT to highlight this point and found that the scores fit the multidimensional logistic model better; however, the ability estimates were not directly comparable, and administrators should not treat the IRT and the CTT as interchangeable. As a result, it may be inferred that institutions should choose only one of these two methods and that all language tests should be prepared, administered, and assessed in line with the chosen method. Since there was a positive and highly significant relationship between the test results, the testing theories, and the students' language levels when the mean scores were examined, this finding may be construed as evidence of the high validity and reliability of both language tests.