NIM : 22178011
Subject : Language Teaching Evaluation
In measuring and evaluating student performance, there are three theories to choose from: Classical Test Theory (CTT), the first and most commonly used; Item Response Theory (IRT), the most prominent in recent years; and, last but not least, Generalizability Theory (GT). The CTT is typically used to analyze test results in language schools, and estimates of students' language proficiency are mostly based on those performance scores.
It should also be remembered that, in addition to the test item parameters, other important factors such as the learners' language level need to be considered. To examine students' language skills in a language teaching program with more than 3000 students enrolled at two Turkish state universities, the current research examined the dimensionality of scores obtained from two different tests designed for learners at two different language levels (low-intermediate and intermediate). These tests were administered in the 2018–19 academic year.
To determine whether the CTT and the IRT provide significantly different estimates of language test scores for language learners, the analyses pertaining to those estimates were examined. Without a thorough analysis that takes item parameters into account, it would be impossible to tell whether each item on the language tests used in the study carries the same weight in the scores, or whether the chosen measurement approach (CTT or IRT) would significantly affect how well students performed.
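To make the contrast concrete: under the CTT, a learner's total score is simply the unweighted sum of item scores, whereas a three-parameter IRT model ties the probability of answering an item correctly to the learner's ability and to item-level parameters. The notation below is a standard textbook formulation, not one taken from the study's own analyses.

```latex
% CTT: the total score X is the unweighted sum of n dichotomous item scores X_i (0 or 1)
X = \sum_{i=1}^{n} X_i
% IRT (three-parameter logistic model): probability that a learner with ability \theta
% answers item i correctly, given discrimination a_i, difficulty b_i, and guessing c_i
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```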
Literature Review
Studies comparing the CTT and the IRT have produced varying results. They are presented here from the oldest to the most recent findings. The scoring system
based on the IRT was shown to be more predictive for both males and females than the CTT when Young (1991) first compared it to the traditional cumulative score computation of the CTT for predicting the abilities of male and female students. Next, Gelbal (1994) contrasted the ability parameters derived from the IRT Rasch model with those from the CTT, using the results of achievement tests created for fifth-grade Turkish and mathematics classes in primary schools. The results showed a strong positive association between the ability estimates derived from the two theories, and he concluded that this relationship grew stronger as the students' ability levels increased. Fan (1998) found no statistically significant difference between the distinctive measures of the IRT and the CTT, and concluded that neither theory could be favoured over the other across various assessment and evaluation situations. Last but not least, Akyildiz and Sahin (2017) contrasted the CTT and the IRT on a faculty achievement exam used in open education, and reported that the results fit the multidimensional logistic model more closely.
In conclusion, the literature review has demonstrated that no definite conclusion can be drawn about the superiority of either the IRT or the CTT. While some studies revealed a strong link between the two, others revealed discrepancies between these two methods of assessment. It should be noted, nonetheless, that practically all researchers advised using the IRT for item selection, since it reflects achievement differences better than the CTT. Additionally, after seeing the findings of the current study, practitioners can adopt alternative test development approaches and contrast them with the ones they already employ. Finally, this study is valuable to scholars since it explored multidimensionality and provided an analysis of how to estimate language proficiency for large-scale testing.
Methodology
This study was designed as exploratory research to determine which measurement model best fits the test data obtained from language learners; exploratory studies investigate matters that are still unclear to researchers. The CTT was used to assess language learners' skills on two multiple-choice language proficiency tests, and the same analysis was repeated on the basis of the IRT to compare its results with those of the CTT. In other words, the researcher applies the conventional method and evaluates its adequacy by comparing it with an alternative method. The data consisted of the language achievement and proficiency examination results of 2032 low-intermediate and intermediate learners, collected in the Spring Term of the 2018-2019 school year from a Turkish language school that serves two different state universities.
The researcher stored the data on two separate memory cards. The students' test results were then coded and grouped with the SPSS 26 data analysis application. First, the data of 2097 students were pulled from the database and thoroughly reviewed. The test results of 65 students were removed from the data set because their score sets contained missing values. As a result, the test data of 2032 students were deemed adequate and divided into two independent score sets (end-of-course exam and proficiency test). The original form of each item was then re-coded into a 1-0 matrix using the relevant answer keys for each test's booklet codes; a similar re-coding of achievement scores was reported by Akyildiz and Sahin (2017). Finally, the correctness and dependability of the data were checked and reported by comparing each participant's total score with the total score stated in the original database score set.
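A minimal sketch of this preparation step is given below, assuming the raw responses and answer keys are available as hypothetical CSV files; the file names and column labels are illustrative, not those of the study's actual database.

```python
import pandas as pd

# Hypothetical input: one row per student, one column per item with the chosen option (A-D),
# plus a 'booklet' column identifying the test booklet code.
responses = pd.read_csv("raw_responses.csv")
# Hypothetical answer keys: one row per booklet, one column per item with the correct option.
keys = pd.read_csv("answer_keys.csv").set_index("booklet")

item_cols = [c for c in responses.columns if c.startswith("item_")]

# Drop students whose score sets contain missing values (65 cases in the study).
responses = responses.dropna(subset=item_cols)

# Re-code each item into a 1-0 matrix: 1 if the answer matches the booklet's key, else 0.
key_for_student = keys.loc[responses["booklet"], item_cols].to_numpy()
scored = (responses[item_cols].to_numpy() == key_for_student).astype(int)
scored = pd.DataFrame(scored, columns=item_cols, index=responses.index)

# Total scores can then be checked against the totals stored in the original database score set.
scored["total"] = scored[item_cols].sum(axis=1)
```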
Data Analysis
Students' raw test results from the proficiency tests and the end-of-course exams were examined to compare the CTT and the IRT models and reach a conclusion. The IRT model with the better model fit was then used to derive ability estimates with the IRTPRO program. The correlation coefficients between the two testing models were calculated to determine whether or not these ability estimates differed considerably. To answer the final research question, correlation coefficient tests relating to the test booklet type were used to examine the relationships between the item parameter estimates obtained under both the CTT and the IRT; in this way it was investigated whether or not the booklet type significantly affected the item parameters in either testing model. The item parameters, which are most emphasized in the IRT and were obtained from the data set in the following order, are:
1. Item discrimination,
2. Item difficulty,
3. Probability of answering the question correctly by chance.
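As an illustration of the CTT side of this comparison, the classical analogues of these parameters can be computed directly from the 1-0 matrix: item difficulty as the proportion of correct answers and item discrimination as the item-total (point-biserial) correlation. The sketch below assumes a scored matrix like the one prepared earlier and an array of IRT ability estimates exported from IRTPRO; the function and variable names are illustrative.

```python
import numpy as np
from scipy import stats

def ctt_item_parameters(scored: np.ndarray):
    """scored: a students x items matrix of 0/1 responses."""
    totals = scored.sum(axis=1)
    difficulty = scored.mean(axis=0)  # CTT item difficulty: proportion of correct answers
    discrimination = np.array([
        # CTT item discrimination: point-biserial correlation with the rest-of-test score
        stats.pointbiserialr(scored[:, i], totals - scored[:, i])[0]
        for i in range(scored.shape[1])
    ])
    return difficulty, discrimination

def compare_ability_estimates(ctt_totals: np.ndarray, irt_theta: np.ndarray):
    """Pearson correlation between CTT raw totals and IRT ability estimates (e.g. from IRTPRO)."""
    r, p = stats.pearsonr(ctt_totals, irt_theta)
    return r, p  # a strong, significant r indicates the two models rank learners similarly
```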
Findings
The DIMTEST determines whether, when the test items are separated into two groups, there is a substantial difference between the two groups of items in terms of test correlation. The first group examined in the study consisted of the items with a stronger correlation than the others; the remaining items formed the second group. Whether there is a substantial difference between these two item groups can then be determined with the DIMTEST. The outcome of the DIMTEST analysis, the T value, and its significance are reported. If the T value is significant, it is concluded that there is a substantial difference between the two sets of items and that the measuring instrument contains more than one dimension. In this Findings section, the researcher presents the data in the form of tables:
1. DIMTEST results
2. NOHARM Dimensionality Analysis
3. IRT Data-Model Fit Coefficients
4. IRT and CTT correlation coefficients according to language test scores
5. IRT and CTT correlation coefficients according to different test booklets
6. Correlation coefficients between the IRT and CTT according to language
levels
According to the findings of the DIMTEST analysis, the proficiency test results were multidimensional whereas the end-of-course test results were unidimensional. Thus, it was accepted that the IRT should be used to analyze the proficiency exam data set. Although the DIMTEST is presented as a more effective dimensionality analysis technique than traditional techniques such as Exploratory Factor Analysis (EFA), it essentially uses a linear approach to model the relationships between items and student abilities. For this reason, the researcher also used the NOHARM test, which allows an analysis more appropriate to the IRT's main objective. The dimensionality analyses were repeated in light of the possibility that the relationship between students' responses to the questions and their language abilities was curvilinear rather than linear.
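The linear versus curvilinear distinction drawn here can be illustrated with a small sketch that fits both a straight line and a logistic (S-shaped) curve to the relationship between ability and the probability of answering one item correctly. The simulated data and parameter values below are purely illustrative, not the study's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Simulate ability values and 0/1 responses to a single item via a 2PL-like process.
theta = rng.normal(size=500)
a, b = 1.2, 0.3                                    # illustrative discrimination and difficulty
p_correct = 1.0 / (1.0 + np.exp(-a * (theta - b)))
y = rng.binomial(1, p_correct)

X = theta.reshape(-1, 1)
linear = LinearRegression().fit(X, y)              # linear assumption (EFA/DIMTEST-style)
logistic = LogisticRegression().fit(X, y)          # curvilinear assumption (IRT-style)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(linear.predict(grid))                        # can fall outside the 0-1 range at the extremes
print(logistic.predict_proba(grid)[:, 1])          # stays within 0-1, S-shaped like an item curve
```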
The end-of-course test was found to be multidimensional, and the correlation coefficients calculated between the ability measures according to the CTT and the IRT, taking into account the participants' language levels, varied between 0.850 and 0.877 and were statistically significant (p < .01). As a result, it was determined that the ability measures of the intermediate students in the end-of-course exam exhibited only slight variations depending on the estimation technique. The proficiency test was likewise found to be multidimensional, with correlation coefficients varying between 0.831 and 0.853; all of these coefficients, calculated between the ability measures according to the CTT and the IRT and taking the participants' language levels into account, were statistically significant (p < .01). The low-intermediate language level showed the relatively larger correlation coefficients for both the first and the second dimension.
It was determined that the proficiency exam, which was taken by more than 2000 students learning English as a second language, was multidimensional in nature. This conclusion should be taken into account by language school testing teams, and IRT-based assessment and evaluation models should be developed to measure students' language abilities more effectively. The importance of proficiency assessments in determining students' real language competency is emphasized by Bachman (2004). Whether or not an inaccuracy leads a student to pass or fail, it might have a negative impact on their entire life and cause additional problems.
The analyses then showed a positive correlation between the total scores of the end-of-course and proficiency exams, calculated by allocating 1 point to each item in accordance with the CTT, and the language ability measures estimated with the IRT approach, which assigns a different value to each question by taking into account the student's ability, the item's difficulty, the item's discrimination quality, and the likelihood that the question would be answered correctly by chance. This result is consistent with the findings of MacDonald and Paunonen (2002), who, in light of this fact, recommended using the IRT particularly when creating item sets, because the IRT allows for the detection of dimensionality.
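A small sketch of this scoring difference is given below, using invented three-parameter item values rather than those estimated in the study: two response patterns with identical CTT totals can yield different IRT ability estimates because each item is weighted by its own parameters.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative 3PL parameters for five items: discrimination a, difficulty b, guessing c.
a = np.array([1.5, 0.8, 1.2, 2.0, 0.6])
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
c = np.array([0.2, 0.2, 0.25, 0.2, 0.2])

def p_correct(theta):
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def estimate_theta(responses):
    """Maximum-likelihood ability estimate for a 0/1 response pattern."""
    def neg_log_lik(theta):
        p = p_correct(theta)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Both students answer 3 of 5 items correctly (same CTT total),
# yet their IRT ability estimates differ because the items carry unequal weight.
print(estimate_theta(np.array([1, 1, 1, 0, 0])))
print(estimate_theta(np.array([0, 0, 1, 1, 1])))
```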
There was a positive, statistically significant, and high similarity between the scores obtained under the IRT and the CTT, as shown by strong correlation coefficients. This similarity was slightly greater for the second dimension of the language tests than for the first. This outcome showed that there is little variation in the measurement results between the IRT and the CTT. Although these associations were strong, they do not imply that a different theory could readily replace the way a specific institution assesses students' progress. Akyildiz and Sahin (2017) compared the results of the CTT and the IRT to highlight this point and found that the scores fit the multidimensional logistic model better; however, the ability estimates were not directly comparable, and administrators should not treat the IRT and the CTT as interchangeable. As a result, it may be inferred that institutions should choose only one of these two methods and that all language tests should be prepared, administered, and assessed in line with the chosen method. Since there was a positive and highly significant relationship between the test results, the testing theories, and the students' language levels when the mean scores were examined, this finding may be construed as evidence of the high validity and reliability of both language tests.