
1 Introduction

Peer assessment and peer feedback are key components in the collaborative learning setting [20]. Peer assessment has been shown to increase learning performance [24], and peer feedback can help students reflect on their contributions and identify areas for improvement [26]. The instructor’s participation in orchestrating the peer feedback process is necessary. However, this task is time-consuming, and it is often impossible for the instructor to keep track of the feedback given by each student in a group in a large-enrolment course (more than 100 students) [45]. Therefore, there is a need to develop tools and methods that allow for automated feedback analysis to improve teacher support and, ultimately, learning.

Recent studies have applied machine learning methods to automate content analysis in educational settings [7, 13, 14]. More specifically, previous research proposed the development of tools and models to automate content analysis of instructor-provided feedback to support improved learning practices [7, 35]. Nevertheless, to the best of our knowledge, no previous study has focused on analysing peer feedback in large-enrolment courses with the aim of understanding how students provide peer assessment and feedback.

Therefore, this paper presents the results of a study that examined machine learning algorithms trained on a combination of traditional and state-of-the-art natural language processing features for the automated content analysis of peer feedback. The features analysed included (i) TF-IDF, (ii) content-independent features from Linguistic Inquiry and Word Count (LIWC, [43]) and Coh-Metrix [16], (iii) the sequential content-independent features approach [14], which considers the sequence of the text when extracting features, and (iv) the BERT language model. This study uses a dataset from a large-enrolment introductory course (N = 231 students) at a large engineering university in Sweden. The course was run in a computer-supported collaborative learning setting. The best algorithm evaluated reached Cohen’s \(\kappa \) of 0.43 in the feedback analysis. Further, we report the results of a detailed analysis of the most predictive features for each category, extracted from the CRF model. This information could ultimately increase our understanding of the nature and role of peer feedback in students’ learning processes in a computer-supported collaborative learning setting.

2 Background

2.1 Peer Assessment and Feedback in Computer-Supported Collaborative Learning

Collaborative learning has been shown to relate to student learning performance (e.g., [48]). In computer-supported collaborative learning (CSCL) settings, collaborative learning processes are supported by technology [11]. By harnessing technological affordances and pedagogical strategies, learners are supported in their group learning processes, knowledge sharing, and knowledge co-construction [21]. In their meta-analysis on the effects of key elements (e.g., the role of collaboration, computer use, learning settings, and supporting strategies) in CSCL settings, Chen et al. [8] identified that peer assessment and peer feedback moderately affected knowledge gains, perception, and social interaction in these environments.

In the CSCL setting, peer assessment and feedback are key factors of success [20]. Peer assessment has been found to affect student learning positively [24]. Generally, students have been shown to exhibit positive attitudes towards peer assessment (e.g., [38]), and they find that peer assessment practices help them divide tasks equally [27]. In this context, peer feedback is understood as a communication process in which students provide each other with information to increase learning performance [37]. Peer feedback has often been considered an educational activity for enhancing students’ learning opportunities. Moreover, peer feedback aims to bridge the gap between a student’s current performance and the desired level of performance [25]. In higher education, peer assessment and feedback have been found to enhance student learning, academic achievement, and metacognition [40].

Moreover, earlier research on peer feedback has demonstrated that different kinds and levels of specificity of the information provided in feedback lead to different learning outcomes [42]. Among the different types of feedback specificity, scholars argue that supporting comments and suggestions are important, since understanding how these types of feedback promote improvement can inform future feedback training and practices [10]. Additionally, affective feedback (e.g., in the form of text messages) uses affective language to support, praise, or criticise the work by expressing positive or negative feelings about it [10].

Researchers have also framed formative feedback in terms of affective and cognitive feedback practices [33]. In this direction, cognitive feedback concerns the content of the work: the feedback comment may specify and explain aspects of the work under review, highlighting problems, explanations, and suggestions, and it can influence the receiver’s performance in various ways. Others categorised and studied peer feedback in terms of cognitive and metacognitive dimensions (i.e., evaluative and reflective feedback that helps students develop their peer feedback skills [10]), arguing that affective feedback is essential. However, this has not been confirmed by later studies, which showed that affective feedback does not necessarily influence post-feedback behaviour [15].

Despite the benefits of understanding the nature of peer feedback, analysing the entire content generated by students in peer feedback activities is challenging and time-consuming for the instructor, especially in large-enrolment courses [45]. Extracting the main information from peer feedback is important, as it could help instructors understand the main questions and students’ mistakes, which supports instructors’ feedback and could lead to changes in the course design [45]. Thus, the development of tools to support this process is extremely valuable.

In short, the previous studies listed in this section synthesised the key aspects of peer feedback into five distinct categories: (i) Management, which pertains to the students’ contributions during project development; (ii) Affect, encompassing the affective messages conveyed in the feedback by students; (iii) Interpersonal factors, referring to positive and negative interactions that occur during group work; (iv) Suggestions for improvement, including explicit recommendations for enhancing the proposed activity; and (v) Cognition, covering reflective and evaluative messages. Additionally, a Miscellaneous category was included to capture any messages that did not fit into the other categories.

2.2 Computational Approaches to Automated Content Analysis

Several studies have developed and applied computational approaches for automated content analysis in different educational contexts, automatically analysing large amounts of textual data, including student essays, forum posts, and feedback [7, 13, 14]. In the context of feedback analysis, Lee and Lim [23] proposed an automated analysis of feedback about university courses to highlight students’ main concerns with the institution based on the key terms in their feedback. The authors created several graph representations based on the top words in the feedback messages. This study demonstrated that it is possible to process and understand large amounts of unstructured data using text analytics.

Recently, several studies have applied natural language processing methods and linguistic tools (e.g., LIWC and Coh-Metrix) to extract specific categories from feedback messages provided by instructors [7, 35]. The first study proposed several models to automatically identify the feedback levels defined in [17] (i.e., task, process, self-regulation, and self); the random forest classifier, the best-performing algorithm, reached Cohen’s \(\kappa \) of 0.39 [7]. Osakwe et al. [35] evaluated the same categories, but also used several sampling methods and an ablation study to overcome the limitations of the previous paper. In the best-case scenario, Cohen’s \(\kappa \) increased to 0.42. In both papers, the authors presented the most predictive features.

In addition to traditional machine learning models, deep learning approaches have proven effective for text classification in educational settings [12]. An example is the work of [3], which evaluated the performance of random forest-based algorithms and the BERT deep learning language model for the automatic detection of social presence in online discussions. The authors compared an approach based on traditional text mining and linguistic features such as LIWC and Coh-Metrix with an approach using a fine-tuned BERT language model for social presence classification. The results demonstrated that the XGBoost and AdaBoost (Adaptive Boosting) algorithms outperformed the BERT model in the automated classification of online discussion messages.

Finally, we highlight the research by Ferreira et al. [14], which proposed a new approach for the automated content analysis of written essays, called sequential content-independent features, that incorporates features from adjacent sentences in addition to those of the analysed sentence [14]. The authors suggested that this approach could be useful when the classification task concerns a sentence within a larger text (e.g., a sentence in a feedback message). They evaluated traditional machine learning algorithms combined with TF-IDF, content-independent, and sequential content-independent features. In that study, the best-performing classifiers were the XGBoost and CRF models based on the proposed sequential content-independent feature set, demonstrating the potential of using the sequence of sentences in a text as a key factor in this type of analysis.

3 Research Questions

The previous studies demonstrate the potential of automated content analysis of educational feedback. This study proposes the evaluation of traditional TF-IDF and content-independent features combined with traditional machine learning algorithms and BERT for the automated content analysis of peer feedback according to the six categories described in Sect. 2.1: Management, Affect, Interpersonal factors, Suggestions for improvement, Cognition, and Miscellaneous. As such, this study aims to answer the following research question:

Research Question 1 (RQ1): To what extent can natural language processing algorithms automatically identify categories of peer feedback?

Although automated content analysis of feedback messages could facilitate the instructors’ interactions with the students in the course [40], the classification per se does not provide insights into the most relevant features. Previous studies have demonstrated the potential of further unpacking the most important features in educational settings [7, 14]. Therefore, we utilised the best-performing classifier developed in this study to address the second research question:

Research Question 2 (RQ2): Which features are the most predictive of peer feedback categories?

4 Method

4.1 Course Design and Dataset

In the studied context, students conducted project-based tasks as part of a design project, performed in groups of four to six students, along with seminars and lectures. To support both the students in their collaborative learning process and the teaching assistants and examiner in the assessment process, a CSCL assessment system (CLASS) was introduced in the course (for details, see [46]). The CLASS system has three key modules: (i) peer assessment, (ii) self-reflection, and (iii) examiner feedback. Peer assessment and self-reflection tasks were a mandatory part of the individual assessment process during both the formative and summative assessment steps. More specifically, each student was asked to write anonymous feedback to the four to five peers constituting the same project group; this was part of the student’s individual assessment and contributed to the student’s grade during both the formative and summative assessment steps. In the CLASS system, students could read their peers’ self-reflections, which, together with their personal experiences from the project group work, were used to provide anonymous feedback to their peers. Each student needed to formulate eight to twelve pieces of peer feedback throughout the course. The teaching assistant and the examiner read the anonymous feedback for each student to assess each student’s contribution to the project work.

In this context, a dataset containing 2,444 feedback messages written in English by 231 students was collected. These feedback messages were divided into 10,319 sentences, extracted from the peer feedback written by students in the Swedish university course described above, and annotated into different categories. The data annotation followed the exploratory approach proposed by Boyatzis [5], in which no predetermined set of categories is adopted. The main idea of this approach is to let the annotators identify categories that surface directly from the text data. The annotation used sentences as the unit of analysis and followed four steps: (i) two annotators assessed 3% of all feedback messages (n = 120); six categories emerged from this process: (1) ’management’, (2) ’suggestions for improvement’, (3) ’interpersonal factors’, (4) ’cognition’, (5) ’affect’, and (6) ’miscellaneous’. (ii) A sample of 500 individual students’ feedback entries was coded by the two annotators separately; this sample aligned well with the recommended size (10% to 25% of the dataset) proposed in [36]. Cohen’s kappa at this stage reached 0.65, which suggests a moderate level of agreement [22] and also validated the categories identified in the previous step. (iii) The annotators discussed the disagreements and re-annotated the divergent texts; at this stage, Cohen’s kappa reached 0.86 (a strong level of agreement [22]). (iv) The rest of the dataset was coded independently by the annotators.
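
For reference, the inter-annotator agreement reported at steps (ii) and (iii) can be computed directly from the two annotators’ label sequences; a minimal sketch with scikit-learn is shown below (the labels are hypothetical and not drawn from the dataset).

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical category labels assigned by the two annotators to the same sentences.
annotator_a = ["management", "affect", "cognition", "affect", "miscellaneous"]
annotator_b = ["management", "affect", "cognition", "suggestions for improvement", "miscellaneous"]

print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```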

Table 1 provides a comprehensive overview of each category, including its rationale, an illustrative example sentence, and the number of instances (sentences) belonging to that category. Examining this table clarifies the categories and the criteria used to classify them.

Table 1. Descriptions of the peer feedback categories in the dataset

4.2 Features

The natural language processing step was performed using TF-IDF features, which are commonly applied in text classification problems [13], and content-independent features derived from linguistic resources that have been used to classify different types of student texts [7, 13, 31]. Furthermore, Ferreira et al. [14] proposed a new approach, called sequential content-independent features, that has demonstrated potential for classifying educational texts. Thus, we defined the following features for this study:

  • TF-IDF: Term Frequency - Inverse Document Frequency (TF-IDF) is a content-based text feature extraction approach commonly used in classification models [28]. It transforms a textual document into a vector of term weights, in which each term’s frequency is scaled by its inverse document frequency [28]. We adopted the traditional TF-IDF technique [28].

  • LIWC features: LIWC is a text analysis resource that counts words in psychologically meaningful categories [43]. The distribution of those categories in a text can give insight into the psychological state of its author or reflect the author’s personal condition [44]. In this study, we extracted a total of 94 LIWC features. For the problem of peer feedback, these features are relevant for two main reasons: they provide structural characteristics of the text and features related to emotions, both of which can be useful for classifying the sentences [7].

  • Coh-Metrix features: Coh-Metrix is a computational linguistic tool that measures text cohesion and difficulty on a range of word, sentence, paragraph, and discourse dimensions [32]. It is extensively used in the educational field to evaluate the coherence and structure of text (e.g., [7, 14]). In this study, we have extracted a total of 83 Coh-Metrix features.

  • Sequential content-independent features: This approach incorporates the features of neighbouring sentences, meaning that the feature vector of a sentence \(S_i\) contains its own features plus those of the sentences \(S_{i-1}\) and \(S_{i+1}\) when these exist [14]. As this study focuses on analysing individual sentences within a larger text (e.g., a feedback message), this approach could potentially improve the final classification, as it considers the sequence of sentences in the text [14].

For the initial content-independent feature space used in this study, considering LIWC and Coh-Metrix only, we had 177 features. After incorporating the features from the previous and subsequent sentences to form the sequential content-independent features, the final feature vector of each sentence contained 531 features (3\(\,\times \,\)177 features).
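
For illustration, the sketch below shows one way the 531-dimensional sequential vectors could be assembled from the 177 content-independent features. It assumes zero-padding when a neighbouring sentence does not exist, which is our assumption rather than a detail specified in [14].

```python
import numpy as np

def build_sequential_features(features, message_ids):
    """Concatenate each sentence's content-independent features (177 values) with
    those of the previous and next sentence in the same feedback message,
    yielding 3 x 177 = 531 features per sentence (zeros for missing neighbours)."""
    X = np.asarray(features, dtype=float)      # shape: (n_sentences, 177)
    n, d = X.shape
    out = np.zeros((n, 3 * d))
    for i in range(n):
        out[i, d:2 * d] = X[i]                 # current sentence S_i
        if i > 0 and message_ids[i - 1] == message_ids[i]:
            out[i, :d] = X[i - 1]              # previous sentence S_{i-1}
        if i < n - 1 and message_ids[i + 1] == message_ids[i]:
            out[i, 2 * d:] = X[i + 1]          # next sentence S_{i+1}
    return out
```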

4.3 Model Selection and Evaluation

We trained and evaluated different machine learning classifiers used in previous works [7, 14, 31]: Gaussian kernel SVM (SVM), Gaussian Naive Bayes (NB), Logistic Regression (LR), K-nearest neighbours (KNN), AdaBoost, XGBoost, Random Forest (RF), and Conditional Random Fields (CRF). The SVM algorithm attempts to find a hyperplane with the maximum distance from the positive and negative examples [2]. In the case of a multi-class classification problem, the outcome is a combination of several SVM classifiers [14]. The KNN algorithm finds the k nearest neighbours among the training documents (e.g., sentences in feedback messages) and uses the categories of the k neighbours to weigh the category candidates using the similarity score between each neighbour document and the test document [47]. NB uses the joint probabilities of words and categories to estimate the probabilities of categories given a document (e.g., a sentence in a feedback message) [47]. The Logistic Regression algorithm is used to assess the effects of predictor variables on categorical outcomes [34]. It estimates the probability of an event occurring based on given data and a set of independent variables [18].

Another family of algorithms evaluated was the decision tree ensembles. Random Forest is a technique that combines tree predictors based on the values of a random vector sampled independently for all trees in the forest, where each tree casts a unit vote for the most relevant category for a given input [6]. AdaBoost and XGBoost are state-of-the-art decision tree approaches [14]. These algorithms use the boosting technique to enhance the performance of individual models: they train a sequence of weak models and combine their outputs into an accurate classification [29]. Lastly, the CRF algorithm creates a graph model that analyses the neighbourhood of the instances in the categorisation process [14].
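
As a minimal sketch of how these classifiers can be instantiated, assuming scikit-learn and xgboost, the snippet below uses illustrative default hyperparameters rather than those of the study.

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from xgboost import XGBClassifier

# Illustrative instantiation of the traditional classifiers; CRF requires a
# sequence-based interface and is handled separately (see the CRF sketch below).
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "NB": GaussianNB(),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "AdaBoost": AdaBoostClassifier(),
    "XGBoost": XGBClassifier(),
    "RF": RandomForestClassifier(),
}
```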

Finally, we also evaluated the performance of the BERT language model. BERT generates embeddings that vary according to the textual context of each occurrence of a lexical item, which allows it to capture variations in meaning [12]. This work uses the pre-trained BERT model provided by the simpletransformers library (see Footnote 1).
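
A hedged sketch of fine-tuning BERT with simpletransformers is shown below; the model name, training arguments, example sentences, and label encoding are illustrative assumptions, not details reported in this paper.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Hypothetical training data: one sentence per row, with labels integer-encoded
# over the six peer feedback categories (the mapping shown here is made up).
train_df = pd.DataFrame(
    [
        ["Thanks for all your hard work during the project.", 4],  # affect
        ["You could structure the report more clearly.", 1],       # suggestions
        ["She organised the group meetings every week.", 0],       # management
    ],
    columns=["text", "labels"],
)

model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=6,
    args={"num_train_epochs": 1, "overwrite_output_dir": True},
    use_cuda=False,  # set to True when a GPU is available
)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["Maybe add more detail to the design section."])
```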

To address research question RQ1, we measured the performance of the classifiers using a 10-fold cross-validation sampling approach in combination with two measures widely used in the literature [1, 14]: (i) the F1-score, the harmonic mean of precision and recall, where precision measures the percentage of correct instances among the identified positive instances and recall measures the percentage of correct instances identified among all the positive instances [19]; and (ii) Cohen’s \(\kappa \) coefficient, a statistical measure of inter-rater agreement for qualitative items [30].
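
A minimal evaluation sketch is given below, assuming a feature matrix X, integer-encoded labels y, and the classifiers dictionary sketched earlier in this section; the F1 averaging scheme (weighted) is our assumption, as it is not specified here.

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score, cohen_kappa_score

def evaluate(classifiers, X, y, folds=10):
    """Run 10-fold cross-validation and report F1 and Cohen's kappa per classifier."""
    results = {}
    for name, clf in classifiers.items():
        y_pred = cross_val_predict(clf, X, y, cv=folds)
        results[name] = {
            "f1": f1_score(y, y_pred, average="weighted"),  # averaging scheme assumed
            "kappa": cohen_kappa_score(y, y_pred),
        }
    return results
```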

To tackle research question RQ2, we evaluated the significance of the top 20 features concerning their relevance to predicting the categories examined in the present study (see Table 1). We focused the analysis on the best-performing model (CRF) pinpointed in the evaluation carried out to address the first research question. The Transition Feature Coefficients (TFC) [41], a widely used measure for this purpose, were applied to estimate the importance of the individual features for the CRF model.
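
As an illustration of how such coefficients can be inspected, the sketch below uses the sklearn-crfsuite library; the library choice, the toy feature dictionaries, and the labels are assumptions for illustration only.

```python
from collections import Counter
import sklearn_crfsuite

# Hypothetical input: each feedback message is a sequence of sentences, and each
# sentence is a dict mapping feature names (e.g., LIWC/Coh-Metrix indices) to values.
X_train = [
    [{"liwc.posemo": 3.2, "cm.WRDPRP2": 1.0}, {"liwc.insight": 2.1, "cm.WRDPRP2": 0.0}],
]
y_train = [["affect", "cognition"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

# (feature name, label) -> coefficient: per-category weight of individual features.
top_state_features = Counter(crf.state_features_).most_common(20)
# (label_from, label_to) -> coefficient: transition weights between categories.
top_transitions = Counter(crf.transition_features_).most_common(20)
```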

Table 2. Results for the analysed algorithms in terms of F1 and Cohen’s \(\kappa \).

5 Results

5.1 RQ1: Performance of the Proposed Models

RQ1 aimed to compare the performance of different machine learning algorithms in identifying relevant categories of peer feedback. Table 2 presents the results of the machine learning algorithms introduced in Sect. 4.3, trained using content-based features (TF-IDF and BERT), content-independent features, and sequential content-independent features (Sect. 4.2), respectively. The table shows the F1 and Cohen’s \(\kappa \) results using the 10-fold cross-validation described in Sect. 4.3. We also note that, in the case of the BERT classifier, results based on content-independent and sequential features could not be obtained, as this classifier analyses feedback content based on the pre-trained embeddings of the BERT language model.

The results indicated that the models based on TF-IDF features outperformed those based on content-independent features for the majority of the classifiers in the analysis. The exceptions were AdaBoost and XGBoost, for which the content-independent features reached better results. It is important to highlight that the results for the content-independent and sequential content-independent features were comparable for KNN, LR, RF, and CRF. Finally, using sequential features did not increase the performance of any model assessed. CRF, the best-performing classifier, reached Cohen’s \(\kappa \) of 0.43 (moderate agreement [22]) and 0.35 when applied with TF-IDF and content-independent features, respectively. CRF combined with TF-IDF also performed better in terms of F1. The Logistic Regression and Random Forest classifiers also reached Cohen’s \(\kappa \) values higher than 0.3, which indicates a fair level of agreement [22].

5.2 RQ2: Feature Importance

To answer research question RQ2, we analysed the most relevant features of the CRF classifier, as this was the best-performing classifier. Moreover, we extracted the main features for both the TF-IDF and the content-independent analyses in order to obtain different insights. The main goal of this part of the study was to provide insights into the most predictive features and their contributions to each category assessed.

Table 3 presents the top 20 most predictive TF-IDF features for the CRF classifier, the best-performing model in our experiments, ranked based on the TFC index. It also shows the average (and standard deviation) of the features for each category. The table lists the most significant words for the model. In general, the relevant words can be related to the categories extracted (more details in Sect. 6). However, the frequency distributions for many categories did not differ (including many columns with 0), which limits the possible interpretation of the influence of specific features on specific categories.

Table 3. Top-20 most important TF-IDF features and their values for each category using the CRF classifier (see Table 1 for the category names).
Table 4. Top-20 most important Content-Independent features and their values for each category using the CRF classifier (see Table 1 for the category names).

Similarly, Table 4 shows the importance analysis for the CRF in combination with the content-independent features. Again, the features were ranked according to the TFC measure. The key findings from this table are: (i) Coh-Metrix and LIWC each have 10 features within the top 20; (ii) features related to the number of pronouns (cm.WRDPRP2, cm.WRDPRP3p), the incidence of linguistic elements (cm.CNCCaus, cm.WRDNOUN, cm.DRGERUND, cm.CNCAdd), and average/standard deviation measures related to words, sentences, and paragraphs (cm.DESSLd, cm.DESSLd, cm.WRDFRQa, cm.DESWLsyd) were the relevant Coh-Metrix features; (iii) for LIWC, features related to Personal Concerns (liwc.money, liwc.relig, liwc.death), Cognitive Processes (liwc.discrep, liwc.insight, liwc.compare), Affective Processes (liwc.sad), Biological Processes (liwc.ingest), and Spoken categories (liwc.filler) were relevant.

6 Discussion

The results for the automatic categorisation of peer feedback messages (RQ1) revealed that the CRF classifier reached the best performance when applied in combination with TF-IDF and content-independent features. CRF reached Cohen’s \(\kappa \) of 0.43 (TF-IDF) and 0.35 (content-independent), which represent moderate and fair levels of agreement [22], respectively. This result aligns with the literature, as CRF usually outperforms other models when applied to tasks related to categorising individual sentences within larger texts (e.g., feedback messages) [14, 39]. Moreover, previous studies on the analysis of feedback messages (using different categories) also reached similar results: Cohen’s \(\kappa \) of 0.39 [7] and 0.42 [35]. On the other hand, XGBoost did not achieve good values. The unbalanced nature of the dataset could have influenced this outcome [9], but further investigation is required to understand this result better.

Moreover, TF-IDF features generally reached better results. This indicates that the vocabulary (i.e., words) used by students was relevant. The low performance of BERT and the low relevance of features related to vocabulary richness (from Coh-Metrix) suggest that the students did not employ varied language when providing feedback to their peers, instead using similar words throughout [12, 16, 31]. This finding is aligned with the literature, as peer feedback is usually designed for specific tasks with a focused vocabulary [37, 38, 40]. Finally, it is important to mention that previous studies [14] have shown a lack of generalisability of TF-IDF across different domains, which can impact the performance of models with these features when applied to the same task (e.g., peer feedback analysis) but in different courses.

Our second research question aimed to analyse the most important features extracted from the best-performing classification model (CRF), measured using the Transition Feature Coefficients (TFC) [41]. The results showed that the top-ranked TF-IDF features included words related to the categories analysed. For instance, the words discussion and improvement could indicate “suggestions for improvement”; collaboration and design are related to “management”; good and thanks to “affect”; and idea, creativity, and knowledge to “cognition”. However, it is hard to interpret the differences in the frequency of these words per category. In this sense, previous works [4, 7] have suggested that the analysis of content-independent features could provide more actionable insights.

For example, Table 4 shows that a higher number of second-person pronouns (cm.WRDPRP2) was associated with the affect, suggestions for improvement, and miscellaneous categories, while low occurrences of these pronouns were related to the management, interpersonal factors, and cognition categories. Furthermore, the LIWC categories were well aligned with the proposed peer feedback categories, as the features related to cognitive processes (liwc.discrep, liwc.insight, liwc.compare) had higher values for the cognition category, and the LIWC words related to personal concerns had higher values for the affect category.

Finally, this paper makes important educational contributions in several ways. First, the automatic classification of peer feedback content could assist instructors during peer feedback assessment. Specifically, by utilising the automatically identified categories and the most relevant features, instructors could provide targeted feedback to students on how to improve their messages. For instance, the instructor could recommend that students provide more personal (affect category) or reflective (cognition category) feedback. Additionally, these findings could serve as a foundation for developing technology-supported systems, such as chatbots, that offer timely guidance to students throughout the peer feedback process. For example, a chatbot could send nudges to students while they write a feedback message, suggesting improvements to its content. Furthermore, our research sheds light on key aspects of students’ relationships with their peers, their contributions to group work, and their interactions with others. This information could help reduce the time that human reviewers spend supporting computer-supported collaborative learning (CSCL) interactions. Overall, the contributions of this paper have important implications for improving the effectiveness and efficiency of peer feedback in educational settings.

7 Limitation and Future Directions

We acknowledge the following limitations of the study. First, the data used in the study contained only the categories of the sentences within feedback generated by a sample of students at a single Swedish university. Although this can help instructors speed up their evaluation by knowing where to focus in each feedback message, this study did not ensure the generalisability of the proposed approach. Future research should aim to evaluate the automated classification of similar categories in diverse datasets, potentially encompassing different languages. Second, the dataset used in the study had a very imbalanced number of instances per class, which could impact the outcomes of the evaluated models. Therefore, it is important to evaluate over- and under-sampling algorithms (see the sketch at the end of this section) to optimise the machine learning models used in future research. Finally, this study did not intend to evaluate the integration of the final model with the CLASS system [46], from which we extracted the dataset used here. Such integration is a promising line of future work.
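
As a sketch of the resampling direction only (the library choice and the toy data are assumptions, not part of the present study), class rebalancing could be explored with the imbalanced-learn package, applied within each training fold to avoid information leakage.

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Toy training-fold data: X is a feature matrix and y its imbalanced category labels.
X = [[0.1], [0.2], [0.3], [0.4]]
y = ["affect", "affect", "affect", "cognition"]

X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over), Counter(y_under))
```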