
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/332430141
Text Data Preparation in RapidMiner for Short Free Text Answer in Assisted
Assessment

Conference Paper · November 2018


DOI: 10.1109/ICSIMA.2018.8688806



Text Data Preparation in RapidMiner for Short Free
Text Answer in Assisted Assessment
Tiliza Awang Mat, Adidah Lajis, Haidawati Nasir
Malaysian Institute of Information Technology, Universiti Kuala Lumpur
Kuala Lumpur, Malaysia
{tiliza, adidahl, haidawati}@unikl.edu.my

Abstract—Data preparation is the most crucial stage before any machine learning algorithm can be applied. Text data preparation is equally important in many areas such as information retrieval, data mining, information extraction and natural language processing. This paper describes how text data preparation for short-text automated assessment grading is implemented in RapidMiner. RapidMiner has become a very popular tool in the field of data exploration, data preparation, modelling, scoring, automation and many other data science tasks. It is found that RapidMiner's graphical visual environment makes it convenient, simple and fast to design a model for automated grading. An experiment is therefore carried out to explore this option for short-text automated grading. Automated grading in teaching and learning has been a popular research area ever since it was introduced, and the main challenge in studying it is extracting important information from a collection of unstructured text-based data. Thus, this paper describes in detail how text data can be prepared in the RapidMiner tool in order to discover relevant information for machine learning implementation.

Index Terms—RapidMiner Text Processing; Short-Text Automated Grading; Tokenization; WordNet Synonyms

I. INTRODUCTION

Automated assessment grading has become a popular trend and has been debated among many researchers [1] with respect to how these techniques could supplement human judges in the grading process. Furthermore, the introduction of Massive Open Online Courses (MOOCs) in 2008, which provide automated grading as one of their important features [2], makes it relevant and critical to put more effort into studying the area.

Short text answers are a common student assessment instrument in higher education, especially in technical domains. Our primary objective is to study the feasibility and effectiveness of machine learning in automated grading of short text answers of no more than 150 words in length using a data mining tool. In this paper, however, our focus is on the first part of the study: data preparation for machine learning using the open-source data mining software RapidMiner, voted among the top 10 data mining tools in 2014 [3]. RapidMiner allows various stages of text-mining processes, such as information retrieval (IR) and natural language processing (NLP), to be combined into its workflow. NLP is a substantial technique in this study, as it is used to automate the scoring of student texts with respect to linguistic dimensions such as grammatical correctness or organizational structure [4].

This paper is organized to explain the data preparation, which is divided into two sub-processes: the first is data loading and transformation, and the second is data processing.

II. DATA PREPARATION

Text data preparation for machine learning can be challenging because human language is ambiguous and inconsistent in its syntax and semantics. Thus, many approaches have been explored and experimented with to meet the needs of natural language processing. Figure 1 shows an overview of the general data preparation process; the subsections that follow cover common approaches to text data preparation for machine learning problems.

Fig. 1. Data Preparation Overview

A. Data Loading and Transformation

To start with, tabulated data in Excel format, as shown in Figure 2, is first added to the RapidMiner repository. The data table, called an ExampleSet in RapidMiner, contains the raw text data as well as other important attributes. The Extract, Transform, Load (ETL) process takes place after that; it prepares the data for machine learning through various operators, and the required operators depend entirely on the kind and state of the imported data. Figure 3 shows the ETL process. The Retrieve operator retrieves the data from the RapidMiner repository; its parameter is the location of the Excel file within the repository. Then Filter Examples is used to select which Examples of an ExampleSet are kept and which are removed. The Numerical to Polynomial operator is only used if selected numeric attributes need to be changed to the polynominal type.
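Outside RapidMiner, the load-filter-retype steps just described can be approximated in a few lines of plain Python. This is only an illustrative sketch: the two-column dataset and the pass/fail mapping below are made up for the example, not the paper's actual data or thresholds.

```python
# Minimal pure-Python analogue of the ETL steps described above
# (Retrieve -> Filter Examples -> Numerical to Polynominal).
# The rows and the "score" attribute are hypothetical.

rows = [
    {"answer": "An array stores elements of the same type", "score": 4},
    {"answer": "", "score": 0},  # empty answer, to be filtered out
    {"answer": "A stack is last in first out", "score": 5},
]

# Filter Examples: keep only rows with a non-empty answer.
kept = [r for r in rows if r["answer"].strip()]

# Numerical to Polynominal: map the numeric score onto nominal values.
for r in kept:
    r["score"] = "pass" if r["score"] >= 3 else "fail"

print([r["score"] for r in kept])  # nominal labels instead of numbers
```

In RapidMiner the same effect is achieved declaratively by wiring the corresponding operators; the sketch only shows the data transformation they perform.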
The Numerical to Polynomial operator also maps all values of these attributes to corresponding polynominal values. Other options are available as well: if a more sophisticated normalization method is needed, the Discretization operators can be used. The Select Attributes operator allows a subset of the Attributes of an ExampleSet to be selected or removed. Lastly in the ETL process, Set Role is used to change the role of one or more Attributes. The Examples are considered clean after the ETL process and ready for text data processing.

Fig. 2. Raw Data

Fig. 3. Extract, Transform, Load (ETL)

B. Data Processing

Figure 4 shows the first layer of data processing. The cleaned data from II-A is used as input to the text preprocessing. The polynomial data first needs to be converted into text using the Nominal to Text operator, which changes the type of selected nominal attributes to text and maps all values of these attributes to corresponding string values. The output ExampleSet is loaded into Process Documents from Data. This operator generates numeric values corresponding to each word in a data collection according to the selected vector creation scheme. There are four types of vector creation schemes to choose from; for this study, Term Frequency-Inverse Document Frequency (TF-IDF) was used. This vector creation type is the most useful and very popular [5] and is widely used in automated grading algorithms [6] [7] [8]. The result of this process is a numerical value corresponding to each term appearing in the ExampleSet.

Fig. 4. Data Processing

The second layer of text processing is done by double-clicking Process Documents from Data, as shown in Figure 5. The first step is the Tokenize operator, which splits the stream of text into a sequence of tokens while throwing away certain characters, such as punctuation. There are several ways of specifying the splitting points; for this paper the default mode, splitting on non-letter characters, is used, which leaves the result as tokens consisting of single words. The Aggregate Token Length operator is used to count the length of tokens for the purpose of filtering the ExampleSet to no more than 150 tokens.

Fig. 5. Tokenizing Process

The next process is to filter English stopwords from a document using the Filter Stopwords (English) operator. This removes commonly used words such as "the", "and" and "to" that would appear to be of little value in revealing an intelligent pattern or information. For the stemming process there are many options available, such as Snowball, Porter and Lovins, as well as stemmers for German and Arabic. We have used the Porter stemmer to remove inflection from words. This process reduces words to their stem; for example, the word "have" stems to "hav", which allows it to be matched with "having", which shares the same stem.

To increase learning from the text, the use of word synonyms is explored. With this, words or phrases that mean exactly or nearly the same are generated. The Find Synonyms Wordnet operator groups nouns, verbs, adjectives and adverbs into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet superficially resembles a thesaurus in that it groups words together based on their meanings [9], which makes it a useful tool for computational linguistics and natural language processing. To find word synonyms, the Open Wordnet Dictionary operator is used to connect to the WordNet dictionary, which is freely and publicly available for download from the WordNet website [10]. The output of the find-synonyms process is again normalized using Transform Cases. This step transforms the characters in a document into either lower case or upper case; it is necessary in order to avoid confusion between similar words that differ only in case [11]. The doc output port of Transform Cases is connected back to the first layer of the process through the output node of Process Documents from Data.
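The operator chain described in this section (tokenize on non-letter characters, filter stopwords, stem, weight by TF-IDF) can be approximated in plain Python. This is a deliberately simplified sketch: the tiny stopword list and the crude suffix-stripping stemmer stand in for the full Filter Stopwords (English) list and the real Porter algorithm, and the two sample documents are invented.

```python
import math
import re

# Toy stand-in for the RapidMiner chain: Tokenize -> Filter Stopwords ->
# stem (crude suffix stripping, NOT the real Porter algorithm) -> TF-IDF.
STOPWORDS = {"the", "and", "to", "a", "of", "is"}  # tiny illustrative list

def preprocess(text):
    # Tokenize: split on runs of non-letter characters, lowercase.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Crude stemming: strip a few common suffixes ("having" -> "hav").
    stemmed = []
    for t in tokens:
        for suf in ("ing", "ed", "es", "s", "e"):
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

def tf_idf(docs):
    """TF-IDF vectors: tf(t, d) * log(N / df(t))."""
    processed = [preprocess(d) for d in docs]
    n = len(processed)
    df = {}
    for toks in processed:
        for t in set(toks):
            df[t] = df.get(t, 0) + 1
    return [
        {t: toks.count(t) * math.log(n / df[t]) for t in set(toks)}
        for toks in processed
    ]

docs = ["Having the answer, I have it.", "The answers were graded."]
vectors = tf_idf(docs)
```

Note how "having" and "have" both reduce to the stem "hav", mirroring the Porter-stemmer example given above, and how a term appearing in every document (here "answer") receives a TF-IDF weight of zero.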
III. PRELIMINARY EXPERIMENT

Upon clicking the execution button, the result is displayed as shown in Figure 6. The list of tokens and their synonyms, as well as the numeric value for each word occurrence, is displayed. These results become the input for other processes such as clustering, classification and sentiment analysis. To get a rough idea of how the outcome can be used in a machine learning problem, a preliminary experiment was conducted with a Keras sequential model [12]. Keras is a high-level neural network API supporting popular deep learning libraries such as TensorFlow, Microsoft Cognitive Toolkit and Theano.

Fig. 7. Keras Sequential Model

Figure 7 shows the first level of the Keras model implementation. The preprocessed text data is split using the Split Data operator, which allows us to specify the number of partitions and the relative ratio of each partition. The ratios should be between 0 and 1, and the sum of all ratios should be 1.

Fig. 8. Keras Model Inner Layers

The output port of Split Data is then connected to the input port of the Keras Model operator. This operator provides a set of parameters, such as input shape, loss function, optimizer, learning rate and number of epochs, for initializing and compiling a sequential Keras model. The layers chosen for the neural network architecture are inner operators, as shown in Figure 8. Two Keras Dense layers with relu and softmax activation functions are implemented. The Apply Keras Model operator is connected to the Keras Model output port for this classification problem. Finally, the Performance operator is used for statistical performance evaluation of the classification task; it delivers a list of performance criteria values for the classification task.

IV. RESULTS AND ANALYSIS

Running the Keras sequential model produces the output shown in Figure 9. With only a limited sample of 150, an accuracy of 56.67% was achieved. This outcome gives enough reason to dive deeper into both text data representations and machine learning options.

Fig. 9. Prediction Results
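As a sanity check on the reported figure, 56.67% is consistent with, for example, 17 correct predictions out of a 30-example held-out partition; the 120/30 split below is an assumption for illustration only, as the paper states just the 150-sample total and the accuracy.

```python
# Hedged arithmetic check on the reported accuracy.
# The 80/20 train/test split and the 17 correct predictions are
# ASSUMPTIONS for illustration; the paper reports only 150 samples
# and 56.67% accuracy.
total_samples = 150
test_ratio = 0.2
test_size = int(total_samples * test_ratio)   # 30 held-out answers
correct = 17                                  # hypothetical correct predictions

accuracy = 100 * correct / test_size
print(f"{accuracy:.2f}%")  # 56.67%
```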

Fig. 6. Word List and their Frequency Count

V. CONCLUSION AND FUTURE APPLICATION

In this paper, the use of RapidMiner was explored for text data preparation towards short-text automated grading. It was found that RapidMiner's graphical visual environment makes it convenient, simple and fast to design a model for automated grading. Further exploration will therefore be conducted to investigate and determine the most suitable machine learning algorithm that can accurately predict short-text assessment grades, which altogether will help to reduce the burden of frequent assessment via short free text answers.
ACKNOWLEDGMENT
This research is supported by the Fundamental Research Grant Scheme (FRGS/1/2015/ICT02/UNIKL/02/2) financed by the Ministry of Higher Education Malaysia.
R EFERENCES
[1] M. A. Hearst, “The debate on automated essay grading,” IEEE Intelligent
Systems and their Applications, vol. 15, no. 5, pp. 22–37, 2000.
[2] S. Zhao, Y. Zhang, X. Xiong, A. Botelho, and N. Heffernan, “A memory-
augmented neural model for automated grading,” in Proceedings of the
Fourth (2017) ACM Conference on Learning@ Scale. ACM, 2017, pp.
189–192.
[3] “KDnuggets 15th Annual Analytics, Data Mining, Data
Science Software Poll: RapidMiner Continues To Lead.”
[Online]. Available: https://www.kdnuggets.com/2014/06/kdnuggets-
annual-software-poll-rapidminer-continues-lead.html
[4] D. J. Litman, “Natural language processing for enhancing teaching and
learning.” in AAAI, 2016, pp. 4170–4176.
[5] L.-P. Jing, H.-K. Huang, and H.-B. Shi, “Improved feature selection
approach tfidf in text mining,” in Machine Learning and Cybernetics,
2002. Proceedings. 2002 International Conference on, vol. 2. IEEE,
2002, pp. 944–946.
[6] H. Nguyen and L. Dery, “Neural networks for automated essay grading.”
[7] S. Basu, C. Jacobs, and L. Vanderwende, “Powergrading: a clustering ap-
proach to amplify human effort for short answer grading,” Transactions
of the Association for Computational Linguistics, vol. 1, pp. 391–402,
2013.
[8] L. Bin, L. Jun, Y. Jian-Min, and Z. Qiao-Ming, “Automated essay
scoring using the knn algorithm,” in Computer Science and Software
Engineering, 2008 International Conference on, vol. 1. IEEE, 2008,
pp. 735–738.
[9] “WordNet — A Lexical Database for English,” 2010. [Online].
Available: https://wordnet.princeton.edu/
[10] “Current Version — WordNet,” 2010. [Online]. Available:
https://wordnet.princeton.edu/download/current-version
[11] G. Gupta and S. Malhotra, “Text documents tokenization for word
frequency count using rapid miner (taking resume as an example),” Int.
J. Comput. Appl, pp. 0975–8887, 2015.
[12] “Guide to the Sequential model - Keras Documentation.” [Online].
Available: https://keras.io/getting-started/sequential-model-guide/
