Developing and Evaluating A Learner Friendly Collocation System With User Query Data
ABSTRACT
Learning collocations is one of the most challenging aspects of language learning as there are literally
hundreds of thousands of possibilities for combining words. Corpus consultation with concordancers
has been recognized in the literature as an established way for language learners to study and explore
collocations at their own pace and in their own time, although not without technological and sometimes
cost barriers. This paper describes the development and evaluation of a learner-friendly collocation
consultation system called FlaxLC, in a design departure from the traditional concordancer
interface. Two evaluation studies were conducted to assess the learner-friendliness of the system: a
face-to-face user study to find out how international students in a New Zealand university used the
system to collect collocations of their own interest and a user query analysis—based on an observable
artefact of how online learners actually used the system over the course of one year—to examine how
the system is used in real life to search and retrieve collocations.
Keywords
Collocation Database, Collocation Learning, Corpus-Based Language Learning, Data-Driven Learning, User
Query Data Analysis
INTRODUCTION
Collocations, recurrent word combinations, have been widely recognized as an important aspect
of vocabulary knowledge (Firth, 1957; Lewis, 2008; Nattinger & DeCarrico, 1992; Sinclair, 1991;
Nation, 2013). Dictionaries and, more recently, corpus analysis tools (i.e., concordancers and the like)
are the two main resources that learners draw on while acquiring such knowledge. Printed dictionaries
are the popular and traditional collocation learning resource—for example, the BBI Combinatory
Dictionary of English (Benson et al., 1984), the Dictionary of Selected Collocations (Hill and Lewis
1997) and the Oxford Collocation Dictionary (2009)—dedicated to assisting learners in mastering this
essential knowledge. With corpus analysis tools, learners can enter a word and explore what words
are most likely to occur before or after it. These tools, whether web-based (e.g. the Collins COBUILD
Corpus, WebCorp, WebCollocate, Mark Davies’ Brigham Young Corpora, COCA) or stand-alone
(e.g., WordSmith Tools, AntConc) are specifically designed for linguists, and come with different
interfaces, search functions, and presentation of results, which are mostly in the form of keyword-in-context (KWIC) fragments and incomplete sentences. In terms of retrieving collocations, they facilitate the search for two- or three-word collocations.

DOI: 10.4018/IJCALLT.2019040104
Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
International Journal of Computer-Assisted Language Learning and Teaching
Volume 9 • Issue 2 • April-June 2019
Corpus-based tools have been explored by many researchers and teachers to facilitate collocation
learning with promising results as demonstrated in the literature (Boulton & Cobb, 2017). They have
been used in helping students find correct word combinations (e.g., “cause a problem” vs. “bring
a problem”) (Yoon, 2008; Chen, 2011; Daskalovska, 2015; Vyatkina, 2016), understand the subtle
meaning of certain verbs that lack direct L1 equivalents: synonyms (e.g. construct, build, and establish),
hypernyms (e.g. create and compose) (Chan and Liou, 2005), and identify common word choice
errors in student writing (Chambers and O’Sullivan, 2004; Wu, et al. 2009). Johns (1991) used the
term “data-driven learning” (DDL) to describe this approach that centers on fostering learners’ skills
in becoming a “language researcher”. Despite DDL’s great potential as presented in the literature,
DDL has not been widely accepted by mainstream language educators (Leńko-Szymańska & Boulton,
2015, p. 3). Technical challenges facing both teachers and students go some way toward explaining
the reluctance to implement DDL in the classroom. This view is supported by the results of a large-scale
survey conducted by Tribble on using corpora in language teaching (Tribble, 2015).
User-friendliness and free access are reported to be two major factors influencing respondents’
willingness to use corpora, while “don’t know how to” and “are not familiar with” are among the
reasons given for not using corpora. DDL researchers have also reported several factors that
may hinder corpus use, including requirements of metalinguistic knowledge (e.g., part-of-speech
tags) to formulate queries, unfamiliarity with complex search interfaces and functions, overwhelming
results, and difficulties in locating and interpreting target language features in concordances (Yoon
& Hirvela, 2004; O’Sullivan & Chambers, 2006; Yeh, Li, & Liou, 2007; Chen, 2011; Rodgers et
al., 2011; Boulton, 2012a; Chang, 2014; Geluso & Yamaguchi, 2014; Daskalovska, 2015). For
example, Chang (2014) asserts that the differing interfaces and functions of various corpus tools
further increase the technical challenge, whereby learners generally need to learn a new system in
order to access a different corpus.
To make corpus tools accessible to language learners, Boulton (2012b) published a call to
“simplify the technology as much as possible, either the corpus itself or the associated software”.
DDL researchers have also proposed recommendations for improving corpus software,
for example, “simplified”, “controlled” or “screened” concordancer output before it is presented
to students (Varley, 2009; Daskalovska, 2015), and simple interfaces that return multi-word units,
for example, make a presentation or give a presentation for the search query word presentation
(Chan & Liou, 2005). Chen (2011) evaluated three corpus tools—the Hong Kong Polytechnic Web
Concordancer, the COBUILD Concordancer, and BNC Sample Search tool—and proposed an ideal
collocation retrieval tool, one that needs to include: the position of the search term (before or after
the query word), functions for phrase or multiple word searches, support for using L1 in queries,
syntactic information about the collocates (verb, noun, adjective, adverb), retrieval of all relevant
collocations at once, and the clustering of semantically related collocations. In recent years, it
has been encouraging to see the development of dedicated tools for collocation learning (e.g., SKELL)
with some of the aforementioned proposed functionalities included.
This paper describes our attempts at building and evaluating a collocation retrieval system called
FlaxLC and answers two research questions: first, how can a collocation consultation system be designed to be learner friendly; and second, how do learners use FlaxLC to look up collocations?
To answer the first research question, we proposed two design principles that aim to minimize
learning and training efforts for using a corpus-based tool. The first principle is to capitalize on learners’
familiarity with traditional language resources (i.e. dictionaries); the second is to utilize learners’
existing web skills with search engines. To answer the second research question, we conducted two
evaluations. First, 32 Chinese postgraduate students at a New Zealand university were divided into
two groups of 16, with one group receiving training in the use of the FlaxLC system and the other
group receiving training in the BYU-BNC corpus system to look up and collect collocations specific
to writing task topics. Second, user queries sent to FlaxLC were analysed to examine how the system
was used over the period of one year. The log data analyses that we present in this study, similar to
traditional analyses of user queries on the Web, provide interesting and revealing insights that could
not be gained from small-scale, focused user studies. To the best of our knowledge, this user query
data analysis approach has not been explored in DDL research.
Design Principles
Dictionaries are arguably the most popular resources in language study, and therefore the most
familiar to learners. This is evidenced in Tribble’s (2015) iterative and wide-ranging survey into the
use, or lack thereof, of corpus tools, which placed dictionaries in the top position among the most
commonly used language resources. After studying the different definitions of collocations in the
literature, and investigating the structure, organization, and language items found in traditional
collocation dictionaries, we designed FlaxLC to mimic the structure of a traditional collocation
dictionary. The following two questions were used to guide our software design:
This definition is used by the Oxford, BBI, and LTP dictionaries and Ackermann and Chen’s Academic Collocation
List (2013), and is supported by linguists (for instance, Firth, 1957; Sinclair, 1991), and by language
learning researchers (for instance, Nation, 2013; Nattinger & DeCarrico, 1992; Nesselhauf, 2004).
Many DDL researchers (for instance, Chan and Liou, 2005; Yoon, 2008; Wu et al., 2009; Chen, 2011;
Daskalovska, 2015; Vyatkina, 2016) have also used corpus tools with students to study particular
collocation patterns (e.g., verb + noun, adjective + noun, verb + preposition).
We focused on collocations that contain from two to five contiguous words, using fourteen
different collocation patterns: eight derived from the research and development of two collocation
dictionaries, the BBI Combinatory Dictionary of English (Benson et al., 1997) and the Oxford Collocation
Dictionary for Students of English (McIntosh et al., 2009), and six that we added ourselves (see Appendix A
for all the patterns and Wu et al., 2016 for a detailed description). Some of these types are extended
to include more constituents of potential use to learners. For example, the noun part of a verb + noun
collocation can include a complex noun phrase involving one or more nouns coupled with modifiers
or prepositions: examples are take full advantage of, play an extremely important role. Wei (1999,
p. 4) supports this approach, arguing that it incorporates syntax into a predominantly semantic and
lexical construct, thus encompassing a wide range of data. Collocations containing very common
adverbs like more, much, very, quite are removed from the patterns involving adverbs because they
can accompany most adjectives and verbs.
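The pattern-based extraction described above can be sketched as matching part-of-speech tag sequences against a pattern set. This is a minimal illustration, not FlaxLC's actual implementation: the pattern subset, tag names, and adverb stop-list below are simplified stand-ins.

```python
from collections import Counter

# Illustrative subset of collocation patterns as POS-tag sequences
# (FlaxLC uses fourteen patterns; these are stand-ins).
PATTERNS = {
    "adjective + noun": ["ADJ", "NOUN"],
    "verb + noun": ["VERB", "NOUN"],
    "verb + det + noun": ["VERB", "DET", "NOUN"],
}

# Very common adverbs are excluded because they can accompany
# most adjectives and verbs.
COMMON_ADVERBS = {"more", "much", "very", "quite"}

def extract_collocations(tagged_sentence):
    """Collect word sequences whose POS tags match a known pattern.

    tagged_sentence: list of (word, pos_tag) pairs.
    """
    found = Counter()
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    for name, pattern in PATTERNS.items():
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == pattern:
                span = words[i:i + n]
                if any(w in COMMON_ADVERBS for w in span):
                    continue
                found[(name, " ".join(span))] += 1
    return found

sentence = [("take", "VERB"), ("full", "ADJ"), ("advantage", "NOUN")]
print(extract_collocations(sentence))  # counts "full advantage" as adjective + noun
```

In a real pipeline the counts from every sentence in the corpus would be merged, giving the per-pattern frequencies that drive the display ordering.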
with their frequency. These all contain the words economic and benefit, whether adjacent or not;
are of the adjective + noun type; and have up to five words. Many include more than one noun or
adjective, such as direct economic benefit, and significant national economic benefit. Clicking a
collocation retrieves samples in context from the original text: Figure 4 shows six samples for direct
economic benefit.
Collocations are ordered by frequency to draw learners’ attention to the most commonly used
ones. This is achieved in three ways: the most frequent syntactic type of the query word, the most
frequent collocation pattern, and the most frequent collocation. For example, with the query term
benefit, its collocations are first grouped under its noun and verb forms. The noun collocations are
displayed first because they are more frequent than the verb collocations. Within the noun group,
adjective + benefit, noun + benefit, benefit + of + noun, verb + benefit … are presented in order of
frequency, and within each collocation pattern the most frequent collocation is always listed first.
The same applies to the verb group.
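The three-level frequency ordering can be sketched as follows, assuming collocations arrive as (word class, pattern, collocation, frequency) tuples; the input format and sample data are illustrative, not FlaxLC's internal representation.

```python
from collections import defaultdict

def order_collocations(collocations):
    """Group collocations by word class, then by pattern, with each
    level ordered by total frequency (most frequent first).

    collocations: list of (word_class, pattern, collocation, freq) tuples.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for word_class, pattern, colloc, freq in collocations:
        groups[word_class][pattern].append((colloc, freq))

    total = lambda items: sum(f for _, f in items)
    ordered = []
    # Word classes in descending order of total frequency...
    for wc in sorted(groups, key=lambda w: -sum(total(v) for v in groups[w].values())):
        patterns = []
        # ...then patterns, then individual collocations.
        for pat in sorted(groups[wc], key=lambda p: -total(groups[wc][p])):
            patterns.append((pat, sorted(groups[wc][pat], key=lambda x: -x[1])))
        ordered.append((wc, patterns))
    return ordered

data = [
    ("noun", "adjective + benefit", "economic benefit", 120),
    ("noun", "adjective + benefit", "direct benefit", 40),
    ("noun", "verb + benefit", "provide benefit", 55),
    ("verb", "benefit + from", "benefit from", 90),
]
result = order_collocations(data)
print(result[0][0])  # the noun group comes first (215 vs. 90 occurrences)
```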
Figure 1. Family words, synonyms, related words, definitions, related topics, and collocations associated with the word benefit
Word Autocomplete
Misspelling is common in search engine queries. Wang et al. (2013) reported a misspelling rate of
26% for query words on an academic website they examined. Google’s autocomplete facility consults
historical query terms to provide hints while the user is typing. This approach of reusing historical
query terms is not applicable to FlaxLC user queries because learners’ vocabulary is likely to be
small and limited; the misspelling rate would therefore be high in learners’ historical query terms.
FlaxLC’s Word Autocomplete function helps users form correct English words by consulting a
specially built dictionary comprising 32,000 word entries extracted from a Wikipedia article
corpus of three billion words. Words are sorted by frequency; inflected forms of a word (e.g., takes,
taken, taking for the word take) and rare words (i.e., words that occur only once in Wikipedia) are omitted to
achieve a good user interface response time. To avoid overwhelming users with too many language
choices, only up to twenty suggestions are given at a time, for example, present, previous, president,
previously, pressure, presence, press, prevent, presidential, preparation, prestigious, predecessor,
premiere, predominantly, pregnant, presentation, present-day, prepare are presented when the letters
“pre” are typed into the search box.
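The suggestion mechanism can be sketched as a prefix filter over a frequency-sorted word list. The class below is a simplified stand-in, with a toy dictionary in place of the 32,000-entry one built from Wikipedia.

```python
class WordAutocomplete:
    """Prefix-based suggestions from a frequency-sorted dictionary.

    A minimal sketch: FlaxLC's dictionary is built from a
    three-billion-word Wikipedia corpus with inflected forms and
    once-occurring words removed; here we use a toy word/frequency list.
    """

    MAX_SUGGESTIONS = 20  # avoid overwhelming users with choices

    def __init__(self, word_freqs):
        # Keep words sorted by descending frequency so the most
        # common matches are suggested first.
        self.by_freq = sorted(word_freqs, key=lambda wf: -wf[1])

    def suggest(self, prefix):
        matches = [w for w, _ in self.by_freq if w.startswith(prefix)]
        return matches[: self.MAX_SUGGESTIONS]

ac = WordAutocomplete([("present", 900), ("previous", 700),
                       ("president", 650), ("pressure", 400),
                       ("prepare", 300), ("prelude", 2)])
print(ac.suggest("pre"))  # most frequent "pre-" words first
```

A production version would use a trie or sorted-array binary search rather than a linear scan, but the ranking principle is the same.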
Related Words
The related words function in FlaxLC extends Chen’s (2011) idea of retrieving words that are
semantically related to the query term. It has been designed to help learners expand their word and
collocation knowledge, especially in domain-specific areas, or on topics related to what they are
studying. We explored the possibility of using the publicly available and growing Wikipedia corpus
of articles to present related words and collocations. FlaxLC first finds the best matching Wikipedia
article and then the keywords and collocations of that article are retrieved. The collocations are
then grouped by the keywords they contain. Figure 5 shows the first 40 words related to the query
animal testing, sorted by TF-IDF (term frequency-inverse document frequency) score, including
animal, primates, test, experiments, research, vivisection, etc. This metric, commonly used
in information retrieval (described by, for example, Witten et al., 1999), ranks words by their
relatedness to the query so that they can be displayed in descending order of relatedness.
Clicking one of the hyperlinked related words, say toxicity, reveals collocations associated with that
word. For example, adjective collocates (acute, general, chronic, embryonic toxicity), verb collocates
(reflect, involve, evaluate toxicity), and noun phrases (toxicity tests, sign of toxicity, toxicity of a
substance). More words can be displayed by clicking the more button. Towards the end of the list
the words become more general: for example, the last group of words related to animal testing are
population, line, end, series, form, play, have, be.
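The ranking behind the related-words display can be sketched with a standard TF-IDF computation, assuming we have word counts for the best-matching article and document frequencies for the whole corpus; the exact scoring variant FlaxLC uses may differ in detail.

```python
import math
from collections import Counter

def rank_related_words(article_tokens, corpus_doc_freqs, n_docs, top_n=40):
    """Rank an article's words by TF-IDF so topic-specific words come
    first and very general words (high document frequency) sink to
    the end of the list.

    article_tokens: tokens of the best-matching article.
    corpus_doc_freqs: word -> number of documents containing it.
    n_docs: total number of documents in the corpus.
    """
    tf = Counter(article_tokens)
    scores = {}
    for word, count in tf.items():
        df = corpus_doc_freqs.get(word, 1)
        # Term frequency times inverse document frequency.
        scores[word] = count * math.log(n_docs / df)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

tokens = ["toxicity", "animal", "testing", "toxicity", "the", "the", "the"]
doc_freqs = {"toxicity": 50, "animal": 400, "testing": 300, "the": 100000}
print(rank_related_words(tokens, doc_freqs, n_docs=100000))
```

This also explains the behaviour noted above: words like have and be appear in nearly every document, so their inverse document frequency approaches zero and they appear only at the very end of the related-words list.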
EVALUATIONS
After building the system, evaluations were conducted to investigate the second research question, how
learners use FlaxLC to look up collocations. Unlike other DDL research that primarily examines the
educational effectiveness of corpus consultation in language learning (with many promising results
reported in the literature), we focused on learner interests and behaviors while they engage
with the system. A user study was conducted to find out the types of collocations that learners are
interested in, and a user query analysis was carried out to discover user behaviors when interacting
with FlaxLC. This section first looks at a small-scale study with 32 language learners, followed by
an analysis of user queries over the period of one year.
User Study
Participants recruited for the study were 32 Chinese postgraduates aged 20 to 30, primarily
male (20 males and 12 females). The participants were divided into two groups, with
one group using FlaxLC and the other using the BYU-BNC. Our primary reasons for conducting a
comparative study with BYU-BNC were two-fold: first, to ascertain whether FlaxLC is easier to use
for retrieving collocations and, second, to determine whether language learners are interested
in employing collocations in their writing that consist of more than two words and would
normally require more effort to identify using a traditional linguistic tool such as a concordancer.
The study took place as part of a tutorial that provided academic writing training to international
students. The tutorial was intended to introduce FlaxLC to students so that they could use it to look up the
collocations of a word while writing a report. This experiment enabled us to determine what types of
collocations students would be interested in without explicitly instructing them to focus on particular
patterns, for instance, verb + preposition in Vyatkina’s study (2016), verb + noun in Chen’s study
(2011), or verb + adverb in Daskalovska’s study (2015). We did not use SKELL, another
dedicated learner tool that bears the closest resemblance to our system, because traditional
concordancers (Mark Davies’ Brigham Young Corpora (BYU), WordSmith Tools, AntConc, etc.) are the
most widely used tools according to Tribble’s survey (2015) and those reported in the DDL literature.
Method
Both groups undertook the same two collocation retrieval tasks over the period of one hour, at
different time slots, using FlaxLC and BYU respectively. None of the participating students had used
either corpus tool prior to the study. To allow us to effectively compare the results of the two groups,
the FlaxLC group was instructed to use the BNC corpus only (and not the Wikipedia and BAWE
corpora from the drop-down menu options in the FlaxLC) as it is the same corpus used within the
BYU-BNC system. Before completing the collocation collection tasks, participating students from
both groups attended the same tutorial where our definition of collocations was explained, and the
importance of collocation learning was discussed. Instructions and demonstrations on how to use the
tools were given in a separate session. An additional 15 minutes were allocated to the BYU group
for learning the BYU-BNC system’s part-of-speech tags, pattern searching, and color-coding. The
two collocation retrieval tasks were:
1. Collect 10 collocations of a given word, support, including both the noun and verb forms
2. Collect 10 collocations of a given topic, “environmental protection”
The first task asked students to collect collocations containing the word support as a noun and a
verb. This was to draw students’ attention to multiple class words and their associated collocations.
The second task was to collect collocations that the students thought would be useful in writing an
essay on “environmental protection”.
Results
The FlaxLC group completed the two tasks within an average of 25 minutes, and the BYU group
within an average of 35 minutes. Each group collected 320 collocations. The results are given in Table
1. In the first column, collocations are divided into two-word and multi-word groups according to the
number of constituents. Two-word collocations conformed to the patterns of adjective + noun (financial
support), noun + noun (government support) and verb + noun (caused damage). The multi-word
group is further divided into sub-groups based on syntactic patterns: verb + noun (including articles
and prepositions) (support this view, protect the interest of), verb/adjective/noun + infinitive-to +
verb (necessary to support), verb/noun + preposition + noun (threat to the environment, impact on
the environment), and noun + of + noun (lack of support, destruction of the environment).
Table 1 indicates that 84% of the collocations collected by the BYU participants were two-word
collocations, compared with 45.6% for the FlaxLC participants. Looking at the multi-word
collocations collected, the FlaxLC participants collected more than twice as many verb + noun
collocations as the BYU group (77 vs. 34), with 86% (66/77) of the FlaxLC participants’ multi-word
verb + noun collocations containing articles and prepositions, compared with 11.8% (4/34) for the
BYU group. Furthermore, compared with the BYU group, the FlaxLC group collected more than
seven times as many verb/adjective/noun + infinitive-to + verb collocations (42 vs. 4) and verb/noun
+ preposition + noun collocations (21 vs. 3). The number of noun + of + noun multi-word collocations
collected showed a similar trend, with three times more collected by the FlaxLC group than the BYU group.
multi-word collocations collected by the FlaxLC group indicate students’ preference for collecting
and employing longer chunks, which serve as “points of fixation” or “islands of reliability” for
developing language proficiency (Dechert, 1984). The FlaxLC participants also showed a tendency to
collect collocations containing articles and prepositions when these were presented directly in the
FlaxLC search results. Both groups showed an interest in the noun + of + noun and noun + preposition +
noun patterns, which have been identified as two of the key components of grammatical complexity and
text density in academic writing (Biber et al., 2011; Biber & Gray, 2011; Halliday, 1993), but which
studies show have been underused in EAP student writing (Lu, 2011; Parkinson & Musgrave, 2014).
The 13.1% of verb/adjective/noun + infinitive-to + verb collocations collected suggests the
popularity of this type of grammatical pattern among the FlaxLC students. However, more studies
are needed to confirm whether these findings apply to other groups with different L1s.
The results of this study support our decision to include grammatical collocations
and multi-word collocations that contain articles and prepositions. They also suggest that the
effectiveness of the noun + preposition + noun and noun + of + noun patterns warrants further
investigation in DDL research.
User Query Analysis
FlaxLC records user queries (user actions or requests for information) in log files while the user is
interacting with the system. We analyzed user queries sent to FlaxLC over the period of one year
(from June 2016 to June 2017) to further investigate the research question, how learners use FlaxLC
to look up collocations, on a large scale. In particular, our analysis helped us to establish FlaxLC
users’ origins, preferences, and typical behaviors. It also provided insights for improving
FlaxLC in the future.
Interaction 1 (Retrieving Collocations)
The user enters a query term by typing in a word (advantage) or by selecting a word from one
of the search suggestions (i.e., family words, synonyms, antonyms, or related words) to retrieve
collocations. FlaxLC returns collocations and displays them on a web page as shown in Picture 1.
Interaction 2 (Viewing Extended Collocations)
The user clicks on a hyperlinked collocation (take advantage of) to view extended collocations
as shown in Picture 2.
Interaction 3 (Viewing More Collocations)
The user clicks the “more” button to retrieve more collocations (offer the advantage of) as shown
in Picture 3.
Interaction 4 (Viewing Sample Sentences)
The user clicks on a hyperlinked collocation (took full advantage of) to view sample sentences
containing collocations in context.
During the course of interactions, the user can leave the system at any point and at any time. For
example, the user could enter a query term, have a look at the collocations from the search results and
then depart. User interactions or a sequence of interactions as shown above in Figure 6 are recorded
as query entries in log files. Below are three examples.
Square brackets divide an entry into three parts. The timestamp (2016-09-03 04:14:34) tells when
a query arrived at the system. The query parameters (s=CollocationQuery... s1.from=wf) provide the
details for decoding an interaction. The last part (31.205.128.106) is the IP address, from which the
geographic region of a query is derived, in this case Birmingham, UK. The three entries above indicate the following sequence
of interactions from a user:
• clicked the word advantage (s1.query=advantage) on the “family words” panel (s1.from=wf),
• chose the BAWE corpus (s1.dbName=BAWE),
• after 6 seconds, clicked take advantage of for extended collocations,
• after 3 seconds, clicked took full advantage of for the sample sentences.
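Decoding such an entry can be sketched as below. The three bracketed fields follow the description above; the full parameter string used in the example is illustrative, since the original log example is abbreviated.

```python
import re
from datetime import datetime

# Each entry has three square-bracketed parts:
# [timestamp] [query parameters] [IP address]
ENTRY_RE = re.compile(r"\[(.*?)\]\s*\[(.*?)\]\s*\[(.*?)\]")

def parse_entry(line):
    m = ENTRY_RE.match(line)
    if not m:
        return None
    ts, params, ip = m.groups()
    # Parameters are key=value pairs, e.g. s1.query=advantage
    kv = dict(p.split("=", 1) for p in params.split() if "=" in p)
    return {
        "time": datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "params": kv,
        "ip": ip,
    }

entry = parse_entry(
    "[2016-09-03 04:14:34] [s=CollocationQuery s1.query=advantage s1.from=wf] [31.205.128.106]"
)
print(entry["params"]["s1.query"])  # the query term
```

Mapping the IP address to a geographic region would be done with a separate geolocation database.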
Using the user query entries recorded over one year, we conducted a general analysis to
provide a statistical overview of how the system is used, and a user behavior analysis to examine
in depth how users interact with the system.
General Analysis
This analysis provides simple statistics about the number of queries, the geographic regions of
queries, and users’ preference among the three databases (see section “Building FlaxLC from different
corpora”).
354,694 queries from 67 countries were recorded with an average of 971 queries per day. Table
2 shows the top 10 countries and corresponding percentages. Queries from 57 other countries are
grouped under the “Other” category. About two thirds (65%) of queries were from three English-
speaking countries: The United Kingdom (28%), New Zealand (24%) and Australia (13%). The
Republic of Korea is at the top of the list among all non-English-speaking countries, followed by
China, Russia, Belarus and Israel.
The popularity scores of the three databases—Wikipedia, BAWE, and BNC—are given in Table
3, along with the statistics of user preferences by country. The Wikipedia database (53.2%) is the most
popular, but we need to consider that the Wikipedia corpus is the default corpus offered by the FlaxLC
system, i.e. users need to select the BAWE or BNC corpora from the drop-down menu and switch
explicitly. The BAWE corpus comes in at second place and this may indicate an increased focus on
learning academic English by users. The user preferences by country shows that New Zealand users
prefer Wikipedia and the BNC, and that users in the Republic of Korea prefer the Wikipedia corpus.
The BAWE corpus is the most popular among United Kingdom users (50.9%) where the BAWE
corpus was incidentally developed at three UK universities, followed by Australian users (21.1%).
The results are mixed and not distinctive among other countries.
Table 4 gives the statistics on how users formulate a query term. 219,190 query terms were
submitted, an average of 601 per day. 82.64% were single-word queries (i.e., users typed in
a single word); only a small percentage (8.7%) were multi-word queries. Family words were the most
popular query formation aid, followed by synonyms, related words, and antonyms. Notably, the
misspelling rate in query terms was 0.5% (840 out of 181,508), compared with 28% before the Word
Autocomplete facility was available on FlaxLC.
times over the period of one year, more than once per day. Multi-word queries painted an interesting
picture. Users tended to include articles and prepositions in their query terms (make a mistake, scale
of problem). There were a number of phrasal verbs (point out, lead to, focus on, carry out, make up),
and discourse markers commonly used in academic writing (due to, in order to, in terms of, such
as, as well as).
Term Categories
Search engine query terms are commonly classified by topic, such as Sexual, Social,
Education, Sports, or News, in Web user analyses (Li et al., 2005; Jansen et al., 2000;
Ross & Wolfram, 2000). This approach is not useful for the Flax project, however, because our users
seek language patterns, not websites for information. Instead, we grouped query words into four
categories using wordlists developed by language researchers.
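The grouping can be sketched as a lookup against each wordlist in turn. The tiny wordlists below are illustrative stand-ins for West's first and second 1000 words and Coxhead's Academic Word List.

```python
from collections import Counter

# Toy stand-ins for the real wordlists (West's General Service List
# bands and Coxhead's Academic Word List).
WEST_1000 = {"make", "time", "work"}
WEST_2000 = {"benefit", "support"}
ACADEMIC = {"analyse", "concept", "environment"}

def categorize(word):
    # Check the lists in order of frequency band; anything not
    # found falls into the off-list category.
    if word in WEST_1000:
        return "West top 1000"
    if word in WEST_2000:
        return "West top 2000"
    if word in ACADEMIC:
        return "Academic (AWL)"
    return "Off-list"

queries = ["benefit", "concept", "make", "vivisection"]
print(Counter(categorize(q) for q in queries))
```

Run over all 181,508 query words, such a tally yields the per-category percentages reported in Table 7.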
Table 7 shows the statistics of query words as they correspond to each of the word-list and off-list
categories. Academic words (40.2%) are the most popular, followed by West’s top
1000 words (24%) and West’s top 2000 words (10.8%). That is, 75% of word queries come from either
West’s or Coxhead’s word lists, and only 25% are off-list words. The top 100
query words in each category are given in Appendix B.
We also examined the number of unique query words in each word list. The results are given in
Table 8, where the second column displays the total number of words (i.e., headwords plus family
words) in each word list. 86% of Coxhead’s academic words occurred in the query words, and the
proportion of West’s top 1000 words was slightly higher than that of West’s top 2000 words
(66.9% vs. 54.3%).
The series numbered 1-5 above indicate that the user made five sequential queries: entered a
query term; viewed extended collocations; viewed sample sentences of a collocation; entered another
query term; and viewed more collocations. We computed the number of queries, the types of queries,
and the time spent on the website from a series of user queries within a 10-minute time frame. The
10-minute time frame was chosen based on the observation that 66% of our users make 2-10 queries
per day, and on the assumption that the average user spends at most 1 minute per query.
Table 9 shows the number of queries made within 10 minutes. About 30% of users made a single
query. The majority (69%) made at least two queries, with 48.8% making two to five, 14.5% making
six to ten, and 5.7% making eleven to twenty queries. A small group (1.2%) made at least twenty-one
queries. The average number of queries per user was 6.29 over the 10-minute time frame.
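The per-user computation can be sketched as splitting a user's sorted query timestamps into 10-minute windows and reporting query count and duration per window; the exact windowing rules used in FlaxLC's analysis may differ.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def session_stats(timestamps):
    """Split one user's query timestamps into 10-minute windows and
    return (query_count, duration_in_seconds) for each window.

    Duration is last-query time minus first-query time within a
    window, so a single-query window has duration zero, matching
    the estimation described in the text.
    """
    sessions = []
    start = None
    current = []
    for ts in sorted(timestamps):
        if start is None or ts - start > WINDOW:
            # Begin a new window.
            if current:
                sessions.append(current)
            start, current = ts, [ts]
        else:
            current.append(ts)
    if current:
        sessions.append(current)
    return [(len(s), (s[-1] - s[0]).total_seconds()) for s in sessions]

t = datetime(2016, 9, 3, 4, 14, 34)
stamps = [t, t + timedelta(seconds=6), t + timedelta(seconds=9),
          t + timedelta(minutes=30)]
print(session_stats(stamps))  # [(3, 9.0), (1, 0.0)]
```

Averaging the first element over all windows gives queries per user, and averaging the second gives the time-spent statistic.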
The statistics for the types of queries are based on the four possible interactions shown in Figure
6, i.e., Retrieving Collocations (Interaction 1 in Figure 6), Viewing Extended Collocations (Interaction
2 in Figure 6), Viewing More Collocations (Interaction 3 in Figure 6) and Viewing Sample Sentences
(Interaction 4 in Figure 6). Table 10 shows that retrieving collocations makes up 69.8% of interactions,
followed by viewing extended collocations (15.3%), viewing sample sentences (8.2%), and viewing
more collocations (6.7%).
The duration of a series of queries is calculated by subtracting the time of the first query (e.g.,
retrieving collocations) from the time of the last query (e.g., viewing more collocations). Note that
the duration is only an estimate of time spent on the website, and only for users who made at least
two queries (the duration is zero for a series comprising a single query). The calculation also excludes
whatever time the user spent after the last query (e.g., time spent viewing more collocations). Table 11
provides the statistics for time spent on the FlaxLC website. In total, 41.5% of users spent less than
two minutes, corresponding to the 48.8% of users who made two to five queries (Table 8). A further
38.5% (13.8% + 11.7% + 12.9%) spent between two and eight minutes on the FlaxLC website, whereas
20.1% stayed active on the site for the full 10-minute duration. The average time spent on FlaxLC
was 4.01 minutes per user over the 10-minute frame.
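The bookkeeping described above can be sketched in code. The following is a minimal, hypothetical illustration (the function and field names are our own, not the actual FlaxLC implementation): a user's timestamped queries are grouped into 10-minute windows anchored at the first query, and each window reports its query count and its estimated duration, computed as last-query time minus first-query time, so time after the final query is excluded and a single-query window has duration zero.

```python
from datetime import datetime, timedelta

def summarise_sessions(events, window_minutes=10):
    """Group a user's (timestamp, query_type) events into fixed
    windows anchored at each window's first query, then report
    the number of queries and estimated duration per window.
    Duration = last query time - first query time, so it is zero
    for a single-query window and excludes time spent after the
    final query."""
    events = sorted(events)
    window = timedelta(minutes=window_minutes)
    sessions, current, start = [], [], None
    for ts, qtype in events:
        # Start a new window when the gap from the anchor exceeds the frame.
        if start is None or ts - start > window:
            if current:
                sessions.append(current)
            start, current = ts, []
        current.append((ts, qtype))
    if current:
        sessions.append(current)
    return [
        {
            "queries": len(s),
            "duration_min": (s[-1][0] - s[0][0]).total_seconds() / 60,
        }
        for s in sessions
    ]

# Toy log: three queries within 10 minutes, then one much later.
t0 = datetime(2019, 4, 1, 9, 0)
log = [
    (t0, "retrieve"),
    (t0 + timedelta(minutes=2), "extend"),
    (t0 + timedelta(minutes=5), "samples"),
    (t0 + timedelta(hours=3), "retrieve"),
]
print(summarise_sessions(log))
# → [{'queries': 3, 'duration_min': 5.0}, {'queries': 1, 'duration_min': 0.0}]
```

Averaging the `queries` and `duration_min` fields over all windows and users would yield figures analogous to the 6.29 queries and 4.01 minutes reported above.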
words students are having difficulty with or are interested in when learning and employing
collocations.
The dramatic decrease in misspelled query terms and the popularity of the query formation
aids—word autocomplete, family words, synonyms, related words and antonyms—suggest that such
facilities are essential in a learner-friendly corpus tool. A further route for development would be to
use word embedding technology6 to provide semantically similar words both in general text and in
academic text from different disciplines.
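The core operation behind this proposal is ranking vocabulary by vector similarity. The sketch below illustrates the idea with tiny hand-made vectors standing in for real embeddings (which would in practice be trained on a large corpus); the words, vector values and function names are purely illustrative, not part of FlaxLC.

```python
import math

# Toy, hand-made 3-dimensional vectors standing in for real word
# embeddings; real models use hundreds of dimensions learned from text.
vectors = {
    "benefit":   [0.9, 0.1, 0.2],
    "advantage": [0.8, 0.2, 0.3],
    "profit":    [0.7, 0.1, 0.4],
    "banana":    [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity: dot product scaled by the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similar_words(word, k=2):
    """Return the k words whose vectors are closest to `word`'s."""
    target = vectors[word]
    ranked = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine(target, vectors[w]),
        reverse=True,
    )
    return ranked[:k]

print(similar_words("benefit"))
# → ['advantage', 'profit']
```

With discipline-specific embeddings, the same ranking would surface neighbours appropriate to, say, engineering rather than general English.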
Our user engagement measures rest on the assumption that, within the 10-minute window,
the more time users spent on the website and the more queries they made, the more engaged they were.
The average number of queries was 6.29 and the average time spent on FlaxLC was 4.01 minutes per
user. However, it is difficult to compare our results with those of similar studies because different
window durations would yield different results. The closest comparison we have found in the literature
is Jansen et al.'s study of the Internet search engine Excite7, in which the average number of queries
per user was 2.84, although the paper does not specify the time window. Comparing the time spent
on FlaxLC with that on other websites is even harder because of the differing nature of
websites; for example, a user would most likely spend more time on YouTube or a gaming site
than on a university's homepage. Nevertheless, the fact that nearly 30% of queries were
non-retrieval interactions (viewing extended collocations, viewing more collocations, viewing sample
sentences) suggests that users spent a reasonable amount of time examining the search results in
FlaxLC, e.g., studying how a collocation is used in context in the example sentences.
In summary, this initial user query analysis has provided valuable insights that would be hard
to gain from small and short-term user studies. In spite of the benefits, such an analysis has its
weaknesses. Our data are only based on observable artefacts of what the users actually did: in our
case, when they searched, how long they stayed on the website, what word(s) they searched for,
which facilities they used (synonyms, antonyms, related words, etc.), whether they looked up sample
sentences of a collocation and so on. We know much less about why users did what they did and
whether they were satisfied with the results the system returned. We also did not have any information
about the users themselves (except their geographic regions) or about what they do with the search
results—which collocations were taken away and whether or not they were used and how. This
limitation must be considered in analyses and complemented by other techniques (e.g., think aloud
protocols in combination with user studies and surveys to gather perception data) to provide a more
complete understanding of user behavior.
CONCLUSION
Collocation learning has been recognized as one of the most challenging and important aspects of
language learning. Corpus consultation provides a promising way for learners to self-study collocations
in their own time. There are many corpus tools with different user interfaces; some are free and some
are not, and some are available for students to use, but those designed for and dedicated to
language learners are few and far between. We have designed and built a learner-friendly collocation
consultation system that draws on learners' existing familiarity with dictionaries and search engines.
FlaxLC offers great potential for facilitating DDL in the language classroom, particularly
collocation learning through corpus consultation. Here we present some ideas demonstrated
by Wu et al. (2016) for preparing students for essay writing, whereby students are asked to collect
collocations or related words that are germane to a specific writing topic. This approach is akin
to brainstorming, in that new and inspiring ideas may be encountered through the collocations and
related-words functions. FlaxLC can also help students find the right words to express their ideas, for
example, finding appropriate verbs or adjectives for a particular noun, or adding adverbs to qualify
statements (i.e., hedging and boosting). To increase lexical range in student writing, students can use
FlaxLC to find synonyms or other members of the same word family to avoid overusing the same
word (e.g., searching for benefit generates the family word beneficial, and also provides an expanded
range of verb usages such as benefit consumers, benefit greatly from, able to benefit from, and benefit
from the use of).
FlaxLC currently houses three databases built from the British Academic Written English Corpus,
the British National Corpus and a Wikipedia corpus comprising three million articles. New and
extensive databases with high-quality academic text in different disciplines will be developed in
response to the increasing preference for academic English corpora.
The user study we conducted was small and simple, but it served well in assessing learner
interest in collecting collocations in FlaxLC. The results show a preference for longer collocations
and particular collocation patterns (e.g., verb/noun/adjective + infinitive-to + verb) among the group
of students who participated in our study. Further investigation is needed to find out whether the
results would vary for students with different English education and L1 backgrounds. Scaling
this research to include students from different backgrounds may yield pedagogical implications
for designing and providing targeted collocation tasks for such students. Longitudinal studies in
which FlaxLC is embedded into classroom language activities (for example, vocabulary learning and
writing) would also shed more light on how teachers and students intend to use, and actually use,
FlaxLC in their DDL practice. We invite participation from teachers and researchers, and believe
that further engagement from the language education community will lead to further refinement of
the system.
Our initial user query analysis, which has the benefit of being easy to capture at scale, not only
provided suggestions for improving the usability and experience of our system but also revealed
interesting facts about how FlaxLC is used. Such analyses can provide valuable information and
suggestions for DDL researchers and language teachers when helping their students study collocations.
They could also go some way toward answering research questions such as what makes a word difficult
to learn, by examining the collocations that students have looked at, or whether query terms differ
across geographical regions. We have recently added new facilities to track user interactions with
the system in more detail and to identify patterns in users' query reformulation strategies
(i.e., site-searching strategies). These additional facilities will also allow us to compare this analysis
with a further one in a year's time, and to compare users from English-speaking and non-English-speaking
countries. The results should yield new, in-depth insights into user behavior in corpus consultation.
REFERENCES
Ackermann, K., & Chen, Y.-H. (2013). Developing the Academic Collocation List (ACL) - A corpus-driven
and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247. doi:10.1016/j.
jeap.2013.08.002
Benson, M., Benson, E., & Ilson, R. (1986). The BBI combinatory dictionary of English: A guide to word
combinations. Amsterdam: John Benjamins. doi:10.1075/z.bbi1(1st)
Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations. Amsterdam:
John Benjamins. doi:10.1075/z.bbi1(2nd)
Biber, D., & Gray, B. (2011). Grammatical change in the noun phrase: The influence of written language use.
English Language and Linguistics, 15(2), 223–250. doi:10.1017/S1360674311000025
Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical
complexity in L2 writing development? TESOL Quarterly, 45(1), 5–35. doi:10.5054/tq.2011.244483
Boulton, A. (2012a). Hands-on / hands-off: Alternative approaches to data-driven learning. In J. Thomas & A.
Boulton (Eds.), Input, process, and product: Developments in teaching and language corpora (pp. 152–168).
Brno, Czech Republic: Masaryk University Press.
Boulton, A. (2012b). Wanted: Large corpus, simple software. No timewasters. In A. Leńko-Szymańska (Ed.),
TaLC10: 10th International Conference on Teaching and Language Corpora (pp. 1-6). Warsaw, Poland: Academic Press.
Boulton, A. (2015). Applying data-driven learning to the Web. In A. Leńko-Szymańska & A. Boulton (Eds.),
Multiple Affordances of Language Corpora for Data-driven learning (pp. 267–295). Amsterdam: John Benjamins.
Boulton, A., & Cobb, T. (2017). Corpus Use in Language Learning: A Meta-Analysis. Language Learning,
67(2), 348–393. doi:10.1111/lang.12224
Chambers, A., & O’Sullivan, Í. (2004). Corpus consultation and advanced learners’ writing skills in French.
ReCALL, 16(1), 158–172. doi:10.1017/S0958344004001211
Chan, T.-P., & Liou, H.-C. (2005). Effects of Web-based Concordancing Instruction on EFL Students’
Learning of Verb – Noun Collocations. Computer Assisted Language Learning, 18(3), 231–251.
doi:10.1080/09588220500185769
Chang, J.-Y. (2014). The use of general and specialized corpora as reference sources for academic English
writing: A case study. ReCALL: the Journal of EUROCALL, 26(2), 243–259. doi:10.1017/S0958344014000056
Chen, H.-J. H. (2011). Developing and evaluating a web-based collocation retrieval tool for EFL students and
teachers. Computer Assisted Language Learning, 24(1), 59–76. doi:10.1080/09588221.2010.526945
Conroy, M. (2010). Internet tools for language learning: University students taking control of their writing.
Australasian Journal of Educational Technology, 26(6), 861–882. doi:10.14742/ajet.1047
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. doi:10.2307/3587951
Daskalovska, N. (2015). Corpus-based versus traditional learning of collocations. Computer Assisted Language
Learning, 28(2), 130–144. doi:10.1080/09588221.2013.803982
Dechert, H. W. (1984). Second language production: Six hypotheses. In H. W. Dechert, D. Mohle, & M. Raupach
(Eds.), Second language productions (pp. 211–230). Tübingen, Germany: Gunter Narr Verlag.
Firth, J. R. (1957). Modes of Meaning. Papers in Linguistics 1934-51. Oxford University Press.
Gao, Z.-M. (2011). Exploring the effects and use of a Chinese-English parallel concordancer. Computer Assisted
Language Learning, 24(3), 255–275. doi:10.1080/09588221.2010.540469
Geluso, J., & Yamaguchi, A. (2014). Discovering formulaic language through data-driven learning: Student
attitudes and efficacy. ReCALL, 26(2), 225–242. doi:10.1017/S0958344014000044
Halliday, M. A. K. (1993). Some grammatical problems in scientific English. In M. A. K. Halliday & J. R.
Martin (Eds.), Writing science (pp. 69–85). London: The Falmer Press.
Hill, J., & Lewis, M. (Eds.). (1997). LTP Dictionary of Selected Collocations. LTP.
Jansen, B. J., & Spink, A. (2006). How are we searching the world wide web?: A comparison of nine search
engine transaction logs. Information Processing & Management, 42(1), 248–263. doi:10.1016/j.ipm.2004.10.007
Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis
of user queries on the web. Information Processing & Management, 36(2), 207–227. doi:10.1016/S0306-
4573(99)00056-4
Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. English Language Research
Journal, 4, 1–16.
Leńko-Szymańska, A., & Boulton, A. (2015). Multiple affordances of language corpora for data-driven learning.
Amsterdam: John Benjamins Publishing Company. doi:10.1075/scl.69
Lewis, M. (2008). Implementing the lexical approach: Putting theory into practice. London: Heinle Cengage
Learning.
Li, Y., Zheng, Z., & Dai, H. K. (2005). KDD Cup-2005 report: Facing a great challenge. SIGKDD Explorations
Newsletter, 7, 91–99. doi:10.1145/1117454.1117466
Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL
writers’ language development. TESOL Quarterly, 45(1), 36–62. doi:10.5054/tq.2011.240859
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge, UK: Cambridge University
Press. doi:10.1017/CBO9781139858656
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford, UK: Oxford
University Press.
Nesi, H., & Gardner, S. (2012). Genres across the Disciplines. Cambridge University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for
teaching. Applied Linguistics, 24(2), 223–242. doi:10.1093/applin/24.2.223
O’Sullivan, I., & Chambers, A. (2006). Learners’ writing skills in French: Corpus consultation and learner
evaluation. Journal of Second Language Writing, 15(1), 49–68. doi:10.1016/j.jslw.2006.01.002
Oxford Advanced Learner's Dictionary. (2000). (6th ed.). Oxford University Press.
Oxford Collocation Dictionary for Students of English. (2009). (2nd ed.). Oxford University Press.
Parkinson, J., & Musgrave, J. (2014). Development of noun phrase complexity in the writing of English
for Academic Purposes students. Journal of English for Academic Purposes, 14, 48–59. doi:10.1016/j.
jeap.2013.12.001
Ross, N. C. M., & Wolfram, D. (2000). End user searching on the internet: An analysis of term pair topics
submitted to the excite search engine. Journal of the American Society for Information Science, 51(10), 949–958.
doi:10.1002/1097-4571(2000)51:10<949::AID-ASI70>3.0.CO;2-5
Shei, C. C. (2008). Discovering the hidden treasure on the Internet: Using Google to uncover the veil of
phraseology. Computer Assisted Language Learning, 21(1), 67–85. doi:10.1080/09588220701865516
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Tribble, C. (2015). Teaching and language corpora: Perspectives from a personal journey. In A. Leńko-
Szymańska & A. Boulton (Eds.), Multiple Affordances of Language Corpora for Data-driven learning (pp.
37–62). Amsterdam: John Benjamins. doi:10.1075/scl.69.03tri
Varley, S. (2009). I’ll just look that up in the concordancer: Integrating corpus consultation into the language
learning environment. Computer Assisted Language Learning, 22(2), 133–152. doi:10.1080/09588220902778294
Vyatkina, N. (2016). Data-driven learning of collocations: Learner performance, proficiency, and perceptions.
Language Learning & Technology, 20(3), 159–179.
Wei, Y. (1999). Teaching collocations for productive vocabulary development (Report No. FL 026913).
Developmental Skills Department, Borough of Manhattan Community College, City University of New York.
West, M. (1953). A general service list of English words. Longman, Green & Co.
Wu, S., Franken, M., & Witten, I. H. (2009). Refining the use of the web (and web search) as a language teaching
and learning resource. Computer Assisted Language Learning, 22(3), 249–268. doi:10.1080/09588220902920250
Wu, S., Li, L., Witten, I. H., & Yu, A. (2016). Constructing a collocation learning system from the Wikipedia
corpus. International Journal of Computer-Assisted Language Learning and Teaching, 6(3), 18–35. doi:10.4018/
IJCALLT.2016070102
Yeh, Y., Li, Y.-H., & Liou, H.-C. (2007). Online synonym materials and concordancing for EFL college writing.
Computer Assisted Language Learning, 20(2), 131–152. doi:10.1080/09588220701331451
Yoon, H. (2008). More than a linguistic reference: The influence of corpus technology on L2 academic writing.
Language Learning and Technology. Retrieved from http://llt.msu.edu/vol12num2/yoon.pdf
Yoon, H., & Hirvela, A. (2004). ESL student attitudes toward corpus use in L2 writing. Journal of Second
Language Writing, 13(4), 257–284. doi:10.1016/j.jslw.2004.06.002
ENDNOTES
1. BYU-BNC (http://corpus.byu.edu/bnc/) is the British National Corpus in the Brigham Young University (BYU) corpora collection.
2. http://flax.nzdl.org/greenstone3/flax?a=fp&sa=collAbout&c=collocations
3. http://lexically.net/downloads/BNC_wordlists/e_lemma.txt
4. https://wordnet.princeton.edu/wordnet/
5. This can be experienced by visiting http://flax.nzdl.org/greenstone3/flax?a=fp&sa=collAbout&c=collocations and typing "advantage" in the query box.
6. A technique that quantifies and categorizes semantic similarities between words based on their distributional properties in a large body of text.
7. http://www.excite.com/
Appendix A
Appendix B
Table 13. Top 100 query words grouped by word lists (query words in West's top 1000 words).
Shaoqun Wu is a senior lecturer in the computer science department at University of Waikato, New Zealand. Her
research interests include computer assisted language learning, mobile language learning, supporting language
learning in MOOCs, digital libraries, natural language processing, and computer science education.
Alannah Fitzgerald is a postdoctoral research fellow with the FLAX language project at the University of Waikato
in Aotearoa. She is an open education practitioner and researcher working across formal and non-formal higher
education. Her research interests include Computer-Assisted Language Learning, Data-Driven Learning, English
for Specific and Academic Purposes, Massive Open Online Courses, Open Educational Practices, and Self-
Regulated Learning.
Alex Yu is a senior lecturer at the Centre for Business, Information Technology and Enterprise at the Waikato Institute of
Technology. He is also one of the core developers of the FLAX project. His research interests include computer
assisted language learning, MOOCs, mobile language learning, and data mining.