
International Journal of Computer-Assisted Language Learning and Teaching

Volume 9 • Issue 2 • April-June 2019

Developing and Evaluating a


Learner-Friendly Collocation
System With User Query Data
Shaoqun Wu, University of Waikato, Hamilton, New Zealand
Alannah Fitzgerald, University of Waikato, Hamilton, New Zealand
Alex Yu, Centre for Business, Information Technology, and Enterprise, Wintec, Hamilton, New Zealand
Ian Witten, University of Waikato, Hamilton, New Zealand

ABSTRACT

Learning collocations is one of the most challenging aspects of language learning, as there are hundreds of thousands of possible word combinations. Corpus consultation with concordancers
has been recognized in the literature as an established way for language learners to study and explore
collocations at their own pace and in their own time, although not without technological and sometimes cost barriers. This paper describes the development and evaluation of a learner-friendly collocation consultation system called FlaxLC, a design departure from the traditional concordancer
interface. Two evaluation studies were conducted to assess the learner-friendliness of the system: a
face-to-face user study to find out how international students in a New Zealand university used the
system to collect collocations of their own interest and a user query analysis—based on an observable
artefact of how online learners actually used the system over the course of one year—to examine how
the system is used in real life to search and retrieve collocations.

Keywords
Collocation Database, Collocation Learning, Corpus-Based Language Learning, Data-Driven Learning, User
Query Data Analysis

INTRODUCTION

Collocations, recurrent word combinations, have been widely recognized as an important aspect
of vocabulary knowledge (Firth, 1957; Lewis, 2008; Nattinger & DeCarrico, 1992; Sinclair, 1991;
Nation, 2013). Dictionaries and, more recently, corpus analysis tools (i.e. concordancers and the like)
are the two main resources that learners draw on while acquiring such knowledge. Printed dictionaries
are the popular and traditional collocation learning resource—for example, the BBI Combinatory
Dictionary of English (Benson et al., 1984), the Dictionary of Selected Collocations (Hill & Lewis,
1997) and the Oxford Collocation Dictionary (2009)—dedicated to assisting learners in mastering this
essential knowledge. With corpus analysis tools, learners can enter a word and explore what words
are most likely to occur before or after it. These tools, whether web-based (e.g. the Collins COBUILD
Corpus, WebCorp, WebCollocate, Mark Davies’ Brigham Young Corpora, COCA) or stand-alone
(e.g., WordSmith Tools, AntConc) are specifically designed for linguists, and come with different
interfaces, search functions and the presentation of results that are mostly in the form of keyword-

DOI: 10.4018/IJCALLT.2019040104

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.



in-context (KWIC) fragments and incomplete sentences. In terms of retrieving collocations, they
facilitate the search for two- or three-word collocations.
Corpus-based tools have been explored by many researchers and teachers to facilitate collocation
learning with promising results as demonstrated in the literature (Boulton & Cobb, 2017). They have
been used in helping students find correct word combinations (e.g. “cause a problem” vs. “bring
a problem”) (Yoon, 2008; Chen, 2011; Daskalovska, 2015; Vyatkina, 2016), understand the subtle
meaning of certain verbs that lack direct L1 equivalents: synonyms (e.g. construct, build, and establish),
hypernyms (e.g. create and compose) (Chan and Liou, 2005), and identify common word choice
errors in student writing (Chambers and O’Sullivan, 2004; Wu et al., 2009). Johns (1991) used the
term “data-driven learning” (DDL) to describe this approach that centers on fostering learners’ skills
in becoming a “language researcher”. Despite DDL’s great potential as presented in the literature,
DDL has not been widely accepted by mainstream language educators (Leńko-Szymańska & Boulton,
2015, p. 3). Technical challenges that face both teachers and students go some way to explain the
reluctance to implement DDL in the classroom. This view is supported by the results of a large-scale
survey conducted by Tribble on using corpora in language teaching (Tribble, 2015).
User-friendliness and free access are reported to be two major factors in influencing the
willingness of respondents to use corpora, while “don’t know how to”, “are not familiar with” are
among the reasons for not using corpora. DDL researchers have also reported several factors that
may hinder corpus use, including requirements of metalinguistic knowledge (e.g., part-of-speech
tags) to formulate queries, unfamiliarity with complex search interfaces and functions, overwhelming
results, and difficulties in locating and interpreting target language features in concordances (Yoon
& Hirvela, 2004; O’Sullivan & Chambers, 2006; Yeh, Li, & Liou, 2007; Chen, 2011; Rodgers et
al., 2011; Boulton, 2012a; Chang, 2014; Geluso & Yamaguchi, 2014; Daskalovska, 2015). For
example, Chang (2014) asserts that the differing interfaces and functions of various corpus tools
further increase the technical challenge, whereby learners generally need to learn a new system in
order to access a different corpus.
To make corpus tools accessible for language learners, Boulton (2012b) published a call to
“simplify the technology as much as possible”, whether in the corpus itself or the associated software.
DDL researchers have also proposed improvement recommendations in corpus software development,
for example, “simplified”, “controlled” or “screened” concordancer output before being presented
to students (Varley, 2009; Daskalovska, 2015), and simple interfaces to return multi-word units,
for example, make a presentation or give a presentation for the search query word presentation
(Chan & Liou, 2005). Chen (2011) evaluated three corpus tools—the Hong Kong Polytechnic Web
Concordancer, the COBUILD Concordancer, and BNC Sample Search tool—and proposed an ideal
collocation retrieval tool, one that needs to include: the position of the search term (before or after
the query word), functions for phrase or multiple word searches, support for using L1 in queries,
syntactic information about the collocates (verb, noun, adjective, adverb), retrieval of all relevant
collocations at once, and the clustering of semantically related collocations. In recent years, it
has been encouraging to see the development of dedicated tools for collocation learning (e.g. SKELL)
that include some of the aforementioned functionalities.
This paper describes our attempts at building and evaluating a collocation retrieval system called
FlaxLC and answers the following research questions:

1. How can we design and develop a learner-friendly collocation retrieval tool?


2. How do learners use FlaxLC to look up collocations?

To answer the first research question, we proposed two design principles that aim to minimize
learning and training efforts for using a corpus-based tool. The first principle is to capitalize on learners’
familiarity with traditional language resources (i.e. dictionaries); the second is to utilize learners’
existing web skills with search engines. To answer the second research question, we conducted two


evaluations. First, 32 Chinese postgraduate students at a New Zealand university were divided into
two groups of 16, with one group receiving training in the use of the FlaxLC system and the other
group receiving training in the BYU-BNC1 corpus system to look up and collect collocations specific
to writing task topics. Second, user queries sent to FlaxLC were analysed to examine how the system
was used over the period of one year. The log data analyses that we present in this study, similar to
traditional analyses of user queries on the Web, provide interesting and revealing insights that could
not be gained from small-scale, focused user studies. To the best of our knowledge, this user query
data analysis approach has not been explored in DDL research.

A LEARNER-FRIENDLY COLLOCATION RETRIEVAL TOOL

FlaxLC2 capitalizes on learners’ familiarity with traditional resources—dictionaries—and utilizes


their existing skills with search engines (e.g. Google), including word autocomplete facility for
correctly spelling word queries. It provides four additional query formation aids to assist learners with
looking up collocations and houses three databases described below. In each database, collocations
are automatically extracted, organized by syntactic pattern, sorted by frequency, and linked to their
original context.

Building FlaxLC From Different Corpora


We have developed three collocation databases based on different kinds of text. The first is the British
Academic Written English (BAWE) corpus, which contains 2860 high-standard student assignments
(6 million words) covering four areas: Arts and Humanities, Social Sciences, Physical Sciences, and
Life Sciences (Nesi and Gardner, 2012). This corpus has an academic focus and allows students to
explore collocations commonly used in academic prose.
The second database comprises 90 million words from the British National Corpus (BNC), from
newspapers, specialist periodicals and journals, academic books and fiction, published and unpublished
letters and memoranda, as well as school and university essays. This represents standard English,
collected from different areas; it is suitable for determining the general usage of a particular word.
The third database comprises 3 billion words derived from Wikipedia articles, downloaded from
the Wikipedia website. This corpus represents modern English. It contains articles in many areas
(e.g. art, life, science) and, importantly, on emerging and contemporary topics whose vocabulary is not
covered by other corpora. It is particularly useful for seeking topic-related key words and collocations.

Design Principles
Dictionaries are arguably the most popular resources in language study, and therefore the most
familiar to learners. This is evidenced in Tribble's (2015) iterative and wide-ranging survey of corpus
tool use, in which dictionaries ranked as the most commonly used language resource. After studying
the different definitions of collocation in the literature, and investigating the structure, organization,
and language items found in traditional collocation dictionaries, we designed FlaxLC to mimic the
structure of a traditional collocation dictionary. The following two questions guided our software design:

1. What are useful collocations for learners to learn?


2. What is the best way to organize and present collocations online?

Choosing Useful Collocations for Learners


There are many definitions of collocation in the literature. We adopt the notion of grammatical and
lexical collocation proposed by Benson et al. (1986) and use a syntax-oriented approach to identify
collocations by syntactic structures (e.g. verb + noun, adjective + noun, noun + verb). This approach


is used by the Oxford, BBI, and LTP dictionaries, Ackermann and Chen’s Academic Collocation
List (2013), and is supported by linguists (for instance, Firth, 1957; Sinclair, 1991), and by language
learning researchers (for instance, Nation, 2013; Nattinger & DeCarrico, 1992; Nesselhauf, 2004).
Many DDL researchers (for instance, Chan and Liou, 2005; Yoon, 2008; Wu et al., 2009; Chen, 2011;
Daskalovska, 2015; Vyatkina, 2016) have also used corpus tools with students to study particular
collocation patterns (e.g., verb + noun, adjective + noun, verb + preposition).
We focused on collocations that contain from two to five continuous words, using fourteen
different collocation patterns: eight derived from the research and development of two collocation
dictionaries, the BBI Combinatory Dictionary of English (Benson et al., 1997) and the Oxford Collocation
Dictionary for Students of English (McIntosh et al., 2009), and six that we added ourselves (see Appendix A
for all the patterns and Wu et al., 2016 for a detailed description). Some of these types are extended
to include more constituents of potential use to learners. For example, the noun part of a verb + noun
collocation can include a complex noun phrase involving one or more nouns coupled with modifiers
or prepositions: examples are take full advantage of, play an extremely important role. Wei (1999,
p. 4) supports this approach, arguing that it incorporates syntax into a predominantly semantic and
lexical construct, thus encompassing a wide range of data. Collocations containing very common
adverbs like more, much, very, quite are removed from the patterns involving adverbs because they
can accompany most adjectives and verbs.
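The syntax-oriented identification step described above can be sketched as a pattern match over POS-tagged text. This is an illustrative sketch only, not the authors' implementation: the tags are hand-assigned, the pattern set is a small subset of FlaxLC's fourteen patterns, and a real system would run a POS tagger over a large corpus.

```python
from collections import Counter

# Two-word collocation candidates are adjacent word pairs whose tag
# sequence matches a syntactic pattern; very common adverbs are excluded,
# as described in the text.
PATTERNS = {("ADJ", "NOUN"), ("VERB", "NOUN"), ("NOUN", "NOUN"), ("ADV", "ADJ")}
COMMON_ADVERBS = {"more", "much", "very", "quite"}

def extract_collocations(tagged_tokens):
    """Count adjacent word pairs whose tags match a collocation pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) not in PATTERNS:
            continue
        if w1 in COMMON_ADVERBS or w2 in COMMON_ADVERBS:
            continue  # e.g. "very significant" is dropped
        counts[(w1.lower(), w2.lower())] += 1
    return counts

tagged = [("very", "ADV"), ("significant", "ADJ"), ("benefits", "NOUN"),
          ("take", "VERB"), ("full", "ADJ"), ("advantage", "NOUN")]
print(extract_collocations(tagged))
```

In this toy run, "significant benefits" and "full advantage" are counted, while "very significant" is filtered out by the common-adverb rule.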

Organizing and Presenting Collocations


Traditional dictionaries organize collocations by headword and allow users to look up collocations
through indexes of headwords. In our case, the main challenge when presenting the collocations
to learners is in organizing them in a way to best manage the massive volume of data, without
overwhelming learners. For any given query term there are up to fourteen collocation types; many
words belong to more than one type because their syntactic part of speech is ambiguous (for example,
the word support can be used as a verb or a noun); and some collocations have many variations (e.g.
the word advantage in take advantage of can be qualified by full, unfair, undue, greater).
A further issue is how to organize collocations containing different inflected verb forms (e.g.
taking, takes, took for the verb take). For example, take advantage of, taking advantage of, took
advantage of are the three most frequent verb + noun collocations for advantage, followed by have/
has/had the advantage of. Of these, we chose to show take advantage of and have the advantage of,
suppressing the others so that other useful collocations like gain an advantage, saw the advantage
of, offer the advantage of move further up the result list for presentation to learners.
To address these issues, collocations are organized in a hierarchical structure. Figure 1 illustrates
this design idea. Collocations are first grouped by the syntactic type of the query word (e.g., used as
a noun or a verb). Then they are organized by syntactic pattern (e.g., all verb + noun collocations are
displayed together). In this case, benefit can be used as both a noun and a verb. There are eight patterns
related to the noun form and seven to the verb form; these are shown in order of frequency. Figure 1
displays the first three most popular patterns: adjective + benefit, benefit + noun, and benefit + of
+ noun. The interface contains two columns: syntactic pattern and corresponding collocations. For
each pattern, up to fifty collocation samples and their frequency are retrieved and displayed, ten at a
time. Here, own benefit, benefit concert and benefit of the doubt are the most frequent collocations
of the above three types. The more button at the bottom right reveals the rest.
For collocations that contain extensions (e.g. take full advantage of, took advantage of, taking
advantage of, takes advantage of are extensions of take advantage), only the most frequent one is
displayed; when it is clicked, the others appear in a pop-up window showing their frequencies.
This display of extended collocations is achieved by extracting two key words from the collocation,
transforming them into their base forms, and using these base forms for grouping, as shown in Figure
2. In a further example, clicking on a collocation—in this case economic benefits—brings up a
superimposed window as shown in Figure 3. It displays similar collocations in two columns, along


with their frequency. These all contain the words economic and benefit, whether adjacent or not;
are of the adjective + noun type; and have up to five words. Many include more than one noun or
adjective, such as direct economic benefit, and significant national economic benefit. Clicking a
collocation retrieves samples in context from the original text: Figure 4 shows six samples for direct
economic benefit.
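The grouping of extended collocations can be sketched as follows. This is a minimal illustration, not FlaxLC's code: the tiny lemma lookup stands in for a full lemma list, and the frequencies are invented.

```python
from collections import defaultdict

# A stand-in for a full lemma list: inflected forms map to base forms.
LEMMAS = {"took": "take", "taking": "take", "takes": "take"}

def base_key(keywords):
    """Base-form key built from a collocation's two key words."""
    return tuple(LEMMAS.get(w, w) for w in keywords)

variants = [  # (collocation, its two key words, illustrative frequency)
    ("take advantage of", ("take", "advantage"), 523),
    ("took advantage of", ("took", "advantage"), 310),
    ("taking advantage of", ("taking", "advantage"), 287),
]

groups = defaultdict(list)
for text, keys, freq in variants:
    groups[base_key(keys)].append((freq, text))

display = {}
for key, members in groups.items():
    members.sort(reverse=True)            # most frequent variant first
    head = members[0][1]                  # shown in the main result list
    rest = [t for _, t in members[1:]]    # shown in the pop-up window
    display[key] = (head, rest)

print(display)
```

All three variants collapse under the key ("take", "advantage"); only "take advantage of" appears in the main list, with the others reserved for the pop-up.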
Collocations are ordered by frequency to draw learners’ attention to the most commonly used
ones. This is achieved in three ways: the most frequent syntactic type of the query word, the most
frequent collocation pattern, and the most frequent collocation. For example, with the query term
benefit, its collocations are first grouped under its noun and verb forms. The noun collocations are
displayed first because they are more frequent than the verb collocations. Within the noun group,
adjective + benefit, noun + benefit, benefit + of + noun, verb + benefit … are presented in order of
frequency, and within each collocation pattern the most frequent collocation is always listed first.
The same applies to the verb group.
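The three-level frequency ordering can be sketched with nested sorts over an illustrative (invented) set of counts for the query benefit; the structure, not the numbers, is the point.

```python
# Level 1: word classes; level 2: patterns; level 3: collocations.
# Counts are invented for illustration.
word_classes = {
    "noun": {
        "adjective + benefit": {"own benefit": 91, "economic benefit": 74},
        "benefit + of + noun": {"benefit of the doubt": 66},
    },
    "verb": {
        "benefit + from + noun": {"benefit from this": 12},
    },
}

def class_total(patterns):
    """Total frequency of all collocations under a word class."""
    return sum(sum(collocs.values()) for collocs in patterns.values())

ordered = []
# 1. word classes in descending order of overall frequency (noun first here)
for cls in sorted(word_classes, key=lambda c: class_total(word_classes[c]),
                  reverse=True):
    patterns = word_classes[cls]
    # 2. patterns within a class, most frequent first
    for pat in sorted(patterns, key=lambda p: sum(patterns[p].values()),
                      reverse=True):
        # 3. collocations within a pattern, most frequent first
        for colloc, freq in sorted(patterns[pat].items(),
                                   key=lambda kv: -kv[1]):
            ordered.append((cls, pat, colloc, freq))

print(ordered[0])
```

With these counts, the noun class (total 231) precedes the verb class (total 12), and own benefit heads the adjective + benefit pattern.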

Helping Learners Search for Collocations


This section explores the interface that users employ to look up collocations in FlaxLC. The design is
based on the principle of utilizing learners’ familiarity and existing skills with search engines. Using
search engines as concordancers and the Web as corpus in language learning and teaching is not
new in the DDL literature (for instance, Boulton, 2015; Conroy, 2010; Shei, 2008; Wu et al., 2009).
One advantage of this approach is that search engines are “familiar to most users” (Gao 2011) and
offer a “familiar and easy way to begin simple DDL” (Boulton, 2015). We investigated two popular
search engine interfaces (Google and Yahoo) and propose two functional requirements based on the
understanding that formulating search queries is possibly one of the most challenging tasks language
learners will face when using corpus-based systems. Interfaces need to (1) be as simple as those of
search engines: the user simply types in the word(s) of interest, and (2) provide query formation aids
similar to search suggestions, with relevant feedback.
FlaxLC provides a simple interface. To look up collocations, the user simply types in the word
of interest and selects a database: contemporary English (Wikipedia), academic English (i.e. BAWE),
or standard English (i.e., BNC).

Figure 1. Family words, synonyms, related words, definitions, related topics, and collocations associated with the word benefit


Figure 2. Extended collocations and frequency of take advantage of

Family Words, Synonyms, and Antonyms


Family words, synonyms and antonyms are popular resources students make use of to study a word.
FlaxLC uses Yasumasa Someya’s lemma list3 for English, developed in 1998, to retrieve
inflected and derived forms of the query term. For the query benefit, the family words beneficial,
beneficiary, beneficiaries, benefited, benefiting and benefits are displayed as shown in Figure 1.
WordNet (a large lexical database of English4) is also consulted to identify words that are related to
or associated with a particular query term. For benefit, the verb synonyms include search, explore
and investigate, and the noun synonyms include investigation, investigating, inquiry and enquiry.
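The family-word lookup can be sketched as a lemma-list query: every form maps to a base form, and all forms sharing the query's base form are returned. The entries below are a tiny illustrative excerpt, not Someya's actual list.

```python
# Miniature lemma list: each inflected or derived form maps to a base form.
LEMMA_LIST = {
    "benefit": "benefit", "benefits": "benefit", "benefited": "benefit",
    "benefiting": "benefit", "beneficial": "benefit",
    "beneficiary": "benefit", "beneficiaries": "benefit",
}

def family_words(query):
    """Return all forms sharing the query word's base form, except itself."""
    base = LEMMA_LIST.get(query, query)
    return sorted(w for w, b in LEMMA_LIST.items() if b == base and w != query)

print(family_words("benefit"))
```

For benefit, this returns the six family words listed in Figure 1 (alphabetically here; FlaxLC's own display order may differ).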

Word Autocomplete
Misspelling is common in search engine queries. Wang et al. (2013) reported a misspelling rate of
26% for query words on an academic website they examined. Google’s autocomplete facility consults
historical query terms to provide hints while the user is typing. This approach of reusing historical
query terms is not applicable for FlaxLC user queries because learners’ vocabulary is likely to be
small and limited; the misspelling rate in learners’ historical query terms would therefore be high.
FlaxLC’s Word Autocomplete function helps users form correct English words by consulting a
specially built dictionary of 32,000 word entries extracted from a three-billion-word Wikipedia article
corpus. Words are sorted by frequency; inflected forms of a word (e.g. takes, taken, taking for the
word take) and rare words (i.e. those that occur only once in Wikipedia) are omitted to
achieve a good user interface response time. To avoid overwhelming users with too many language
choices, only up to twenty suggestions are given at a time, for example, present, previous, president,
previously, pressure, presence, press, prevent, presidential, preparation, prestigious, predecessor,
premiere, predominantly, pregnant, presentation, present-day, prepare are presented when the letters
“pre” are typed into the search box.
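A possible implementation of this lookup, sketched here under our own assumptions rather than from FlaxLC's code: the dictionary is held as an alphabetically sorted list so a prefix range can be found by binary search, then matches are ranked by corpus frequency and capped. Words and counts are illustrative.

```python
import bisect

# Illustrative dictionary entries: (word, corpus frequency).
ENTRIES = sorted([("present", 9120), ("presence", 4100), ("press", 3900),
                  ("previous", 8700), ("pressure", 5300), ("prevent", 3100)])
WORDS = [w for w, _ in ENTRIES]  # alphabetical, parallel to ENTRIES

def suggest(prefix, limit=20):
    """Return up to `limit` completions of prefix, most frequent first."""
    lo = bisect.bisect_left(WORDS, prefix)
    hi = bisect.bisect_left(WORDS, prefix + "\uffff")  # end of prefix range
    matches = ENTRIES[lo:hi]
    matches.sort(key=lambda e: -e[1])  # rank by frequency
    return [w for w, _ in matches[:limit]]

print(suggest("pre"))
```

Typing "pre" yields the matches ranked by frequency, mirroring the kind of ranked suggestion list the article describes.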

Related Words
The related words function in FlaxLC extends Chen’s (2011) idea of retrieving words that are
semantically related to the query term. It has been designed to help learners expand their word and
collocation knowledge, especially in domain-specific areas, or on topics related to what they are
studying. We explored the possibility of using the publicly available and growing Wikipedia corpus
of articles to present related words and collocations. FlaxLC first finds the best matching Wikipedia
article and then the keywords and collocations of that article are retrieved. The collocations are
then grouped by the keywords they contain. Figure 5 shows the first 40 words related to the query
animal testing, sorted by TF-IDF (term frequency-inverse document frequency) score, including
animal, primates, test, experiments, research, vivisection, etc. This metric, commonly used in
information retrieval (described by, for example, Witten et al., 1999), ranks words by their relatedness
to the query so that they can be displayed in descending order of relatedness.
Clicking one of the hyperlinked related words, say toxicity, reveals collocations associated with that


Figure 3. Extended collocations and frequency of economic benefit

word. For example, adjective collocates (acute, general, chronic, embryonic toxicity), verb collocates
(reflect, involve, evaluate toxicity), and noun phrases (toxicity tests, sign of toxicity, toxicity of a
substance). More words can be displayed by clicking the more button. Towards the end of the list
the words become more general: for example, the last group of words related to animal testing are
population, line, end, series, form, play, have, be.
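The TF-IDF ranking behind the related-words list can be sketched as follows. The three toy "documents" are illustrative stand-ins for Wikipedia articles; a real system would score keywords of the best-matching article against the whole collection.

```python
import math
from collections import Counter

# Toy document collection standing in for Wikipedia articles.
docs = {
    "animal_testing": "animal testing toxicity animal experiments research animal",
    "music": "concert music play series form",
    "biology": "animal research population line",
}

def tf_idf(term, doc_id):
    """Term frequency in one document, weighted by inverse document frequency."""
    words = docs[doc_id].split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for text in docs.values() if term in text.split())
    idf = math.log(len(docs) / df)
    return tf * idf

article = "animal_testing"  # the best-matching article for the query
ranked = sorted(set(docs[article].split()),
                key=lambda t: -tf_idf(t, article))
print(ranked[:3])
```

Here animal scores highest (frequent in the article, present in few others), while research, which also occurs in the biology document, is pushed down the list; this is the "descending order of relatedness" effect described above.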

EVALUATIONS

After building the system, evaluations were conducted to investigate the second research question, how
learners use FlaxLC to look up collocations. Unlike other DDL research that primarily examines the
educational effectiveness of corpus consultation in language learning (with many promising results
reported in the literature), we focused on learners’ interests and behaviors as they engage
with the system. A user study was conducted to find out the types of collocations that learners are
interested in, and a user query analysis was carried out to discover user behaviors in interaction with
FlaxLC. This section first looks at a small-scale study with 32 language learners, followed by an
analysis of user query data collected over the period of one year.

Figure 4. Text samples of direct economic benefit

Figure 5. Related words for animal testing

User Study
Participants recruited for the study included 32 Chinese postgraduates, aged 20 to 30 (20 males
and 12 females). The participants were divided into two groups, with
one group using FlaxLC and the other using the BYU-BNC. Our primary reasons for conducting a
comparative study with BYU-BNC were two-fold: first, to ascertain whether FlaxLC is easier to use
in terms of retrieving collocations and second, to determine whether language learners are interested
in employing collocations in their writing that consist of more than two words and which would
normally require more effort to identify using a traditional linguistic tool such as a concordancer.
The study took place as part of a tutorial that provides academic writing training to international
students. The tutorial was intended to introduce FlaxLC to students so that they could use it to look up
the collocations of a word while writing a report. This experiment enabled us to determine what types of
collocations students would be interested in without explicitly instructing them to focus on particular
patterns, for instance, verb + preposition in Vyatkina’s study (2016), verb + noun in Chen’s study
(2011), or verb + adverb in Daskalovska’s study (2015). We did not use SKELL, which is another
dedicated tool for learners that bears the closest resemblance to our system, because traditional
concordancers (Mark Davies’ Brigham Young Corpora (BYU), Wordsmith Tools, AntConc, etc.) are the
most widely used tools according to Tribble’s survey (2015) and those reported in the DDL literature.


Method
Both groups undertook the same two collocation retrieval tasks over the period of one hour, at
different time slots, using FlaxLC and BYU respectively. None of the participating students had used
either corpus tool prior to the study. To allow us to effectively compare the results of the two groups,
the FlaxLC group was instructed to use the BNC corpus only (and not the Wikipedia and BAWE
corpora from the drop-down menu options in the FlaxLC) as it is the same corpus used within the
BYU-BNC system. Before completing the collocation collection tasks, participating students from
both groups attended the same tutorial where our definition of collocations was explained, and the
importance of collocation learning was discussed. Instructions and demonstrations on how to use the
tools were given in a separate session. An additional 15 minutes were allocated to the BYU group
for learning the BYU-BNC system’s part-of-speech tags, pattern searching, and color-coding. The
two collocation retrieval tasks were:

1. Collect 10 collocations of a given word, support, including both the noun and verb forms
2. Collect 10 collocations of a given topic, “environmental protection”

The first task asked students to collect collocations containing the word support as a noun and a
verb. This was to draw students’ attention to multiple class words and their associated collocations.
The second task was to collect collocations that the students thought would be useful in writing an
essay on “environmental protection”.

Results
The FlaxLC group completed the two tasks within an average of 25 minutes, and the BYU group
within an average of 35 minutes. Each group collected 320 collocations. The results are given in Table
1. In the first column, collocations are divided into two-word and multi-word groups according to the
number of constituents. Two-word collocations conformed to the patterns of adjective + noun (financial
support), noun + noun (government support) and verb + noun (caused damage). The multi-word
group is further divided into sub-groups based on syntactic patterns: verb + noun (including articles
and prepositions) (support this view, protect the interest of), verb/adjective/noun + infinitive-to +
verb (necessary to support), verb/noun + preposition + noun (threat to the environment, impact on
the environment), and noun + of + noun (lack of support, destruction of the environment).
Table 1 indicates that 84.1% of the BYU participants’ collocations were two-word collocations,
compared with 45.6% for the FlaxLC participants.
When we look at the multi-word collocations collected, the FlaxLC participants collected more
than twice as many verb + noun collocations as the BYU group (77 vs. 34), with 86% (66/77) of the
FlaxLC participants’ multi-word collocations containing articles and prepositions, compared with
11.8% (4/34) for the BYU group. Furthermore, compared with the BYU group, the FlaxLC group
collected at least seven times as many verb/adjective/noun + infinitive-to + verb (42 vs. 4) and
verb/noun + preposition + noun collocations (21 vs. 3). The number of noun + of + noun multi-word
collocations showed a similar trend, with more than three times as many collected by the FlaxLC
group as by the BYU group (34 vs. 10).

Discussion of User Study Findings


The user study described above helped identify which types of collocations learners would be interested
in collecting and employing in their writing, as a first step towards answering the research question of how
learners use FlaxLC to look up collocations. We compared the collocations collected in a one-hour session
using FlaxLC and a traditional corpus tool, BYU. The large percentage of two-word collocations—in
the patterns of verb + noun, noun + noun, and adjective + noun—in both groups (84.1% and 45.6%)
suggests a general acceptance of two-word collocations among the student groups. The 54.4% of


Table 1. Collocations collected by FlaxLC and BYU participant groups

Collocation pattern                                                  FlaxLC         BYU
Two-word collocations: adjective + noun (financial support),
    noun + noun (government support), verb + noun (caused damage)    146 (45.6%)    269 (84.1%)
Multi-word collocations:
    verb + noun, including articles and prepositions
    (solve the issues, minimize the impact of)                       77 (24.1%)     34 (10.6%)
    verb/adjective/noun + infinitive-to + verb
    (continue/necessary to support)                                  42 (13.1%)     4 (1.3%)
    verb/noun + preposition + noun (damage to environment)           21 (6.6%)      3 (0.9%)
    noun + of + noun (degree of accuracy)                            34 (10.6%)     10 (3.1%)
Total                                                                320 (100%)     320 (100%)

multi-word collocations from the FlaxLC group indicates students’ preference for collecting and
employing longer chunks, as these serve as “points of fixation” or “islands of reliability” for developing
language proficiency (Dechert, 1984). The FlaxLC participants also showed a tendency in favor of
collecting collocations that contain articles and prepositions when they are presented directly in the
FlaxLC search results. Both groups showed an interest in the noun + of + noun and noun + preposition +
noun patterns, which have been identified as two of the key components of grammatical complexity and
text density in academic writing (Biber et al., 2011; Biber & Gray, 2011; Halliday, 1993), but which
studies show have been underused in EAP student writing (Lu, 2011; Parkinson & Musgrave, 2014).
The resulting 13.1% of verb/adjective/noun + infinitive-to + verb collocations collected suggest the
popularity of this type of grammatical pattern among the FlaxLC students. However, more studies
are needed to confirm if these findings apply to other ethnic groups with different L1s.
The results of this study have supported our decision to include grammatical collocations
and multi-word collocations that contain articles and prepositions. They also suggest that further
investigation of the effectiveness of the noun + preposition + noun and noun + of + noun patterns in
DDL research is warranted.

USER QUERY ANALYSIS

FlaxLC records user queries (user actions or requests for information) in log files while the user is
interacting with the system. We analyzed user queries sent to FlaxLC over the period of one year
(from June 2016 to June 2017) to further investigate the research question, how learners use FlaxLC
to look up collocations on a large scale. In particular, our analysis helped us to establish FlaxLC
user origins, user preferences, and typical user behaviors. It also provided insights for improving
FlaxLC in the future.

User Query Entries


This section provides an overview of how users interact with the system by looking at user query
entries recorded in log files. There are four possible interactions between a user and FlaxLC as shown
in Figure 6:
Interaction 1 (Retrieving Collocations)


Figure 6. User interactions with FlaxLC

The user enters a query term by typing in a word (advantage) or by selecting a word from one
of the search suggestions (i.e., family words, synonyms, antonyms, or related words) to retrieve
collocations. FlaxLC returns collocations and displays them on a web page as shown in Picture 1.
Interaction 2 (Viewing Extended Collocations)
The user clicks on a hyperlinked collocation (take advantage of) to view extended collocations
as shown in Picture 2.
Interaction 3 (Viewing More Collocations)
The user clicks the “more” button to retrieve more collocations (offer the advantage of) as shown
in Picture 3.
Interaction 4 (Viewing Sample Sentences)
The user clicks on a hyperlinked collocation (took full advantage of) to view sample sentences
containing collocations in context.
During the course of interactions, the user can leave the system at any point and at any time. For
example, the user could enter a query term, have a look at the collocations from the search results and
then depart. User interactions or a sequence of interactions as shown above in Figure 6 are recorded
as query entries in log files. Below are three examples.

1. [2016-09-03 04:14:34] [s=CollocationQuery&s1.query=advantage&s1.dbName=BAWE&s1.from=wf] [31.205.128.106]
2. [2016-09-03 04:14:40] [s=ExtendedCollocations&s1.collocation=take advantage of&s1.dbName=BAWE] [31.205.128.106]
3. [2016-09-03 04:14:43] [s=SampleTexts&s1.collocation=took full advantage of&s1.dbName=BAWE] [31.205.128.106]


Square brackets divide an entry into three parts. The timestamp (2016-09-03 04:14:34) tells when
a query arrived at the system. The query parameters (s=CollocationQuery ... s1.from=wf) provide the
details for decoding an interaction. The last part (31.205.128.106) is the IP address of the query, from
which its geographic region can be derived (in this case, Birmingham, UK). The three entries above
indicate the following sequence of interactions from a single user:

• clicked the word advantage (s1.query=advantage) on the “family words” panel (s1.from=wf),
• chose the BAWE corpus (s1.dbName=BAWE),
• after 6 seconds, clicked take advantage of for extended collocations,
• after 3 seconds, clicked took full advantage of for the sample sentences.
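The decoding sketched above is mechanical enough to automate. The following Python fragment is an illustrative reconstruction, not the actual FlaxLC log-processing code; it assumes only the three-bracket entry format shown in the sample entries.

```python
import re
from urllib.parse import parse_qs

# A query entry has three bracketed parts: timestamp, query parameters, IP address.
ENTRY = re.compile(r"\[(.*?)\] \[(.*?)\] \[(.*?)\]")

def parse_entry(line):
    """Split one log line into a timestamp, a parameter dictionary, and an IP address."""
    timestamp, params, ip = ENTRY.match(line).groups()
    # Parameters use standard key=value&key=value web encoding.
    decoded = {k: v[0] for k, v in parse_qs(params).items()}
    return timestamp, decoded, ip

line = ("[2016-09-03 04:14:40] "
        "[s=ExtendedCollocations&s1.collocation=take advantage of&s1.dbName=BAWE] "
        "[31.205.128.106]")
ts, params, ip = parse_entry(line)
print(params["s"])               # ExtendedCollocations
print(params["s1.collocation"])  # take advantage of
```

Identifying a searching episode is then a matter of grouping entries by IP address and timestamp.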

Using the user query entries recorded over one year, we have conducted a general analysis to
provide a statistical overview of how the system is used and a user behavior analysis to examine how
users interact with the system in depth.

General Analysis
This analysis provides simple statistics about the number of queries, the geographic regions of
queries, and users’ preference among the three databases (see section “Building FlaxLC from different
corpora”).
In total, 354,694 queries from 67 countries were recorded, an average of 971 queries per day. Table
2 shows the top 10 countries and corresponding percentages. Queries from 57 other countries are
grouped under the “Other” category. About two thirds (65%) of queries were from three English-
speaking countries: The United Kingdom (28%), New Zealand (24%) and Australia (13%). The
Republic of Korea is at the top of the list among all non-English-speaking countries, followed by
China, Russia, Belarus and Israel.
The popularity scores of the three databases (Wikipedia, BAWE, and BNC) are given in Table
3, along with statistics on user preferences by country. The Wikipedia database (53.2%) is the most
popular, but it should be noted that Wikipedia is the default corpus offered by the FlaxLC system:
users must explicitly select the BAWE or BNC corpus from the drop-down menu. The BAWE corpus
comes in second place, which may indicate an increased focus on learning academic English among
users. The user preferences by country show that New Zealand users prefer Wikipedia and the BNC,
and that users in the Republic of Korea prefer the Wikipedia corpus. The BAWE corpus is most
popular among United Kingdom users (50.9%), perhaps unsurprisingly since BAWE was developed
at three UK universities, followed by Australian users (21.1%). The results are mixed and not
distinctive among other countries.

User Behavior Analysis


User behavior analysis provided in-depth information on user interactions with the system and was
conducted to examine (1) how users formulate query terms; (2) query terms and their occurrences;
(3) query term categorization; and (4) user engagement.

How Users Formulate Query Terms


We define a query term as a string of characters (making up a single word or multiple words)
provided by the user to retrieve collocations. FlaxLC users formulate a query term by typing in
a single word (with the help of Word Autocomplete), typing two or more words, or clicking one of the
query formation aids (family words, synonyms, antonyms, or related words). Examining how users
formulate query terms helps us assess the usefulness of the query formation aids that FlaxLC provides
and make further improvements to the design of the system.


Table 2. Geographic distribution of FlaxLC users

Country              Percent of Queries
United Kingdom       28%
New Zealand          24%
Australia            13%
Republic of Korea    9.5%
China                3.3%
Russia               3.2%
United States        2.8%
Canada               2.7%
Belarus              2.3%
Israel               1.8%
Other                9.4%

Table 3. Database usage and user preferences by country

Database     Percent of Queries   User Preferences by Country
Wikipedia    53.2%                New Zealand 26.6%; Republic of Korea 19.2%; United Kingdom 19.1%; Other 35.1%
BAWE         38%                  United Kingdom 50.9%; Australia 21.1%; New Zealand 14.5%; Other 13.5%
BNC          8.8%                 New Zealand 63.5%; United Kingdom 6.7%; United States 5.5%; Other 24.3%

Table 4 gives statistics on how users formulate a query term. 219,190 query terms were
submitted, an average of 601 per day. 82.8% are single-word queries (i.e., users type in
a single word). Only a small percentage (8.7%) are multi-word queries. Family words are the most
popular query formation aid, followed by synonyms, related words, and antonyms. Notably, the
misspelling rate in query terms was 0.5% (840 out of 181,508), compared with 28% before the Word
Autocomplete facility was available in FlaxLC.
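Word Autocomplete of this kind can be implemented as a simple prefix search over the sorted corpus vocabulary. The sketch below is our own illustration under that assumption, not the actual FlaxLC implementation; the sample vocabulary is invented.

```python
import bisect

# Invented sample vocabulary; in practice this would be the full corpus word list.
WORDS = sorted(["advance", "advantage", "advantageous", "adverb", "advice", "advise"])

def autocomplete(prefix, limit=10):
    """Return up to `limit` vocabulary words that start with `prefix`."""
    start = bisect.bisect_left(WORDS, prefix)  # first word >= prefix
    matches = []
    for word in WORDS[start:]:
        if not word.startswith(prefix) or len(matches) == limit:
            break
        matches.append(word)
    return matches

print(autocomplete("advan"))  # ['advance', 'advantage', 'advantageous']
```

Because suggestions are restricted to words actually present in the corpus, a misspelled prefix simply yields no completions, which helps explain the drop in misspelled queries reported above.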

Query Terms and Occurrences


Query terms comprise 23,700 unique single words and 12,100 unique multi-word terms. Table 5 and
Table 6 present the top 30 single-word and multi-word query terms with their respective frequencies.
The top five words (impact, problem, function, evidence, significant) were each searched more than 500


Table 4. Statistics of how users formulate a query term

Formulating a Query Term By   Frequency   Percentage
typing a single word          181,508     82.8%
typing multiple words         19,094      8.7%
clicking a family word        15,423      7.0%
clicking a synonym            2,339       1.1%
clicking a related word       703         0.34%
clicking an antonym           123         0.06%
Total                         219,190     100%

times over the period of one year, more than once per day. Multi-word queries painted an interesting
picture. Users tended to include articles and prepositions in their query terms (make a mistake, scale
of problem). There were a number of phrasal verbs (point out, lead to, focus on, carry out, make up),
and discourse markers commonly used in academic writing (due to, in order to, in terms of, such
as, as well as).

Term Categories
Search engine query terms are commonly classified by topic (e.g., Sexual, Social, Education, Sports,
News) in Web user analyses (Li et al., 2005; Jansen et al., 2000; Ross & Wolfram, 2000). This
approach is not useful for the Flax project, however, because our users seek language patterns, not
websites for information. Instead, we grouped query words into four categories using wordlists
developed by language researchers:

• Top 1000 words in West’s General Service List (1953)
• Top 2000 words in West’s General Service List (1953)
• Academic words in Coxhead’s Academic Word List (2000)
• Off-list words, not in any of the wordlists above
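The categorization itself amounts to a priority-ordered lookup against these lists. Below is a minimal sketch of the idea; the list contents are tiny invented placeholders, not the actual GSL or AWL entries.

```python
# Invented placeholder samples; the real lists contain thousands of word families.
GSL_1000 = {"make", "take", "problem"}
GSL_2000 = {"approach", "benefit"}
AWL = {"impact", "analysis", "significant"}

def categorize(word):
    """Assign a query word to the first matching wordlist category."""
    word = word.lower()
    if word in GSL_1000:
        return "West's top 1000"
    if word in GSL_2000:
        return "West's top 2000"
    if word in AWL:
        return "Academic Word List"
    return "Off-list"

print(categorize("impact"))  # Academic Word List
print(categorize("pizza"))   # Off-list
```

Checking the lists in this fixed order ensures that a word appearing in more than one list is counted only once, under its highest-priority category.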

Table 5. Listing of the top 30 single-word query terms

Term Frequency Term Frequency Term Frequency


impact 721 effect 378 analysis 307
problem 682 approach 372 concern 306
function 655 concept 357 contribute 300
evidence 580 access 349 education 299
significant 567 challenge 342 distribution 299
issue 493 consequence 342 potential 285
benefit 447 strategy 329 promote 284
policy 415 investigate 319 aspect 283
influence 383 economic 309 increase 281

66
International Journal of Computer-Assisted Language Learning and Teaching
Volume 9 • Issue 2 • April-June 2019

Table 6. Listing of the top 30 multi-word query terms

Term               Frequency   Term                   Frequency   Term             Frequency
make a mistake     116         result in              35          precise idea     28
impress upon       105         take action            33          in order to      28
interacting each   79          call up                33          due to           27
scale of problem   64          according to           32          on fire          26
exposed to         54          lead to                32          roll out         26
point out          53          open door policy       30          expose to        26
attention span     42          trade liberalization   29          explanation of   24
shining example    39          fossil fuels           29          such as          24
carry out          37          bring about            28          pay attention    23

Table 7 shows the statistics of query words as they correspond to each of the wordlist and off-list
categories. Academic words (40.2%) are the most popular, followed by West’s top 1000 words (24%)
and West’s top 2000 words (10.8%). That is, 75% of word queries come from either West’s or Coxhead’s
wordlists and only 25% are made up of off-list words. The top 100 query words in each category are
given in Appendix B.
We also examined the number of unique query words in each wordlist. The results are given in
Table 8, where the second column displays the total number of words (i.e., headwords plus family
words) in each wordlist. 86% of Coxhead’s academic words occurred among the query words, and the
coverage of West’s top 1000 words was somewhat higher than that of West’s top 2000 words
(66.9% vs. 54.3%).

Measuring User Engagement


Three quantitative measures are used to examine user engagement with FlaxLC: the number of
queries, the types of queries (e.g., retrieving collocations and viewing sample sentences in Figure 6),
and the time spent on the website. We define a searching episode as a series of queries within a limited
duration of time (a number of minutes or hours). In this project, a series of queries corresponds to a
sequence of interactions as demonstrated above in Figure 6, for example:

(1) retrieving collocations; (2) viewing extended collocations; (3) viewing sentence samples;
(4) retrieving collocations; (5) viewing more collocations

The series numbered 1-5 above indicates that the user made five sequential queries: entered a
query term; viewed extended collocations; viewed sample sentences of a collocation; entered another
query term; and viewed more collocations. We computed the number of queries, the types of queries,
and the time spent on the website from each series of user queries within a 10-minute time frame.
Allocating 10 minutes as the time frame follows from an experiment showing that 66% of our users
make 2-10 queries per day, together with the assumption that the average user would spend at most
1 minute per query.
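The episode measures described above can be computed from timestamped entries roughly as follows. This is an illustrative sketch under the 10-minute window assumption stated above, not the authors’ actual analysis code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def sessionize(timestamps):
    """Group one user's query times into 10-minute episodes.

    Returns, for each episode, the number of queries and the elapsed time
    between the first and last query (zero for single-query episodes)."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    episodes, start = [], 0
    for i in range(1, len(times) + 1):
        # Close the episode at the end of input or when the window is exceeded.
        if i == len(times) or times[i] - times[start] > WINDOW:
            episode = times[start:i]
            episodes.append((len(episode), episode[-1] - episode[0]))
            start = i
    return episodes
```

For the three sample entries shown earlier plus a later query, this would report one episode of three queries lasting nine seconds and a second single-query episode of zero duration.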
Table 9 shows the number of queries made within 10 minutes. About 30% of users made a single
query. The majority (69%) made between two and twenty queries: 48.8% made two to five, 14.5% six
to ten, and 5.7% eleven to twenty. A small group (1.2%) made twenty-one or more queries. The
average number of queries per user was 6.29 over the 10-minute time frame.


Table 7. Number of query words in each category

Wordlist                       Number of Query Words   Percent of Query Words
West’s top 1000 words          48,028                  24%
West’s top 2000 words          21,530                  10.8%
Coxhead’s Academic Word List   81,088                  40.2%
Off-list                       49,450                  25%
Total                          200,096                 100%

Table 8. Number of unique query words in each category

Wordlist                       Number of Words   Number of Unique Query Words   Percent of Unique Query Words
West’s top 1000 words          4,118             2,756                          66.9%
West’s top 2000 words          3,708             2,015                          54.3%
Coxhead’s Academic Word List   3,107             2,673                          86%

The statistics for the types of queries are based on the four possible interactions shown in Figure
6: Retrieving Collocations (Interaction 1), Viewing Extended Collocations (Interaction 2), Viewing
More Collocations (Interaction 3), and Viewing Sentence Samples (Interaction 4). Table 10 shows
that retrieving collocations makes up 69.8% of interactions, followed by viewing extended collocations
(15.3%), viewing sample sentences (8.2%), and viewing more collocations (6.7%).
The duration of a series of queries is calculated by subtracting the time of the first query, e.g., (1)
retrieving collocations, from the time of the last query, e.g., (5) viewing more collocations. Note that
the duration of a series is only an estimate of the time spent on the website, and only for users who
made at least two queries (the duration is zero for series comprising a single query). The calculation
also excludes the time the user spent after the last query (e.g., time spent on (5) viewing more
collocations). Table 11 provides the statistics for time spent on the FlaxLC website. 41.5% of
users spent less than 2 minutes in total, which corresponds to the 48.8% of users who made two to five
queries as shown in Table 9. 38.5% (13.8% + 11.7% + 12.9%) of users spent a total of two to eight
minutes on the FlaxLC website, whereas 20.1% of users stayed on the FlaxLC website for eight to
ten minutes. The average time spent on FlaxLC per user was 4.01 minutes over the 10-minute frame.

Table 9. Number of queries made by FlaxLC users within 10 minutes

Number of Queries   Number of Users   Percent of All Users
1                   24,804            29.8%
2-5                 40,465            48.8%
6-10                12,029            14.5%
11-20               4,719             5.7%
21 and above        973               1.2%
Total               82,990            100%


Table 10. Statistics of the types of queries

Type of Query                   Number of Queries   Percent of Queries
Retrieving Collocations         247,486             69.8%
Viewing Extended Collocations   54,238              15.3%
Viewing More Collocations       23,635              6.7%
Viewing Sentence Samples        29,355              8.2%
Total                           354,694             100%

Discussion on Query Analysis Findings


User query analysis helped us to understand how FlaxLC was used over the period of one year. The
results showed that the large majority of our users (65%) come from three English-speaking countries,
which echoes Tribble’s (2015) survey results on “who is using language corpora,” where the majority
of respondents were English speakers. Our data also indicate a growing number of users from
non-English-speaking countries, particularly in Asia. Further research in the literature, or another
round of analyses with FlaxLC in a year’s time, could help confirm whether there is wider acceptance
and increased practice of DDL in non-English-speaking countries.
Our users showed different preferences among the three databases (Wikipedia, BNC, and BAWE).
The popularity of BAWE (despite its smaller size) and the large percentage (40%) of academic words
among query words suggest that FlaxLC’s user base consists of academically oriented students in
universities or colleges, which is again in line with Tribble’s (2015) survey findings, where nearly 80%
of respondents were working in higher education. These findings with respect to user preferences for
academic English language support have prompted us to develop further dedicated academic collocation
databases from high-quality texts such as journal articles, with divisions into different disciplines
(for example, Arts and Humanities, Social Sciences, Physical Sciences, and Life Sciences).
The examination of query terms revealed a rather low usage of multi-word queries (8.7%). This
would suggest either that users were unaware of this functionality, or that they tried it but stopped
using it because of unsatisfactory results. FlaxLC does not support multi-word queries on function
words like in, of, as, and up; such function words are ignored in the collocation retrieval process.
Alternatively, this may imply a difference between our view and our users’ view of what constitutes
a collocation. Also of note was that 75% of query words belonged to West’s General Service List
(1953) or Coxhead’s Academic Word List (2000), indicating that users are more likely to look up
collocations of words they have already learned. A compilation of query words would be of great
value to language teachers and researchers in understanding which

Table 11. Statistics of time spent on the FlaxLC website

Time Spent (Mins)   Number of Users   Percent of All Users
less than 2         24,085            41.5%
2-4                 7,991             13.8%
4-6                 6,817             11.7%
6-8                 7,473             12.9%
8-10                11,642            20.1%
Total               58,008            100%


words students are having difficulty with, or are interested in, when learning and employing
collocations.
The dramatic decrease in misspellings in query terms and the popularity of the query formation
aids (word autocomplete, family words, synonyms, related words, and antonyms) suggest that such
facilities are essential in a learner-friendly corpus tool. A further route for development would be to
use word embedding technology to provide semantically similar words in general, and in academic
text in different disciplines.
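To illustrate the word-embedding idea, ranking vocabulary by cosine similarity to a target word’s vector is the standard approach. The vectors below are toy values purely for demonstration; real embeddings have hundreds of dimensions and would be learned from corpus data.

```python
import math

# Toy 3-dimensional "embeddings" for demonstration only.
VECTORS = {
    "benefit":   [0.9, 0.1, 0.2],
    "advantage": [0.8, 0.2, 0.1],
    "damage":    [-0.7, 0.6, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, k=1):
    """Rank the other vocabulary words by similarity to `word`."""
    target = VECTORS[word]
    ranked = sorted((w for w in VECTORS if w != word),
                    key=lambda w: cosine(target, VECTORS[w]), reverse=True)
    return ranked[:k]

print(most_similar("benefit"))  # ['advantage']
```

Such nearest-neighbor ranking is what would let FlaxLC suggest semantically similar words beyond the fixed synonym and related-word lists it currently uses.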
Our user engagement measures were based on the assumption that, within the 10-minute window,
the more time users spent on the website and the more queries they made, the more engaged they were.
The average number of queries was 6.29 and the average time spent on FlaxLC was 4.01 minutes per
user. However, it is infeasible to compare our results with those of similar studies because different
durations would yield different results. The closest we have found in the literature is Jansen et
al.’s (2000) study of the Excite Internet search engine, in which the average number of queries
per user was 2.84, although the duration was not specified in the paper. Comparing the time spent
on FlaxLC with that on other websites is even harder because of the different nature of different
websites; for example, a user would most likely spend more time on YouTube or a gaming website
than on a university’s homepage. Nevertheless, we would argue that the users were engaged with
FlaxLC: nearly 30% of queries were non-retrieval queries (e.g., viewing extended collocations,
viewing more collocations, viewing sample sentences), suggesting that users spent a reasonable
amount of time examining the search results, e.g., studying how a collocation is used
in context in the example sentences.
In summary, this initial user query analysis has provided valuable insights that would be hard
to gain from small and short-term user studies. In spite of the benefits, such an analysis has its
weaknesses. Our data are only based on observable artefacts of what the users actually did: in our
case, when they searched, how long they stayed on the website, what word(s) they searched for,
which facilities they used (synonyms, antonyms, related words, etc.), whether they looked up sample
sentences of a collocation and so on. We know much less about why they are doing what they are doing
and whether they are satisfied with the results of the system. We also did not have any information
about the users themselves (except their geographic regions) or about what they do with the search
results—which collocations were taken away and whether or not they were used and how. This
limitation must be considered in analyses and complemented by other techniques (e.g., think-aloud
protocols in combination with user studies and surveys to gather perception data) to provide a more
complete understanding of user behavior.

CONCLUSION

Collocation learning has been recognized as one of the most challenging and important aspects of
language learning. Corpus consultation provides a promising way for learners to self-study collocations
in their own time. There are many corpus tools with different user interfaces; some are free and
some are not, and some are available for students to use, but those designed for and dedicated to
language learners are few and far between. We have designed and built a learner-friendly collocation
consultation system that utilizes learners’ existing familiarity with dictionaries and search engines.
FlaxLC offers great potential for facilitating DDL in the language classroom, particularly
collocation learning through corpus consultation. Here we present some ideas demonstrated
by Wu et al. (2016) for preparing students for essay-writing, whereby students are asked to collect
collocations or related words that are germane to a specific writing topic. This approach is likened
to brainstorming, wherein new and inspiring ideas may be encountered through the collocations and
related-words functions. FlaxLC can also help students find the right words to express their ideas, for
example, finding appropriate verbs or adjectives for a particular noun, or adding adverbs to qualify
statements (i.e., hedging and boosting). To increase lexical range in student writing, students can use


FlaxLC to find synonyms, or other members of the same word family, to avoid overusing the same
word (e.g., searching for benefit generates the family word beneficial, and also provides an expanded
range of verb usages such as benefit consumers, benefit greatly from, able to benefit from, and benefit
from the use of).
FlaxLC currently houses three databases built from the British Academic Written English Corpus,
the British National Corpus and a Wikipedia corpus comprised of three million articles. New and
extensive databases with high-quality academic text in different disciplines will be developed in
response to the increasing preference for academic English corpora.
The user study we conducted was small and simple, but it served well in assessing learner
interests in collecting collocations with FlaxLC. The results show a preference for longer collocations
and particular collocation patterns (e.g., verb/noun/adjective + infinitive-to + verb) among the group
of students that participated in our study. Further investigation is needed to find out whether the
results would vary with students of different English education and L1 backgrounds. Scaling
this research to include students from different backgrounds may yield pedagogical implications
for designing and providing targeted collocation tasks for such students. Longitudinal studies in
which FlaxLC is embedded into classroom language activities (for example, vocabulary learning
and writing) would also shed more light on how teachers and students intend to use, and actually use,
FlaxLC in their DDL practice. We invite participation from teachers and researchers, and believe
that further participation from the language education community will lead to further refinement of
the system.
Our initial user query analysis, which has the benefit of being easy to capture at scale, not only
provided suggestions for improving the usability and user experience of our system, but also revealed
interesting facts about how FlaxLC is used. Such analyses would provide valuable information and
suggestions for DDL researchers and language teachers when helping their students study collocations.
It could also go some way toward answering research questions such as what makes a word difficult to
learn, by examining the collocations that students have looked at, or whether query terms differ
across geographical regions. We have recently added new facilities to track user interactions with
the system in more detail to identify patterns in users’ query reformulation strategies (i.e., site
searching strategies). These additional facilities will also allow us to draw a comparison between
this analysis and a further one in a year’s time, along with a comparison between users from English
speaking and non-English speaking countries. The results would yield new and in-depth insights for
understanding user behavior in corpus consultation.


REFERENCES

Ackermann, K., & Chen, Y.-H. (2013). Developing the Academic Collocation List (ACL) - A corpus-driven
and expert-judged approach. Journal of English for Academic Purposes, 12(4), 235–247. doi:10.1016/j.
jeap.2013.08.002
Benson, M., Benson, E., & Ilson, R. (1986). The BBI combinatory dictionary of English: A guide to word
combinations. Amsterdam: John Benjamins. doi:10.1075/z.bbi1(1st)
Benson, M., Benson, E., & Ilson, R. (1997). The BBI dictionary of English word combinations. Amsterdam:
John Benjamins. doi:10.1075/z.bbi1(2nd)
Biber, D., & Gray, B. (2011). Grammatical change in the noun phrase: The influence of written language use.
English Language and Linguistics, 15(2), 223–250. doi:10.1017/S1360674311000025
Biber, D., Gray, B., & Poonpon, K. (2011). Should we use characteristics of conversation to measure grammatical
complexity in L2 writing development? TESOL Quarterly, 45(1), 5–35. doi:10.5054/tq.2011.244483
Boulton, A. (2012a). Hands-on / hands-off: Alternative approaches to data-driven learning. In J. Thomas & A.
Boulton (Eds.), Input, process, and product: Developments in teaching and language corpora (pp. 152–168).
Brno, Czech Republic: Masaryk University Press.
Boulton, A. (2012b). Wanted: Large corpus, simple software. No timewasters. In A. Leńko-Szymańska (Ed.),
TaLC10: 10th International Conference on Teaching and Language Corpora (pp. 1-6). Warsaw, Poland: Academic Press.
Boulton, A. (2015). Applying data-driven learning to the Web. In A. Leńko-Szymańska & A. Boulton (Eds.),
Multiple Affordances of Language Corpora for Data-driven learning (pp. 267–295). Amsterdam: John Benjamins.
Boulton, A., & Cobb, T. (2017). Corpus Use in Language Learning: A Meta-Analysis. Language Learning,
67(2), 348–393. doi:10.1111/lang.12224
Chambers, A., & O’Sullivan, Í. (2004). Corpus consultation and advanced learners’ writing skills in French.
ReCALL, 16(1), 158–172. doi:10.1017/S0958344004001211
Chan, T.-P., & Liou, H.-C. (2005). Effects of Web-based Concordancing Instruction on EFL Students’
Learning of Verb – Noun Collocations. Computer Assisted Language Learning, 18(3), 231–251.
doi:10.1080/09588220500185769
Chang, J.-Y. (2014). The use of general and specialized corpora as reference sources for academic English
writing: A case study. ReCALL: the Journal of EUROCALL, 26(2), 243–259. doi:10.1017/S0958344014000056
Chen, H.-J. H. (2011). Developing and evaluating a web-based collocation retrieval tool for EFL students and
teachers. Computer Assisted Language Learning, 24(1), 59–76. doi:10.1080/09588221.2010.526945
Conroy, M. (2010). Internet tools for language learning: University students taking control of their writing.
Australasian Journal of Educational Technology, 26(6), 861–882. doi:10.14742/ajet.1047
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213–238. doi:10.2307/3587951
Daskalovska, N. (2015). Corpus-based versus traditional learning of collocations. Computer Assisted Language
Learning, 28(2), 130–144. doi:10.1080/09588221.2013.803982
Dechert, H. W. (1984). Second language production: Six hypotheses. In H. W. Dechert, D. Mohle, & M. Raupach
(Eds.), Second language productions (pp. 211–230). Tübingen, Germany: Gunter Narr Verlag.
Firth, J. R. (1957). Modes of Meaning. Papers in Linguistics 1934-51. Oxford University Press.
Gao, Z.-M. (2011). Exploring the effects and use of a Chinese-English parallel concordancer. Computer Assisted
Language Learning, 24(3), 255–275. doi:10.1080/09588221.2010.540469
Geluso, J., & Yamaguchi, A. (2014). Discovering formulaic language through data-driven learning: Student
attitudes and efficacy. ReCALL, 26(2), 225–242. doi:10.1017/S0958344014000044
Halliday, M. A. K. (1993). Some grammatical problems in scientific English. In M. A. K. Halliday & J. R.
Martin (Eds.), Writing science (pp. 69–85). London: The Falmer Press.


Hill, J., & Lewis, M. (Eds.). (1997). LTP Dictionary of Selected Collocations. LTP.
Jansen, B. J., & Spink, A. (2006). How are we searching the world wide web?: A comparison of nine search
engine transaction logs. Information Processing & Management, 42(1), 248–263. doi:10.1016/j.ipm.2004.10.007
Jansen, B. J., Spink, A., & Saracevic, T. (2000). Real life, real users, and real needs: A study and analysis
of user queries on the web. Information Processing & Management, 36(2), 207–227. doi:10.1016/S0306-
4573(99)00056-4
Johns, T. (1991). Should you be persuaded: Two examples of data-driven learning. English Language Research
Journal, 4, 1–16.
Leńko-Szymańska, A., & Boulton, A. (2015). Multiple affordances of language corpora for data-driven learning.
Amsterdam: John Benjamins Publishing Company. doi:10.1075/scl.69
Lewis, M. (2008). Implementing the lexical approach: Putting theory into practice. London: Heinle Cengage
Learning.
Li, Y., Zheng, Z., & Dai, H. K. (2005). Kdd cup-2005 report: facing a great challenge. SIGKDD Explor. Newsl.,
7, 91–99. http://doi.acm.org/10.1145/1117454.1117466
Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL
writers’ language development. TESOL Quarterly, 45(1), 36–62. doi:10.5054/tq.2011.240859
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge, UK: Cambridge University
Press. doi:10.1017/CBO9781139858656
Nattinger, J. R., & DeCarrico, J. S. (1992). Lexical phrases and language teaching. Oxford, UK: Oxford
University Press.
Nesi, H., & Gardner, S. (2012). Genres across the disciplines. Cambridge, UK: Cambridge University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for
teaching. Applied Linguistics, 24(2), 223–242. doi:10.1093/applin/24.2.223
O’Sullivan, I., & Chambers, A. (2006). Learners’ writing skills in French: Corpus consultation and learner
evaluation. Journal of Second Language Writing, 15(1), 49–68. doi:10.1016/j.jslw.2006.01.002
Oxford Advanced Learner's Dictionary (6th ed.). (2000). Oxford, UK: Oxford University Press.
Oxford Collocations Dictionary for Students of English (2nd ed.). (2009). Oxford, UK: Oxford University Press.
Parkinson, J., & Musgrave, J. (2014). Development of noun phrase complexity in the writing of English for Academic Purposes students. Journal of English for Academic Purposes, 14, 48–59. doi:10.1016/j.jeap.2013.12.001
Ross, N. C. M., & Wolfram, D. (2000). End user searching on the internet: An analysis of term pair topics
submitted to the excite search engine. Journal of the American Society for Information Science, 51(10), 949–958.
doi:10.1002/1097-4571(2000)51:10<949::AID-ASI70>3.0.CO;2-5
Shei, C. C. (2008). Discovering the hidden treasure on the Internet: Using Google to uncover the veil of
phraseology. Computer Assisted Language Learning, 21(1), 67–85. doi:10.1080/09588220701865516
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford, UK: Oxford University Press.
Tribble, C. (2015). Teaching and language corpora: Perspectives from a personal journey. In A. Leńko-Szymańska & A. Boulton (Eds.), Multiple affordances of language corpora for data-driven learning (pp. 37–62). Amsterdam: John Benjamins. doi:10.1075/scl.69.03tri
Varley, S. (2009). I’ll just look that up in the concordancer: Integrating corpus consultation into the language
learning environment. Computer Assisted Language Learning, 22(2), 133–152. doi:10.1080/09588220902778294
Vyatkina, N. (2016). Data-driven learning of collocations: Learner performance, proficiency, and perceptions.
Language Learning & Technology, 20(3), 159–179.


Wei, Y. (1999). Teaching collocations for productive vocabulary development (Report No. FL 026913).
Developmental Skills Department, Borough of Manhattan Community College, City University of New York.
West, M. (1953). A general service list of English words. London: Longmans, Green & Co.
Wu, S., Franken, M., & Witten, I. H. (2009). Refining the use of the web (and web search) as a language teaching
and learning resource. Computer Assisted Language Learning, 22(3), 249–268. doi:10.1080/09588220902920250
Wu, S., Li, L., Witten, I. H., & Yu, A. (2016). Constructing a collocation learning system from the Wikipedia corpus. International Journal of Computer-Assisted Language Learning and Teaching, 6(3), 18–35. doi:10.4018/IJCALLT.2016070102
Yeh, Y., Li, Y.-H., & Liou, H.-C. (2007). Online synonym materials and concordancing for EFL college writing.
Computer Assisted Language Learning, 20(2), 131–152. doi:10.1080/09588220701331451
Yoon, H. (2008). More than a linguistic reference: The influence of corpus technology on L2 academic writing. Language Learning & Technology, 12(2). Retrieved from http://llt.msu.edu/vol12num2/yoon.pdf
Yoon, H., & Hirvela, A. (2004). ESL student attitudes toward corpus use in L2 writing. Journal of Second
Language Writing, 13(4), 257–284. doi:10.1016/j.jslw.2004.06.002

ENDNOTES

1. BYU-BNC (http://corpus.byu.edu/bnc/) is the British National Corpus in the Brigham Young University (BYU) corpora collection.
2. http://flax.nzdl.org/greenstone3/flax?a=fp&sa=collAbout&c=collocations
3. http://lexically.net/downloads/BNC_wordlists/e_lemma.txt
4. https://wordnet.princeton.edu/wordnet/
5. This can be experienced by visiting http://flax.nzdl.org/greenstone3/flax?a=fp&sa=collAbout&c=collocations and typing "advantage" in the query box.
6. A technique that quantifies and categorizes semantic similarities between words based on their distributional properties in a large body of text.
7. http://www.excite.com/


Appendix A

Table 12. Collocation patterns

Collocation Pattern                             Example
verb + noun(s)                                  cause problems
verb + noun + noun                              tackle the root cause of
verb + adjective + noun(s)                      take a full responsibility for
verb + preposition + noun(s)                    result in an increase in
gerund verb + noun                              the underlying concept
noun + noun                                     tax increase
noun + of + noun                                concept of power
adjective(s) + noun(s)                          abstract concept
adjective + noun + noun                         a solar energy system
adjective + adjective + noun(s)                 intensive qualitative research
adjective + and/but + adjective + noun(s)       economic and social development
noun + to + verb                                ability to influence
noun + preposition + noun                       difference in opinion
adjective + to + verb                           crucial to understand
adjective + preposition + verb                  positive in their attitude
adverb + adjective                              seriously addicted
verb + pronoun + adjective                      make it easy
verb + to + verb                                cease to amaze
adverb + verb                                   beautifully written
verb + adverb                                   rely heavily on


Appendix B

Table 13. Top 100 query words grouped by word lists (query words in West's top 1000 words)

problem 682 develop 181 listen 130 possibility 105
support 423 ability 177 popular 130 growth 105
influence 383 quality 172 value 129 population 104
effect 378 pressure 171 standard 127 relative 104
concern 306 demand 169 company 123 apply 104
increase 281 time 168 decision 123 condition 103
important 277 look 167 take 122 efficient 100
development 274 result 165 however 119 life 99
opportunity 269 level 163 help 119 talk 99
knowledge 265 use 159 need 118 success 98
work 260 study 158 social 116 get 97
reason 240 provide 154 claim 115 object 96
explanation 238 different 151 measure 115 successful 95
language 234 payment 146 law 114 wealth 95
cause 233 profit 143 family 112 purpose 95
experience 224 effective 142 market 111 substantial 94
poverty 212 several 142 feel 111 good 93
reduce 202 question 142 cost 110 represent 93
inequality 201 advantage 138 doubt 110 reduction 93
go 196 example 135 sense 109 deep 92
relationship 192 difference 134 show 108 idea 92
consider 191 record 134 lack 108 system 92
change 185 opinion 134 situation 106 part 90
effort 185 interest 133 studying 106 society 90
expression 184 make 132 suggest 105 home 90

Table 14. Query words in West’s top 2000 words.

education 299 essential 93 practice 70 pain 50
solution 255 comparison 92 healthy 68 especially 50
information 217 crime 89 behaviour 68 resistance 50
competition 194 discuss 88 treatment 67 remedy 50
risk 191 compete 87 international 65 punishment 50
performance 178 discussion 86 rain 63 commercial 49
improve 173 connection 85 solve 61 excessive 49
reputation 164 management 85 violence 61 afford 48
attention 150 disease 84 confident 60 advice 48
health 143 program 83 combine 60 suitable 46
critical 141 prejudice 83 conscious 59 broadcast 46
responsibility 139 advertisement 82 satisfaction 58 weight 46
tend 127 preference 82 band 58 manage 46
decrease 124 key 82 suspicious 58 extreme 46
confidence 113 encourage 82 explore 58 boundary 46
argue 110 threat 80 severe 58 oppose 45
gap 108 improvement 80 recommendation 54 imitate 45
argument 107 reflect 79 refer 53 debt 45
aim 106 discipline 79 cursed 52 origin 44
skill 103 behavior 79 complain 52 relief 44
compare 102 request 77 double 52 connect 44
avoid 95 tendency 77 interfere 52 rapid 44
responsible 94 damage 75 frequency 51 hate 43
government 94 examine 74 convenient 50 satisfy 43
balance 93 perform 72 habit 50 persuade 43


Table 15. Query words in Coxhead’s 570 academic words

impact 721 distinction 263 perspective 226 respond 191
function 655 academic 254 emerge 225 sufficient 191
evidence 580 attitude 251 design 225 involve 186
significant 567 assumption 250 similar 219 generate 186
research 564 contribution 248 context 218 indicate 185
issue 493 role 245 goal 216 culture 185
benefit 447 appropriate 245 evaluate 216 integrate 184
policy 415 process 245 method 215 facilitate 184
approach 372 factor 244 maintain 214 motivation 183
concept 357 achieve 244 status 214 stress 183
access 349 area 242 decline 214 economy 182
consequence 342 structure 241 financial 211 intervention 182
challenge 342 focus 240 ensure 210 feature 181
strategy 329 perceive 239 dimension 210 diminish 180
investigate 319 despite 238 acknowledge 208 assume 178
environment 311 affect 235 occur 200 conduct 176
economic 309 relevant 235 fundamental 199 hypothesis 174
analysis 307 emphasis 234 rely 197 commitment 174
contribute 300 category 234 conflict 197 complex 173
distribution 299 attribute 233 debate 197 income 172
potential 285 majority 230 automate 196 major 171
promote 284 enhance 229 technology 193 alternative 170
aspect 283 derive 229 investment 193 inevitable 169
vary 277 analyse 227 sustainable 193 eliminate 168
perception 266 aware 226 priority 192 require 167

Table 16. Query words not in West's and Coxhead's wordlists

disparity 227 uneven 76 stereotype 59 comply 49
folklore 194 dissipate 76 impression 58 diet 49
collocation 148 cognitive 76 tackle 58 merger 49
tariff 126 duplicity 76 pollution 58 drawback 49
convey 125 budget 75 obligation 57 boost 47
egalitarian 119 dilemma 72 correlation 57 negotiation 47
barrier 118 vulnerable 71 corpus 57 cumulative 46
genuine 106 routine 71 impress 57 acceleration 46
mobility 103 obstacle 70 lifestyle 56 reform 46
emission 101 profound 69 protest 56 endure 45
tantrums 101 authentic 68 absorb 55 atmosphere 45
ample 98 crisis 68 alleviate 54 vast 45
elaborate 97 career 67 overhaul 53 consolidation 44
engage 96 expenditure 67 entrenched 53 deterioration 44
burden 95 demise 65 campaign 52 discourse 44
mitigate 95 collaboration 64 authenticity 52 feasible 44
backlash 93 interview 63 abuse 52 deficit 43
attest 91 misunderstanding 63 intrepid 51 reckon 43
collaborate 87 criticism 62 bankrupt 51 household 42
negotiate 87 tolerate 62 conspire 51 disparities 42
infection 84 loophole 62 transparency 50 nutrition 42
dispute 79 pension 60 urban 50 nutshell 42
species 78 democracy 60 rap 50 competitive 42
heath 78 vital 60 feedback 49 deploy 41
asset 78 well-known 60 efficiently 49 prosperity 41


Shaoqun Wu is a senior lecturer in the Department of Computer Science at the University of Waikato, New Zealand. Her research interests include computer-assisted language learning, mobile language learning, supporting language learning in MOOCs, digital libraries, natural language processing, and computer science education.

Alannah Fitzgerald is a postdoctoral research fellow with the FLAX language project at the University of Waikato in Aotearoa New Zealand. She is an open education practitioner and researcher working across formal and non-formal higher education. Her research interests include computer-assisted language learning, data-driven learning, English for specific and academic purposes, massive open online courses, open educational practices, and self-regulated learning.

Alex Yu is a senior lecturer at the Centre for Business, Information Technology and Enterprise at the Waikato Institute of Technology (Wintec). He is also one of the core developers of the FLAX project. His research interests include computer-assisted language learning, MOOCs, mobile language learning, and data mining.
