The main difficulty is that, often, the hidden structure of natural language is highly ambiguous. Although this might jeopardise the outcome, developments in NLP have led to a high degree of success in certain tasks. NLP enables us to (JISC, 2008):

● classify words into grammatical categories (e.g., nouns, verbs);
● disambiguate the meaning of a word, among the multiple meanings that it could have, on the grounds of the content of the document;
● parse a sentence, that is, perform a grammatical analysis that enables us to generate a complete representation of the grammatical structure of a sentence, not just identify the main grammatical elements in a sentence.

During this stage of TM, the linguistic data about text are extracted from, and marked up to, the documents, which still hold an unstructured form of data.
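To make the three tasks listed above concrete, a short illustrative Python sketch is given below. It is not the toolchain used in any of the studies cited in this article: it assumes the open-source NLTK library is installed, together with its tokeniser, tagger and WordNet resources, and the example sentence and chunk grammar are invented for illustration.

# A minimal sketch of the NLP tasks listed above, using NLTK (assumed installed,
# with the 'punkt', 'averaged_perceptron_tagger' and 'wordnet' resources downloaded).
import nltk
from nltk.wsd import lesk

sentence = "The bank approved the loan after reviewing the documents."
tokens = nltk.word_tokenize(sentence)

# 1. Classify words into grammatical categories (part-of-speech tagging).
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('The', 'DT'), ('bank', 'NN'), ('approved', 'VBD'), ...]

# 2. Disambiguate a word's meaning from its context (here with the simple Lesk algorithm).
sense = lesk(tokens, "bank")
print(sense, sense.definition() if sense else None)

# 3. A (shallow) grammatical parse: group the tagged words into noun phrases.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))

Full constituency or dependency parsing requires larger grammars or statistical parsers; the chunking rule above is only intended to show the flavour of the task.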
Information Extraction

In order to be mined as any other kind of data, the unstructured natural language document must be turned into data in a structured form. This stage is called Information Extraction, and it is based on the data generated by NLP systems. The most common task performed during this stage is the identification of specific terms, which may consist of one or more words, as in the case of scientific research documents containing many complex multi-word terms.
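A minimal sketch of this step is shown below: free text is turned into a structured term-frequency table, with candidate multi-word terms approximated by frequent two- and three-word sequences. The documents are invented and scikit-learn is assumed to be available; real Information Extraction systems use far richer linguistic evidence than raw n-gram counts.

# A minimal sketch of Information Extraction as structuring: unstructured documents
# become a table of candidate multi-word terms and their frequencies.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Latent semantic analysis is applied to the student essays.",
    "The student essays were scored with latent semantic analysis.",
]

vectoriser = CountVectorizer(ngram_range=(2, 3), stop_words="english")
counts = vectoriser.fit_transform(documents)

# Structured form: one row per term, with its total frequency across documents.
term_table = sorted(
    zip(vectoriser.get_feature_names_out(), counts.sum(axis=0).A1),
    key=lambda item: item[1],
    reverse=True,
)
for term, frequency in term_table[:5]:
    print(term, frequency)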
'analysing', as the aim is to draw useful information from the text data in order to build up new knowledge. To do this, given that the data are now in a structured form, it is possible to apply standard statistical procedures and techniques to them.3

3. Among the most common statistical packages used by researchers, the text analytics tools are 'Text Miner' and 'Enterprise Miner' (SAS), 'TM – Text Mining Infrastructure' (R) and 'Modeler' (SPSS).
Applications of Text Mining

The first applications of TM surfaced in the mid-1980s.4 However, its growth has been led by technological advances in the last ten years. TM has been increasingly employed in applied research in different areas (such as epidemiology, economics and education) as well as for business-related purposes, especially for gaining market and consumer insights and to develop new products. The techniques of TM are common to both academic research and business-oriented analytics.

4. See, for example, the Content Analysis of Verbatim Explanations Research project. http://www.ppc.sas.upenn.edu/cave.htm

From basic word counts to sentiment analyses

Some of the applications of TM require very basic statistics, frequencies for instance. Counting the occurrence of one or more words from a document is the most common TM application, but it does require new ways to visualise this kind of data. For example, Wordle, a free tool available online (http://www.wordle.net/), generates tag clouds of the words contained in a document (Feinberg, 2010). The size of each word is proportional to its relative frequency in the document (similar to a bubble plot).

The technological advances that have fuelled TM development have not just inspired new data visualisations, but also stimulated the collection of new 'textbases', such as Project Gutenberg and Google Books. For instance, digitising and archiving books allows us to calculate the frequency of a word in a book, or in all the books published in a specific year, or to visualise the occurrence of certain words over time. For books available in Google Books, Figure 1 gives an example of the occurrence of the words 'information' and 'news' in books published during the last century. Whilst the word 'news' appears to have been steadily used by authors over the last century, the word 'information' experienced a notable increase: from about the same level as 'news' in the early 1900s, to six times more than 'news' in the year 2000.

Figure 1: Searches for the words 'information' and 'news' in Google Books (digitised books originally published between 1900 and 2000). Image sourced from Google Books Ngram Viewer. Retrieved from https://books.google.com/ngrams
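The kind of counting behind Figure 1 is easy to sketch: the relative frequency of a word among all words published in a given year. The tiny corpus below is an invented stand-in for a textbase such as Project Gutenberg or Google Books, and the figures it prints are purely illustrative.

# A minimal sketch of word-frequency-over-time analysis in the style of Figure 1.
from collections import Counter, defaultdict

corpus = [
    (1905, "the news arrived by telegraph and the news spread quickly"),
    (1955, "information theory changed how information is stored"),
    (2000, "information is everywhere and news travels as information"),
]

totals = defaultdict(int)           # total word count per year
word_counts = defaultdict(Counter)  # per-year counts of individual words

for year, text in corpus:
    tokens = text.lower().split()
    totals[year] += len(tokens)
    word_counts[year].update(tokens)

for word in ("information", "news"):
    for year in sorted(totals):
        share = 100 * word_counts[year][word] / totals[year]
        print(f"{word!r} in {year}: {share:.2f}% of words")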
Word counts and the availability of large-scale 'textbases' give the opportunity to analyse the evolution of literary styles and trends over time and across countries. This kind of analysis belongs to a new field of study known as 'culturomics' (Ball, 2013). For example, in a recent study, a group of researchers mined a sample of 7,733 works obtained from the Project Gutenberg Digital Library written by 537 authors after the year 1550 (Hughes, Foti, Krakauer, & Rockmore, 2012). They focused on the use of 307 content-free words (e.g., prepositions, articles, conjunctions and common nouns), claiming that these words provide a useful stylistic fingerprint for authorship and can be used as a method of comparing author styles. For each author a similarity index with every other author was computed. This index, based on the occurrences of each content-free word considered in the study, was used to explore temporal trends in the usage of content-free words. Their primary finding was that authors tend to have important stylistic connections to other authors closer to them in time, but not necessarily to immediate contemporaries. They noticed that, for books published within three years of each other, the similarity index is very high, but slightly smaller than the one shown for books published within ten years of each other. For books published with a temporal distance of more than ten years, the similarity index decreased
until reaching a stable value for books published with a temporal distance of 350 years.
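The idea of a similarity index built from content-free (function) words can be sketched very simply: each author is represented by the relative frequencies of a fixed list of such words, and pairs of authors are compared with cosine similarity. The word list and texts below are illustrative only and are not those used by Hughes et al. (2012).

# A minimal sketch of a function-word similarity index between two authors.
import math

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "was", "but", "not"]

def profile(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    return [tokens.count(w) / len(tokens) for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

author_a = "it was the best of times and it was the worst of times"
author_b = "the ship sailed out of the harbour and into the storm but not for long"

print(round(cosine(profile(author_a), profile(author_b)), 3))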
A massive amount of text data using digital versions of nearly 3,500 books was processed to investigate how books were connected to one another on criteria such as frequency of words, choice of words and overarching subject matter (Jockers, 2013). Each book was then affixed with unique attributes and plotted graphically. Figure 2 shows the books analysed from the late 1700s to the early 1900s. The books plotted closer to each other represent a close relationship in terms of styles and themes. Figure 2 highlights the example of Herman Melville's Moby Dick, published in 1851, which appears here as an outlier from much of the literary work of the period while still being related to several works by James Fenimore Cooper (Sea Lions, published in 1849, and The Crater, published in 1847).
a) higher sale price and b) lower sale price. Table 1 gives the five terms for both (in order of their association with price). The more expensive houses were described using words which were all related to the physical description of the house, such as 'granite' and 'maple'. Unexpectedly, words such as 'fantastic' and 'charming' were used more often for cheaper houses. The authors suggest that these words are used as a sort of real-estate agent code to attract potential customers for a house which doesn't have many saleable attributes.

Table 1: Terms used in USA real-estate adverts and their association with house price (Dubner & Levitt, 2005).

Five terms associated with higher price    Five terms associated with lower price
Granite                                    Fantastic
State-of-the-art                           Spacious
Corian®                                    !
Maple                                      Charming
Gourmet                                    Great neighbourhood
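The association between advert terms and prices can be sketched with a very simple comparison: the mean price of listings that contain a given term against the mean price of those that do not. The listings and prices below are invented for illustration; the original analysis behind Table 1 is considerably more sophisticated.

# A minimal sketch of term-price association in the spirit of Table 1.
from statistics import mean

listings = [
    ("granite worktops and maple floors throughout", 620_000),
    ("state-of-the-art gourmet kitchen with corian surfaces", 680_000),
    ("fantastic spacious home in a great neighbourhood", 310_000),
    ("charming cottage, fantastic value!", 295_000),
]

def price_gap(term):
    """Mean price of adverts containing the term minus mean price of the rest."""
    with_term = [price for text, price in listings if term in text.lower()]
    without = [price for text, price in listings if term not in text.lower()]
    return mean(with_term) - mean(without)

for term in ("granite", "fantastic"):
    print(term, round(price_gap(term)))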
Word pattern recognition has also been applied to everyday working
One of the most familiar applications of TM technology and machine learning techniques is Google Translate, a free, multilingual translation service provided by Google Inc. to translate written text from/into 63 languages. Google Translate is based on a large-scale statistical analysis, rather than traditional grammatical rule-based analysis. To generate a translation, Google Translate looks for patterns in hundreds of millions of documents that have already been translated by human translators and are available on the web. This process of seeking patterns in large amounts of text is called 'statistical machine translation' (Och, 2005).6 Clearly, the more human-translated documents that Google Translate can analyse in a specific language, the better the translation quality will be.

5. Significant (?) relationships everywhere. Language Log. Retrieved from: http://languagelog.ldc.upenn.edu/nll/?p=4686#more-4686
6. See also the webpage of the Google Research team at http://research.google.com/pubs/MachineTranslation.html

Publicly available data and predictive modelling

With the advent of new technologies, a source of data is not just a document for TM: the search for that document itself can provide useful insights. In the case of documents available online, web searches through search engines can be informative. Google, for example, set up Google Trends, which allows internet users to easily access metrics on Google searches.

An example of such trends is given in Figure 4. It shows the comparison of text searches in Google for the terms 'OCR', 'Edexcel' and 'AQA' (the names of three awarding bodies based in England, Wales and Northern Ireland) from January 2011 to September 2014.7 The searches for the three awarding bodies follow a similar pattern to each other which, not unexpectedly, depicts a seasonal component: the two peaks are in June and January of each year (except for January 2014),8 when the majority of students sit the exams, whilst August has fewer searches, when schools are closed. During examination sessions AQA was the most searched, while OCR had the highest number of searches from September to December.9 Google Trends also provides a list of related searches, that is, popular search terms that are associated with the term searched. In the example given here, for all three awarding bodies, the most related search was their name followed by the term 'past papers' (e.g., 'OCR past papers'). The second most frequent related search was the name of the awarding body followed by 'GCSE' (e.g., 'OCR GCSE'). We also observed that while the most searched subject for OCR and AQA was Biology, for Edexcel it was Mathematics.

Figure 4: Google searches for the terms 'OCR', 'Edexcel' and 'AQA' (January 2011 to September 2014). Image sourced from Google Trends. Retrieved from http://www.google.com/trends

7. Google Trends does not provide data on access to the websites (which is something that Google Analytics does, though this is not publicly accessible). So the data plotted in Figure 4 are not 'visits' to the three awarding bodies' websites, but only 'searches'. Moreover, the data provided do not show the actual volume of searches, but only an indicator estimated in relation to the maximum value of searches across the comparison, which is set to 100.
8. It should be noted that in 2014, there was no January exam sitting.
9. Note that the results might have been different if, for instance, 'Pearson' or 'Pearson Edexcel' had been used instead of 'Edexcel'. Pearson has been the parent company of Edexcel since 2003. In 2010, the legal name of the Edexcel awarding body became Pearson Education Limited (Pearson).

It has been shown that the number of text queries that users enter into web search engines such as Google and Yahoo can be used for predictive modelling for forecasting values of a number of measures of interest. Researchers in epidemiology discovered that search requests for terms like 'flu symptoms' and 'flu treatments' were a good predictor of the number of patients who, in the period 2004–2008, required access to USA hospital emergency rooms in the next two weeks (Polgreen, Chen, Pennock, Nelson & Weinstein, 2008; Ginsberg et al., 2009). With reference to 2013, it was reported that these web searches were predicting more than double the proportion of doctor visits for influenza-like illness that were actually recorded. This was probably caused by a change in the Google search algorithm (Lazer, Kennedy, King, & Vespignani, 2014). Although this discovery can undermine the suitability of web searches as a predictive method, it has been proven to be a good source of
information when combined with traditional sources of data. Web search data combined with official statistics have been extensively used to predict the unemployment rate in different countries such as the US (Ettredge, Gerdes, & Karuga, 2005; D'Amuri & Marcucci, 2010), Germany (Askitas & Zimmermann, 2009) and Israel (Suhoy, 2009). It has also been shown that web search data employed as an explanatory variable, along with the previous historical trends of the dimension of interest, can appreciably improve short-term predictions of other social and economic indicators such as inflation (Guzman, 2011). Therefore, predictive modelling could also enable central banks and other national and international agencies to improve the timing and the accuracy of the policy measures they publish to inform policy makers. It can also be applied to economic metrics for business-related purposes and analysing customer insights.

Evidence has shown that web search queries "…can be useful leading indicators for subsequent consumer purchases in situations where consumers start planning purchases significantly in advance of their actual purchase decision" (Choi & Varian, 2012). For instance, search engine data related to housing search enquiries has been shown to be a more accurate predictor of house sales in the next quarter than the forecasts provided by real estate economists (Wu & Brynjolfsson, 2013). Web search queries have also been successfully employed to improve the predictability of motor vehicle demand and holiday destinations (Choi & Varian, 2012). These are applications of what Choi and Varian call 'contemporaneous forecasting' or 'nowcasting', because they can help in 'predicting the present', rather than the future (Choi & Varian, 2012).
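The basic mechanics of this kind of nowcasting can be sketched in a few lines: the current value of an indicator is predicted from its own previous value plus a contemporaneous web-search index. All of the numbers below are invented, scikit-learn is assumed to be available, and the model is far simpler than those used in the studies cited above.

# A minimal sketch of nowcasting with a search-volume index as an extra predictor.
import numpy as np
from sklearn.linear_model import LinearRegression

# Quarterly indicator (e.g. house sales) and a search-volume index for the same quarters.
sales = np.array([102, 98, 95, 97, 103, 108, 104, 99, 96, 101])
search_index = np.array([55, 48, 44, 47, 58, 66, 60, 50, 45, 54])

# Explanatory variables: last quarter's sales and this quarter's search index.
X = np.column_stack([sales[:-1], search_index[1:]])
y = sales[1:]

model = LinearRegression().fit(X, y)
latest = np.array([[sales[-1], 57]])  # 57 = search index observed for the current quarter
print("nowcast for the current quarter:", round(model.predict(latest)[0], 1))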
The use of predictive modelling has also been adapted by online retailers to gain customer insights. Amazon and Netflix recommendations, for example, rely on predictive models of what book or film a customer might want to purchase on the basis of their history of enquiries to the website or similar purchases made by other customers (Einav & Levin, 2014). In general, online advertising and marketing tends to rely on automated predictive algorithms that target customers who might be interested in responding to offers.

Predictive modelling based on text data extends well beyond the online world. One of the most famous applications is the development of algorithms that make use of text data contained in different forms of communication (e.g., mobile texts and emails) to detect terrorist threats and to identify fraudulent behaviour in healthcare and financial services (Einav & Levin, 2014).

Applications of Text Mining in education

The benefits offered by the interaction of text and other data analytics in improving learning processes are already being valued by education practitioners as well as by learners themselves.

The first example is the implementation of an experimental real-time case study in a business course. Lecturers made use of internet-based software to facilitate written communication among students, teachers and the case organisation. In this way, it was possible to gather a large quantity of text data containing all the email communication among students and the organisation involved in the case study. Applying simple text analytics on real-time written communication, such as counting of specific words, researchers found that, by the end of this experimental teaching approach, students had increased their understanding of a live business problem. Furthermore, from the analysis of text data, it was possible to discover that, during the case study, students learnt how to use a language more similar to the one used in the real business world. In an evaluation of this experiment, students affirmed that they liked this new teaching approach and would like to see more of it at their schools as they found it very applicable to real life (Theroux, 2009).

A second example of the use of TM to gather insights on learners' cognition is a study aimed at analysing students' progression in a computer programming class. In this study, a software package was used to gather data during a programming assignment from nine learners (Blikstein, 2011). The software allowed researchers to build a 1.5 GB dataset of 18 million lines of events (such as keystrokes, code changes, error messages and actual coding snapshots). An in-depth automated exploration of each student's coding strategies summarised by this mixture of structured and text data was compared with those of other students. The author discovered that error rates progressed in an 'inverse parabolic shape'. This means that, initially, students made a lot of mistakes, but they demonstrated that they were able to learn from them through problem-solving and progressed until they had completed their assignment. Although this is a small-scale study and it is not possible to make any claims about statistical significance, it suggests that using a sophisticated TM application might lead to a better understanding of students' coding styles and sophisticated skills such as problem-solving.
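The flavour of mining such activity logs can be sketched briefly: events are grouped by student and session, and the proportion of compilations that fail is tracked over time. The log format, events and numbers below are invented for illustration and are not those of the Blikstein (2011) dataset.

# A minimal sketch of mining a programming-activity log for error rates over time.
from collections import defaultdict

log = [
    ("s1", 1, "compile", "error"), ("s1", 1, "compile", "error"),
    ("s1", 2, "compile", "error"), ("s1", 2, "compile", "ok"),
    ("s1", 3, "compile", "ok"),    ("s1", 4, "compile", "ok"),
]

attempts = defaultdict(int)
errors = defaultdict(int)
for student, session, event, outcome in log:
    if event == "compile":
        attempts[(student, session)] += 1
        errors[(student, session)] += outcome == "error"

for key in sorted(attempts):
    rate = errors[key] / attempts[key]
    print(key, f"error rate = {rate:.0%}")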
An extensive use of the recent developments in NLP has also been employed to automatically detect secondary students' mental models in order to gain a better understanding of their learning processes. In an experiment students were asked to write short paragraphs about the human circulatory system in order to recall knowledge about the topic. Using an intelligent tutoring system (MetaTutor) that teaches students self-regulatory processes during learning of complex Science topics and applying TM techniques, researchers explored which particular machine learning algorithm would enable them to accurately classify each student in terms of their content knowledge (Rus & Azevedo, 2009). Mental models represent an expanding field of research among cognitive psychologists and are aimed at better understanding how well an individual organises content in meaningful ways. TM allows researchers to undertake analysis that can reveal inaccuracies and omissions that are crucial for deep understanding and application of course material, thus informing improvements in course design.10

10. For more details on mental model assessment in education see http://mentalmodelassessment.org/
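The general machine learning step explored in this line of work can be sketched as a standard text-classification problem: student explanations, labelled with a mental-model category, are vectorised and used to train a classifier. The sketch below is illustrative only (it is not the MetaTutor pipeline); the texts and labels are invented and scikit-learn is assumed to be available.

# A minimal sketch of classifying student explanations into mental-model categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

explanations = [
    "the heart pumps blood in one loop around the body",
    "blood goes to the lungs and then back to the heart before the body",
    "the heart pushes blood around a single circuit",
    "there are two circuits, one to the lungs and one to the body",
]
labels = ["single_loop", "double_loop", "single_loop", "double_loop"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(explanations, labels)

print(model.predict(["blood travels to the lungs first and then around the body"]))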
A number of systems using TM have been developed for automated marking of essays and short, free text responses (for an example of the latter see Sukkarieh et al., 2003). Some of the most widely used automated essay marking systems available in the market include: Project Essay Grader, Intelligent Essay Assessor, E-rater, Criterion, IntelliMetric, MY Access and Bayesian Essay Test Scoring System. They have been developed to reduce time and cost and improve reliability and generalisability of the process of assessment in low-stakes classroom tests, as well as for large-scale assessment such as national standardised examinations. The accuracy and reliability of these automated systems have been investigated by educational researchers in the last fifteen years. Along with the benefits of using TM, some of its disadvantages, such as the lack of human interaction and the need for a large corpus of sample texts to train the system, have also been reported (Dikli, 2006). Automated essay marking systems do not really understand the texts as humans do, so it is not possible to affirm that they emulate the human marking process. Notwithstanding, automated essay marking systems show high agreement rates with human markers; and their supporters advocate that the main role of these systems today is not to replace teachers and assessors, but to assist them, incorporating these systems as a supplementary marker, especially in large-scale writing assessments (Monaghan & Bridgeman, 2005; Kersting, Sherin & Stigler, 2014).
A particular example of automated essay marking is the tool developed by a team of researchers at Maastricht University to stimulate students to become active and collaborative learners. It has been used in Statistics courses to assess students on their understanding of course content. It makes use of advanced NLP and Latent Semantic Analysis algorithms that can be used in automatic marking of the texts. Mining students' essays, researchers were easily able to automatically discriminate between the reference book chapter text and the documents of the students. However, it is less clear whether this tool is able to discriminate students from one another (Imbos & Ambergen, 2010).
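The core of Latent Semantic Analysis can be sketched in a few lines: texts are represented as TF-IDF vectors, projected into a low-dimensional 'semantic' space, and compared by cosine similarity. The sketch below illustrates the general technique, not the Maastricht tool; the reference sentence and essays are invented and scikit-learn is assumed to be available.

# A minimal sketch of LSA-style comparison of student texts with a reference text.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "the sampling distribution of the mean narrows as the sample size grows"
essays = [
    "with larger samples the distribution of the sample mean becomes narrower",
    "correlation does not imply causation between two variables",
]

texts = [reference] + essays
tfidf = TfidfVectorizer().fit_transform(texts)

# Project the texts into a low-dimensional semantic space and compare each essay
# with the reference text by cosine similarity in that space.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
for essay, vector in zip(essays, lsa[1:]):
    score = cosine_similarity([lsa[0]], [vector])[0, 0]
    print(round(score, 2), essay)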
Despite its weaknesses, marking essays automatically continues to attract the attention of schools, universities, assessment organisations, researchers and educators. Although it might be difficult for these systems to supersede human markers, TM can be employed to support human markers as a second or third marker (see, for instance, Landauer, 2003 and Attali & Burstein, 2006). The Centre for Digital Education (CDE) reported that, in the USA, around $20 billion was spent on public education in Information Technology in 2012, with an increase of 2 per cent from the previous year.11 The awareness of the potential of TM and DM in, for instance, formative assessment, has led McGraw-Hill to develop two different tools, Acuity Predictive Assessment and Acuity Diagnostic Assessment, aimed at informing teachers and learners about their performance and how to improve it (CDE, 2014).

These tools can be employed for formative assessment. Predictive modelling of text data can provide an early indication of how students will perform on a standardised test. It allows assessment of the gap between what students are expected to know and what they actually know. It can also provide evidence regarding which area of the syllabus they have to focus on to improve their performance (West, 2012). Also, more advanced analysis could be informative to teachers about which particular teaching techniques are more efficient for specific students and the best ways to tailor the learning approach to them (Bienkowski, Feng & Means, 2012).
Students' reading comprehension, for example, has been the object of a study based on the use of intelligent tutoring software. The analysis of data such as students' reading mistakes and word knowledge gathered through a speech recognition tool showed that re-reading an old story helped pupils learn half as many words as reading a new story (Beck & Mostow, 2008). An online tool called WebQuest provides activities designed for teachers to train pupils in skills such as information acquisition and evaluation of online materials. Students who have experienced these kinds of activities have reportedly enjoyed the collaborative and interactive nature of the activities (Perkins & McKnight, 2005).

Predictive modelling in educational assessment has been mainly based on numeric data (e.g., days of truancy, overall grades and disciplinary problems). However, text data could be used to enable more in-depth analyses in order to get better insights on assessment. For example, Worsley & Blikstein (2011) examined students' dialogues along with other qualitative and quantitative data to develop predictors for student expertise in the area of Engineering design. By leveraging the tools of machine learning, NLP, speech analysis and sentiment extraction, the authors identified a number of distinguishing factors of learners at different levels of expertise. According to the study, these kinds of findings motivate further research in this field and the development of a new paradigm for the evaluation of learner knowledge construction.

Discussion

The key advantage provided by TM is the opportunity to exploit text records on a very large scale. In this article we have briefly described the techniques of TM and some of its applications.

TM has a variety of potential applications in the field of education. In formative and summative assessment, for instance, it could be used to understand trends in vocabulary usage over time and the use of spelling and punctuation. To date, these applications have been carried out by teachers and assessment experts without using advanced techniques such as TM, but TM allows the possibility of implementing these applications on a more comprehensive scale. The developments in NLP allow educational professionals to analyse the language structure of a vast amount of text documents in just a few minutes, and the ongoing developments in this field could result in an increase in the accuracy of the findings.

The availability of novel data could lead, at least in principle, to novel measurement and research designs to address old and new research questions. However, working with very large, rich and new kinds of datasets, it might not be straightforward to figure out what questions the data could answer accurately. Asking the right question might be more important now than ever (Einav & Levin, 2014). Exploiting large text datasets without a proper research question might lead to a significant waste of resources.

More heterogeneous and in-depth data could allow researchers to move from methods that allow the estimation of average relationships in the population towards differential effects for specific subpopulations of interest. This could mean looking at particular categories of students,
References

Acerbi, A., Lampos, V., Garnett, P., & Bentley, R. A. (2013). The expression of emotions in 20th century books. PLoS ONE, 8(3), e59030.

Ananiadou, S., Chruszcz, J., Keane, J., McNaught, J., & Watry, P. (2005). The National Centre for Text Mining: Aims and Objectives. Ariadne, 42. Retrieved from: http://www.ariadne.ac.uk/issue42/ananiadou.

Anawis, M. (2014). Text Mining: The Next Data Frontier. Scientific Computing. Retrieved from: http://www.scientificcomputing.com/blogs/2014/01/text-mining-next-data-frontier.
Attali, Y., & Burstein, J. (2006). Automated Essay Scoring With e-rater® V.2. The Journal of Technology, Learning, and Assessment, 4(3).

Ball, P. (2013, 21 March). Text mining uncovers British reserve and US emotion. Nature. Retrieved from: http://www.nature.com/news/text-mining-uncovers-british-reserve-and-us-emotion-1.12642.

Beck, J., & Mostow, J. (2008). How Who Should Practice: Using Learning Decomposition to Evaluate the Efficacy of Different Types of Practice for Different Types of Students. In B. Woolf, E. Aïmeur, R. Nkambou & S. Lajoie (Eds.), Intelligent Tutoring Systems, (5091), 353–362. Springer Berlin Heidelberg.

Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief. U.S. Department of Education, Office of Educational Technology. Retrieved from: http://tech.ed.gov/wp-content/uploads/2014/03/edm-la-brief.pdf

Blikstein, P. (2011). Using learning analytics to assess students' behavior in open-ended programming tasks. Paper presented at the Proceedings of the 1st International Conference on Learning Analytics and Knowledge.

Centre for Digital Education (CDE) (2013). Big Data, Big Expectations. The Promise and Practicability of Big Data for Education. The Centre for Digital Education. Retrieved from: http://www.centerdigitaled.com/paper/259374351.html

Ceron, A., Curini, L., Iacus, S. M., & Porro, G. (2014). Every tweet counts? How sentiment analysis of social media can improve our knowledge of citizens' political preferences with an application to Italy and France. New Media and Society, 16(2), 340–358.

Choi, H., & Varian, H. (2012). Predicting the Present with Google Trends. Economic Record, 88(1), 2–9.

D'Amuri, F., & Marcucci, J. (2010). "Google it!" Forecasting the US unemployment rate with a Google job search index. ISER Working Paper Series 2009–32. Institute for Social & Economic Research (ISER).

Dhawan, V., & Zanini, N. (2014). Big data and social media analytics. Research Matters: A Cambridge Assessment Publication, 18, 36–41.

Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Technology, Learning and Assessment, 5(1).

Dubner, S. J., & Levitt, S. D. (2005). Freakonomics: A Rogue Economist Explores the Hidden Side of Everything. New York City: William Morrow.

Einav, L., & Levin, J. D. (2014). The Data Revolution and Economic Analysis. Innovation Policy and the Economy, 14(1), 1–24.

Ettredge, M., Gerdes, J., & Karuga, G. (2005). Using web-based search data to predict macroeconomic statistics. Communications of the ACM, 48(11), 87–92.

Feinberg, J. (2010). Wordle, in J. Steele & N. Iliinsky (Eds.), Beautiful Visualization. Sebastopol: O'Reilly Media, Inc.

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012–1014.

Guzman, G. (2011). Internet search behavior as an economic forecasting tool: The case of inflation expectations. Journal of Economic and Social Measurement, 36(3), 119–167.

Hughes, J. M., Foti, N. J., Krakauer, D. C., & Rockmore, D. N. (2012). Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences, 109(20), 7682–7686.

Huijnen, P., Laan, F., de Rijke, M., & Pieters, T. (2014). A Digital Humanities Approach to the History of Science. In A. Nadamoto, A. Jatowt, A. Wierzbicki & J. Leidner (Eds.), Social Informatics, (8359), 71–85. Springer Berlin Heidelberg.

Imbos, T., & Ambergen, T. (2010). Text analytic tools for the cognitive diagnosis of student writings. Paper presented at the Proceedings of ICOTS8, International Conference on Teaching Statistics.

JISC (2008). Text Mining Briefing Paper. Joint Information Systems Committee. Retrieved from: http://jisc.ac.uk/media/documents/publications/bptextminingv2.pdf.

Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History. Champaign: University of Illinois Press.

Kersting, N. B., Sherin, B. L., & Stigler, J. W. (2014). Automated Scoring of Teachers' Open-Ended Responses to Video Prompts: Bringing the Classroom-Video-Analysis Assessment to Scale. Educational and Psychological Measurement, 74(6), 950–974.

Landauer, T. K. (2003). Automatic Essay Assessment. Assessment in Education: Principles, Policy & Practice, 10(3), 295–308.

Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014, 14 March). The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176), 1203–1205. Retrieved from: http://www.sciencemag.org/content/343/6176/1203.

Lohr, S. (2012, 11 February). The Age of Big Data. The New York Times. Retrieved from: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all&_r=0.

Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Boston: MIT Press.

Monaghan, W., & Bridgeman, B. (2005). E-rater as a Quality Control on Human Scorers. ETS R&D Connections. Retrieved from: http://www.ets.org/Media/Research/pdf/RD_Connections2.pdf.

Och, F. J. (2005). Statistical Machine Translation: Foundations and Recent Advances. Retrieved from: http://www.mt-archive.info/MTS-2005-Och.pdf.

Perkins, R., & McKnight, M. L. (2005). Teachers' attitudes toward WebQuests as a method of teaching. Computers in the Schools, 22(1–2), 123–133.

Polgreen, P. M., Chen, Y., Pennock, D. M., Nelson, F. D., & Weinstein, R. A. (2008). Using Internet Searches for Influenza Surveillance. Clinical Infectious Diseases, 47(11), 1443–1448.

Rogers, S. (2011, 28 July). Data journalism at the Guardian: what is it and how do we do it? The Guardian Datablog. Retrieved from: http://www.theguardian.com/news/datablog/2011/jul/28/data-journalism

Rus, V., Lintean, M., & Azevedo, R. (2009). Automatic Detection of Student Mental Models during Prior Knowledge Activation in MetaTutor. Paper presented at the International Conference on Educational Data Mining (EDM) (2nd, Cordoba, Spain, July 1–3, 2009). International Working Group on Educational Data Mining.

Suhoy, T. (2009). Query indices and a 2008 downturn: Israeli data. Discussion Paper No. 2009.06. Research Department, Bank of Israel.

Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: using computational linguistics to score short, free-text responses. Paper presented at the Proceedings of the 29th International Association for Educational Assessment (IAEA) Annual Conference.

Theroux, J. M. (2009). Real-time case method: analysis of a second implementation. Journal of Education for Business, 84(6), 367–373.

United Nations (UN) (2014). Mining Indonesian Tweets to Understand Food Price Crises. UN Global Pulse Report. Retrieved from: http://www.unglobalpulse.org/sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-Crises%20copy.pdf.

West, D. M. (2012). Big Data for Education: Data Mining, Data Analytics, and Web Dashboards. Retrieved from: http://www.brookings.edu/research/papers/2012/09/04-education-technology-west.

Worsley, M., & Blikstein, P. (2011). Using machine learning to examine learner's engineering expertise using speech, text, and sketch analysis. Paper presented at the 41st Annual Meeting of the Jean Piaget Society (JPS). University of California, Berkeley.

Wu, L., & Brynjolfsson, E. (2013). The future of prediction: How Google searches foreshadow housing prices and sales, in S. M. Greenstein, A. Goldfarb and C. Tucker (Eds.), Economics of Digitization. Chicago: University of Chicago Press.