0% found this document useful (0 votes)

7 views7 pages

Research Paper Text Mining

The article provides an introduction to Text Mining (TM), a set of techniques for analyzing unstructured text data, which has gained importance due to technological advancements and the rise of Big Data. It outlines the stages of TM, including Information Retrieval, Natural Language Processing, Information Extraction, and Data Mining, and discusses various applications across fields such as policy-making, business, and academia. The article also highlights the potential of TM in extracting insights from large text corpuses, including sentiment analysis and literary style comparisons.

Uploaded by

Ayesha Asad

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

7 views7 pages

Research Paper Text Mining

Uploaded by

Ayesha Asad

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 7

This is a single article from Research Matters: A Cambridge Assessment publication. http://www.cambridgeassessment.org.

uk/research-matters/
© UCLES 2015

Text Mining: An introduction to theory and some

applications
Nadir Zanini and Vikas Dhawan Research Division

Introduction academic community and to provide leadership in its use in learning,

teaching, research and administration. The potential of TM has also been
Recent technological advances have led to the availability of new types of recognised elsewhere in the world. For example, in Italy, Cineca
observations and measurements that were previously not available and (a consortium made up of 54 Italian universities and the Ministry of
that have fuelled the ‘Big Data’ trend (Dhawan & Zanini, 2014). Along Education, University and Research) has been using one of the most
with standard structured forms of data (containing mainly numbers), powerful computers in the world to design and develop information
modern databases include new forms of unstructured data comprising systems and TM solutions for public administration, health care and
words, images, sounds and videos which require new techniques to be business.
exploited and interpreted. TM can be a strategic source of evidence-based information that can
This article focusses on Text Mining (TM), that is a set of statistical and support the decision-making process in different fields, from policy-
computer science techniques specifically developed to analyse text data. making to business. For this reason, researchers and practitioners from
It aims to give a theoretical introduction to TM and to provide some various fields are using TM.
examples of its applications. Text has always been an informative source
of insight into a specific field or individuals. However, with the advent of
new technologies, text data are also being predominantly used in new The logic (and technology) behind Text Mining
forms of communication. New sources of text data are now available,
such as text messaging, social media activity, blogs and web searches. Broadly speaking, the overarching goal of TM is to turn text into data so
The increasing availability of published text, sophisticated technologies that it is suitable for analysis. To achieve this there is a need for applying
and growing interest in organisations in extracting information from text computationally-intensive artificial intelligence algorithms and statistical
has led to replacing (or at least supplementing) the human effort with techniques to text documents. As stated in a JISC briefing paper (JISC,
automatic systems. 2008), TM employs a wide range of tasks that can be combined together
TM can be used for a variety of scopes, ranging from basic descriptions into a single workflow, in which it is possible to distinguish four different
of text content through word counts to more sophisticated uses such as stages:
finding links between authors and evaluating the content of scripts (e.g., 1. Information Retrieval
automated marking of essays). 2. Natural Language Processing (NLP)
TM refers to the process of extracting meaningful numeric indices from
3. Information Extraction and
text. It owes its origin to a combination of various related fields – Data
4. Data Mining.
Mining (DM), Artificial Intelligence, Statistics, Database Management,
Library Science and Linguistics (Anawis, 2014). Its basic purpose is to Information retrieval
process the unstructured information contained in text data in order to
The first stage of TM is to identify the relevant documents from a large
make text accessible to various DM statistical algorithms. This could help
collection of digital text documents. Information Retrieval systems used
make text data as informative as standard structured data and allow us
are aimed at identifying the subset of documents which match a user’s
to investigate relationships and patterns which would otherwise be
query. Two examples of Information Retrieval systems are the tools used
extremely difficult, if not impossible, to discover. With TM, information
in libraries to search for books on a specific topic and web search engines
contained in the text can be categorised and clustered with the aim of
(e.g., Google, Bing) designed to search for information in the World Wide
producing results such as word frequency distribution, pattern
Web.
recognition and predictive analytics which might not be easily available
using standard data (JISC, 2008). Natural Language Processing
The possibility of analysing text data is recognised as one of the main
Once a subset of text documents has been retrieved the character strings
elements of the Big Data trend (Lohr, 2012) and a leading source of
have to be processed in order to be analysed by computers. The
information for data journalism (Rogers, 2011). In recent years, greater
computers need to be fed input in a specific format so that they can
understanding of the potential of TM has led government/public
understand natural languages as humans do (Manning & Schütze, 1999).
authorities and private organisations to play an active role in developing
this technology. The National Centre for Text Mining (NaCTeM) was
possibly the first publicly-funded TM centre in the world1, established by
1. See NaCTeM web page at http://www.nactem.ac.uk/
the UK’s JISC2 and operated by the University of Manchester (for an
2. JISC (formerly known as the Joint Information Systems Committee) is a UK non-departmental
introduction to NaCTeM see Ananiadou, 2005). NaCTeM was established public body whose role is to support post-compulsory education and research, providing
in 2004 to provide TM services in response to the requirements of the UK leadership in the use of ICT in learning, teaching, research and administration.

38 | R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5
The main difficulty is that, often, the hidden structure of natural From basic word counts to sentiment analyses
language is highly ambiguous. Although this might jeopardise the Some of the applications of TM require very basic statistics, frequencies
outcome, developments in NLP have led to a high degree of success in for instance. Counting the occurrence of one or more words from a
certain tasks. NLP enables us to (JISC, 2008): document is the most common TM application, but it does require new
● classify words into grammatical categories (e.g., nouns, verbs); ways to visualise this kind of data. For example, Wordle, a free tool
● disambiguate the meaning of a word, among the multiple meanings available online (http://www.wordle.net/) generates tag clouds of the
that it could have, on the grounds of the content of the document; words contained in a document (Feinberg, 2010). The size of each word is
● parse a sentence, that is, perform a grammatical analysis that enables proportional to its relative frequency in the document (similar to a
us to generate a complete representation of the grammatical bubble plot).
structure of a sentence, not just identify the main grammatical The technological advances that have fuelled TM development have
elements in a sentence. not just inspired new data visualisations, but also stimulated the
collection of new ‘textbases’, such as Project Gutenberg and Google
During this stage of TM, the linguistic data about text are extracted from, Books. For instance, digitising and archiving books allows us to calculate
and marked-up to, the documents which still hold an unstructured form the frequency of a word in a book, or in all the books published in a
of data. specific year or to visualise the occurrence of certain words over time.
For books available in Google Books, Figure 1 gives an example of the
Information Extraction
occurrence of the words ‘information’ and ‘news’ in books published
In order to be mined as any other kind of data, the unstructured natural during the last century. Whilst the word ‘news’ appears to have been
language document must be turned into data in a structured form. This steadily used by authors over the last century, the word ‘information’
stage is called Information Extraction and it is the data generated by experienced a notable increase: from about the same level as ‘news’ in
NLP systems. The most common task performed during this stage is the the early 1900s, to six times more than ‘news’ in the year 2000.
identification of specific terms, which may consist of one or more words,
as in the case of scientific research documents containing many complex 0.0450%

0.0400%
multi-word terms. information
Percentage of words in

0.0350%
Google Books sample

Information extraction also allows us to link names and entities (e.g.,

0.0300%
people and the organisation to which they are affiliated) and more 0.0250%
complex facts such as relationships between events or names. 0.0200%

0.0150%

Data Mining 0.0100%

news
0.0050%
When the structured database is filled with the information extracted
0.0000%
from the annotated documents provided by NLP algorithms, data are 1900 1910 1920 1930 1940 1950 1960 1970 1980 1990 2000

finally ready to be mined. In this context ‘mining’ is a synonym of Years

‘analysing’, as the aim is to draw useful information from the text data in Figure 1: Searches for the words ‘information’ and ‘news’ in Google Books
order to build up new knowledge. To do this, given that data are now in a (digitalised books originally published between 1900 and 2000)

structured form, it is possible to make use of standard statistical Image sourced from Google Books Ngram Viewer.
Retrieved from https://books.google.com/ngrams
procedures and techniques applied to text data that are now in
structured form.3
Word counts and the availability of large-scale ‘textbases’ give the
opportunity to analyse the evolution of literary styles and trends over
Applications of Text Mining time and across countries. This kind of analysis belongs to a new field of
study known as ‘culturomics’ (Ball, 2013). For example, in a recent study,
The first applications of TM surfaced in the mid-1980s.4 However its a group of researchers mined a sample of 7,733 works obtained from the
growth has been led by technological advances in the last ten years. TM Project Gutenberg Digital Library written by 537 authors after the year
has being increasingly employed in applied research in different areas 1550 (Hughes, Foti, Krakauer,& Rockmore , 2012). They focused on the
(such as epidemiology, economics and education) as well as for business- use of 307 content-free words (e.g., prepositions, articles, conjunctions
related purposes, especially for gaining market and consumer insights and and common nouns) claiming that these words provide a useful stylistic
to develop new products. The techniques of TM are common to both fingerprint for authorship and can be used as a method of comparing
academic research and business-oriented analytics. author styles. For each author a similarity index with every other author
was computed. This index, based on the occurrences of each content-free
word considered in the study, was used to exploit temporal trends in the
usage of content-free words. Their primary finding was that authors tend
to have important stylistic connections to other authors closer to them
in time, but not necessarily to immediate contemporaries. They noticed
3. Among the most common statistical packages used by researchers, the text analytics tools are that, for books published within three years of each other, the similarity
‘Text Miner’ and ‘Enterprise Miner’ (SAS), ‘TM – Text Mining Infrastructure’ (R) and ‘Modeler’ index is very high, but slightly smaller than the one shown for books
(SPSS).
published within ten years of each other. For books published with a
4. See, for example, the Content Analysis of Verbatim Explanations Research project.
http://www.ppc.sas. upenn.edu/cave.htm temporal distance of more than ten years, the similarity index decreased

R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5 | 39
until reaching a stable value for books published with a temporal distance
of 350 years. 1.0

Another innovative piece of research, carried out by Matthew Jockers

of the University of Nebraska-Lincoln, focused on comparing the stylistic
and thematic connections amongst eighteenth and nineteenth century 0.5

Joy-Sadness (z-scores)
authors. A massive amount of text data using digital versions of nearly
3,500 books was processed to investigate how books were connected to
one another on criteria such as frequency of words, choice of words and 0.0
overarching subject matter (Jockers, 2013). Each book was then affixed
with unique attributes and plotted graphically. Figure 2 shows the books
analysed from the late 1700s to the early 1900s. The books plotted closer –0.5
to each other represent a close relationship in terms of styles and
themes. Figure 2 highlights the example of Herman Melville’s Moby Dick
published in 1851 which appears here as an outlier from much of the –1.0
literary work of the period while still being related to several works by
James Fenimore Cooper (Sea Lions published in 1849 and The Crater
1900 1920 1940 1960 1980 2000
published in 1847).
Year
Moby Dick

Mardi Figure 3: Historical periods of positive and negative moods

Note: Difference between z-scores of Joy and Sadness for years from 1900 to 2000
Typee
Cooper, The Crater (raw data and smoothed trend). Values above zero indicate generally ‘happy’ periods,
Cooper, Sea Lions and values below the zero indicate generally ‘sad’ periods.
Melville Cluster Omo-o
Image originally published by Acerbi et al. (2013) under open access licence.
Retrieved from hhttp://www.plosone.org/static/licence

information from text documents such as social media posts. Sentiment

analysis is one of the main research strands of Global Pulse, a new
initiative by the United Nations (UN) aimed at leveraging the use of Big
Data for global development. In a recent work, Twitter conversations
related to food price inflation amongst Indonesians were investigated.
The research found a significant correlation between official food
inflation rates and the number of tweets about this topic (UN, 2014).
The study concluded that automated monitoring of public sentiment on
social media, combined with contextual knowledge, has the potential to
Figure 2: Graphical distribution that displays connections, insights and trends be a valuable real-time alternative to official statistics (usually released
about the literary world from the late 1700s to the late 1900s
after a certain time lag) and to uncover people’s reactions in contexts
Image courtesy of Matthew Jockers (University of Nebraska-Lincoln).
where the use of social media is widespread.
Sentiment mining has also been exploited in other research contexts,
Recent research led by Durham University studied the use of emotion- such as the understanding of political and historical trends (Ceron, Curini,
related words in recent history (Acerbi, Lampos, Garnett & Bentley, Iacus & Porro, 2014; Huijnen, laan, de Rijke & Pieters, 2014). Social media
2013). Based on these words this research found that there was a ‘sad’ websites and other computational tools (e.g., Google Books Ngram
peak during the Second World War and two ‘happy’ peaks – one in the Viewer) are being used for research in this area. This approach could help
1920s and another in the 1960s (see Figure 3). A ‘sad’ period was also retrieve hidden information in a large corpus of text documents including
noticed during the 1970s and the 1980s followed by an increase in speech transcripts by writers and speakers.
happiness-related words around 1990–2000. The study pointed out that
in general, the use of emotion-related words has reduced in the past Links amongst words and text pattern recognition
century. The study also compared historical trends in the use of emotion- Basic statistics are sufficient to summarise, categorise and cluster
related words between British and American authors. Prior to 1980, the information from text documents. TM, in addition, may be helpful to
difference between them was barely significant, but since then emotion- generate meaningful links across different documents when decision-
related words have been used more frequently by US authors than UK makers are overloaded with unstructured information, such as news
ones. articles in the case of financial market agents. At times, TM could help
Mining of social opinions is becoming a common marketing and brand reveal unexpected connections between documents. The relationship
management strategy used by organisations. This kind of analysis between the use of certain words in real estate advertisements and the
includes understanding what people say or share in their everyday life, price of the house advertised make an interesting example. In their best-
particularly online. This area of research is known as ‘opinion mining’ or selling book Freakonomics, Dubner and Levitt (2005) listed five terms
‘sentiment analysis’. Its aim is to identify and extract subjective commonly used in real-estate advertisements in the USA associated with

40 | R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5
a) higher sale price and b) lower sale price. Table 1 gives the five terms for An example of such trends is given in Figure 4. It shows the
both (in order of their association with price). The more expensive houses comparison of text searches in Google for the terms ‘OCR’, ‘Edexcel’ and
were described using words which were all related to the physical ‘AQA’ (the names of three awarding bodies based in England, Wales and
description of the house such as ‘granite’ and ‘maple ’. Unexpectedly, Northern Ireland) from January 2011 to September 2014.7 The searches
words such as ‘fantastic’ and ‘charming’ were used more often for for the three awarding bodies follow a similar pattern to each other
cheaper houses. The authors suggest that these words are used as a sort which, not unexpectedly, depict a seasonal component: the two peaks are
of real-estate agent code to attract potential customers for a house in June and January of year each (except for January 20148), when the
which doesn’t have many saleable attributes. majority of students sit the exams, whilst August has fewer searches,
when schools are closed. During examination sessions AQA was the most
Table 1: Terms used in USA real-estate adverts and their association with house searched, while OCR had the highest number of searches from
price (Dubner & Levitt, 2005). September to December.9 Google Trends also provides a list of related
searches, that is, popular search terms that are associated with the term
Five terms associated with Five terms associated with searched. In the example given here, for all three awarding bodies, the
higher price lower price
most related search was their name followed by the term ‘past papers’
Granite Fantastic (e.g., ‘OCR past papers’). The second most frequent related search was the
State-of-the art Spacious name of the awarding body followed by ‘GCSE’ (e.g., OCR GCSE). We also
Corian® ! observed that while the most searched subject for OCR and AQA was
Maple Charming
Biology, for Edexcel it was Mathematics.
Gourmet Great neighbourhood
100

However, we need to be careful in drawing interpretations from text 80

data. For instance, it has been reported in a post written in a language

60
blog by the computational linguistic Mark Liberman5 that statistically
significant correlations were unexpectedly found between words
40
apparently not linked, such as ‘some’ and ‘all’, ‘the’ and ‘you’. This suggests
that, although it is not hard to find patterns in large datasets, the results 20

may not be meaningful or not always straightforward to interpret and the

patterns could also be attributed purely to sampling error. 0

Jan 2
13

Ma 13

3
13

No 13

Jan 3
Ma 14

Ma 14

4
14

4
11

Ma 11

Jan 1
12

Ma 12

2
12

No 12

01
01

01
Word pattern recognition has also been applied to everyday working

20
20

p2
r2

r2
r2

Jul

Jul
Jan

Jul

Ma
Ma

Ma
No

Se
Se

life. Automated systems (known as eCRM – Customer Relationship

OCR Edexcel AQA
Management) have been developed as an attempt to categorise incoming
email, and to automatically respond to users with standard answers to Figure 4: Text searches for ‘OCR’, ‘Edexcel’, and ‘AQA’ from January 2011 to
frequently asked questions. September 2014

One of the most familiar applications of TM technology and machine Image sourced from Google Trends. Retrieved from http://www.google.com/trends
learning techniques is Google Translate, a free, multilingual translation
service provided by Google Inc. to translate written text from/into 63 It has been shown that the number of text queries that users enter
languages. Google Translate is based on a large scale statistical analysis, into web search engines such as Google and Yahoo can be used for
rather than traditional grammatical rule-based analysis. To generate a predictive modelling for forecasting values of a number of measures of
translation, Google Translate looks for patterns in hundreds of millions of interest. Researchers in epidemiology discovered that search requests for
documents that have already been translated by human translators and terms like ‘flu symptoms’ and ‘flu treatments’ were a good predictor of
are available on the web. This process of seeking patterns in large the number of patients who, in the period 2004–2008, required access to
amounts of text is called ‘statistical machine translation’ (Och, 2005). 6 USA hospital emergency rooms in the next two weeks (Polgreen, Chen,
Clearly, the more human-translated documents that Google Translate can Pennock, Nelson & Weinstein, 2008; Ginsberg et al., 2009). With reference
analyse in a specific language, the better the translation quality will be. to 2013, it was reported that these web searches were predicting more
than double the proportion of doctor visits for influenza-like illness that
Publicly available data and predictive modelling were actually recorded. This was probably caused by a change in the
With the advent of new technologies, a source of data is not just a Google search algorithm (Lazer, Kennedy, King, & Vespignani, 2014).
document for TM, the search for that document itself can provide useful Although this discovery can undermine the suitability of web searches as
insights. In the case of documents available online, web searches through a predictive method, it has been proven to be a good source of
search engines can be informative. Google, for example, set up Google
Trends, which allows internet users to easily access metrics on Google 7. Google Trends does not provide data on the access to the website (which is something that
Google Analytics does, though this is not publicly accessible). So the data plotted in Figure 4 are
searches.
not ‘visits’ to the three awarding bodies’ websites, but only ‘searches’. Moreover, data provided
does not show the actual volume of searches, but only an indicator estimated in relation to the
maximum value of searches across the comparison which is set to 100.

5. Significant (?) relationships everywhere. Language Log. Retrieved from: 8. It should be noted that in 2014, there was no January exam sitting.
http://languagelog.ldc.upenn.edu/nll/ ?p=4686#more-4686 9. Note that the results might have been different if, for instance, ‘Pearson’ or ‘Pearson Edexcel’ had
6. See also the webpage of the Google Research team at http://research.google.com/pubs/Machine been used instead of ‘Edexcel’. Pearson has been the parent company of Edexcel since 2003. In
Translation.html 2010, the legal name of the Edexcel awarding body became Pearson Education Limited (Pearson).

R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5 | 41
information when combined with traditional sources of data. Web search business problem. Furthermore, from the analysis of text data, it was
data combined with official statistics have been extensively used to possible to discover that, during the case study, students learnt how to
predict the unemployment rate in different countries such as the US use a language more similar to the one used in the real business world.
(Ettredge, Gerdes, & Karuga, 2005; D’Amuri & Marcucci, 2010), Germany In an evaluation of this experiment, students affirmed that they liked
(Askitas & Zimmermann, 2009) and Israel (Suhoy, 2009). It has also been this new teaching approach and would like to see more of it at their
shown that web search data employed as an explanatory variable, along schools as they found it very applicable to real life (Theroux, 2009).
with the previous historical trends of the dimension of interest, can A second example of the use of TM to gather insights on learners’
sensibly improve short-term predictions of other social and economic cognition is a study aimed at analysing students’ progression in a
indicators such as inflation (Guzman, 2011). Therefore, predictive computer programming class. In this study, a software package was used
modelling could also enable central banks and other national and to gather data during a programming assignment from nine learners
international agencies to improve the timing and the accuracy of the (Blikstein, 2011). The software allowed researchers to build a 1.5 GB
policy measures they publish to inform policy makers. It can also be dataset of 18 million lines of events (such as keystrokes, code changes,
applied to economic metrics for business-related purposes and analysing error messages and actual coding snapshots). An in-depth automated
customer insights. exploration of each student’s coding strategies summarised by this
Evidence has shown that web search queries “…can be useful leading mixture of structured and text data was compared with those of other
indicators for subsequent consumer purchases in situations where students. The author discovered that error rates progressed in an ‘inverse
consumers start planning purchases significantly in advance of their parabolic shape’. This means that, initially, students made a lot of
actual purchase decision” (Choi & Varian, 2012). For instance, search mistakes, but they demonstrated that they were able to learn from them
engine data related to housing search enquiries has been shown to be a through problem-solving and progressed until they had completed their
more accurate predictor of house sales in the next quarter than the assignment. Although this is a small-scale study and it is not possible to
forecasts provided by real estate economists (Wu & Brynjolfsson, 2013). make any claims about statistical significance, it suggests that using a
Web search queries have also been successfully employed to improve the sophisticated TM application might lead to a better understanding of
predictability of motor vehicle demand and holiday destinations (Choi & students’ coding styles and sophisticated skills such as problem-solving.
Varian, 2012). These are applications of the terms attributed to Choi and An extensive use of the recent developments in NLP has also been
Varian’s – ‘contemporaneous forecasting’ or ‘nowcasting’, because they employed to automatically detect secondary students’ mental models in
can help in ‘predicting the present’, rather than the future (Choi & Varian, order to gain a better understanding of their learning processes. In an
2012). experiment students were asked to write short paragraphs about the
The use of predictive modelling has also been adapted by online human circulatory system in order to recall knowledge about the topic.
retailers to gain customer insights. Amazon and Netflix Using an intelligent tutoring system (MetaTutor) that teaches students
recommendations, for example, rely on predictive models of what book self-regulatory processes during learning of complex Science topics and
or film a customer might want to purchase on the basis of their history applying TM techniques, researchers explored which particular machine
of enquiries to the website or similar purchases made by other customers learning algorithm would enable them to accurately classify each student
(Einav & Levin, 2014). In general, online advertising and marketing tends in terms of their content knowledge (Rus & Azevedo, 2009). Mental
to rely on automated predictive algorithms that target customers who models represent an expanding field of research among cognitive
might be interested in responding to offers. psychologists and are aimed at better understanding how well an
Predictive modelling based on text data extends well beyond the individual organises content in meaningful ways. TM allows researchers to
online world. One of the most famous applications is the development of undertake analysis that can reveal inaccuracies and omissions that are
algorithms that make use of text data contained in different forms of crucial for deep understanding and application of course material, thus
communication (e.g., mobile texts and emails) to detect terrorist threats informing improvements in course design.10
and to identify fraudulent behaviour in healthcare and financial services A number of systems using TM have been developed for automated
(Einav & Levin, 2014). marking of essays and short, free text responses (for an example of the
latter see Sukkarieh et al., 2003). Some of the most widely used
automated essay marking systems available in the market include: Project
Essay Grader, Intelligent Essay Assessor, E-rater, Criterion, IntelliMetric,
Applications of Text Mining in education
MY Access and Bayesian Essay Test Scoring System. They have been
The benefits offered by the interaction of text and other data analytics in developed to reduce time and cost and improve reliability and
improving learning processes are already being valued by education generalisability of the process of assessment in low-stakes classroom
practitioners as well as by learners themselves. tests, as well as for large-scale assessment such as national standardised
The first example is the implementation of an experimental real-time examinations. The accuracy and reliability of these automated systems
case study in a business course. Lecturers made use of internet-based have been investigated by educational researchers in the last fifteen
software to facilitate written communication among students, teachers years. Along with the benefits of using TM, some of its disadvantages
and the case organisation. In this way, it was possible to gather a large such as the lack of human interaction and the need for a large corpus of
quantity of text data containing all the email communication among sample texts to train the system, have also been reported (Dikli, 2006).
students and the organisation involved in the case study. Applying simple Automated essay marking systems do not really understand the texts as
text analytics on real-time written communication, such as counting of
specific words, researchers found that, by the end of this experimental
10. For more details on mental model assessment in education see
teaching approach, students had increased their understanding of a live http://mentalmodelassessment.org/

42 | R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5
humans do, so it is not possible to affirm that they emulate the human analyses in order to get better insights on assessment. For example,
marking process. Notwithstanding, automated essay marking systems Worsley & Blikstein (2011) examined students’ dialogues along with
show high agreement rates with human markers; and their supporters other qualitative and quantitative data to develop predictors for student
advocate that the main role of these systems today is not to replace expertise in the area of Engineering design. By leveraging the tools of
teachers and assessors, but to assist them, incorporating these systems as machine learning, NLP, speech analysis and sentiment extraction, the
a supplementary marker, especially in large-scale writing assessments authors identified a number of distinguishing factors of learners at
(Monaghan & Bridgeman, 2005; Kersting, Sherin & Stigler, 2014). different levels of expertise. According to the study, these kinds of
A particular example of automated essay marking is the tool findings motivate further research in this field and the development of a
developed by a team of researchers at Maastricht University to stimulate new paradigm for the evaluation of learner knowledge construction.
students to become active and collaborative learners. It has been used in
Statistics courses to assess students on their understanding of course
content. It makes use of advanced NLP and Latent Semantic Analysis Discussion
algorithms that can be used in automatic marking of the texts. Mining
students’ essays, researchers were easily able to automatically The key advantage provided by TM is the opportunity to exploit text
discriminate between the reference book chapter text and the documents records, on a very large scale. In this article we have briefly described the
of the students. However, it is less clear whether this tool is able to techniques of TM and some of its applications.
discriminate students from one another (Imbos & Ambergen, 2010). TM has a variety of potential applications in the field of education.
Despite its weaknesses, marking essays automatically continues to In formative and summative assessment, for instance, it could be used to
attract the attention of schools, universities, assessment organisations, understand trends in vocabulary usage over time and the use of spelling
researchers and educators. Although it might be difficult for these and punctuation. To date, these applications have been carried out by
systems to supersede human markers, TM can be employed to support teachers and assessment experts without using advanced techniques
human markers as a second or third marker (see, for instance, Landauer, such as TM, but TM allows the possibility of implementing these
2003 and Attali & Burstein, 2006). The Centre for Digital Education (CDE) applications on a more comprehensive scale. The developments in NLP
reported that, in the USA, around $20 billion was spent on public allow educational professionals to analyse the language structure of a
education in Information Technology in 2012, with an increase of 2 per vast amount of text documents in just a few minutes, plus the ongoing
11
cent from the previous year . The awareness of the potential of TM and developments in this field could result in an increase in the accuracy of
DM in, for instance, formative assessment, has led McGraw-Hill to the findings.
develop two different tools, Acuity Predictive Assessment and Acuity The availability of novel data could lead, at least in principle, to novel
Diagnostic Assessment, aimed at informing teachers and learners about measurement and research designs to address old and new research
their performance and how to improve it (CDE, 2014). questions. However, working with very large, rich and new kind of
These tools can be employed for formative assessment. Predictive datasets, it might not be straightforward to figure out what questions the
modelling of text data can provide an early indication of how students data could answer accurately. Asking the right question might be more
will perform on a standardised test. It allows assessment of the gap important now than ever (Einav & Levin, 2014). Exploiting large text
between what students are expected to know and what they actually datasets without a proper research question might lead to a significant
know. It can also provide evidence regarding which area of the syllabus waste of resources.
they have to focus on to improve their performance (West, 2012). Also, More heterogeneous and in-depth data could allow researchers to
more advanced analysis could be informative to teachers about which move from methods that allow the estimation of average relationships in
particular teaching techniques are more efficient for specific students and the population towards differential effects for specific subpopulations of
the best ways to tailor the learning approach to them (Bienkowski, Feng interest. This could mean looking at particular categories of students,
& Means, 2012). defined by their specific background, level of achievement and other
Students’ reading comprehension, for example, has been the object of characteristics of interest. TM is an expanding field with the potential to
a study based on the use of intelligent tutoring software. The analysis of support innovative areas of research. With careful research designs and
data such as students’ reading mistakes and word knowledge gathered proper methods, TM could make a salient contribution to educational
through a speech recognition tool showed that re-reading an old story research.
helped pupils learn half as many words as reading a new story (Beck &
Mostow, 2008). An online tool called WebQuest provides activities
References
designed for teachers to train pupils in skills such as information
Acerbi, A., Lampos, V., Garnett, P., & Bentley, R. A. (2013). The expression of
acquisition and evaluation of online materials. Students who have
emotions in 20th century books. PloS one, 8(3), e59030.
experienced these kinds of activities have reportedly enjoyed the
Ananiadou, S., Chruszcz, J., Keane, J., McNaught, J., & Watry, P. (2005). The
collaborative and interactive nature of the activities (Perkins & McKnight,
National Centre for Text Mining: Aims and Objectives. Ariadne, 42. Retrieved
2005).
from: http://www.ariadne.ac.uk/issue42/ananiadou.
Predictive modelling in educational assessment has been mainly based
Anawis, M. (2014). Text Mining: The Next Data Frontier. Scientific Computing.
on numeric data (e.g., days of truancy, overall grades and disciplinary
Retrieved from: http://www.scientificcomputing.com/blogs/2014/01/text-
problems). However, text data could be used to enable more in-depth mining-next-data-frontier.

Askitas, N., & Zimmerman, K. F. (2009). Google Econometrics and

Unemployment Forecasting. Applied Economics Quarterly (formerly:
11. Centre for Digital Education: http://www.centerdigitaled.com/research/ Konjunkturpolitik), Duncker & Humblot, Berlin, 55(2), 107–120.

R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5 | 43
Attali, Y. & Burstein, J. (2006). Automated Essay Scoring With e-rater® V.2. The Jockers, M. (2013). Macroanalysis: Digital Methods and Literary History.
Journal of Technology, Learning, and Assessment, 4(3). Champaign: University of Illinois Press.

Ettredge, M., Gerdes, J., & Karuga. G. (2005). Using web-based search data to Kersting, N. B., Sherin, B. L. & Stigler, J. W. (2014). Automated Scoring of Teachers’
predict macroeconomic statistics. Communications of the ACM, 48(11), 87–92. Open-Ended Responses to Video Prompts: Bringing the Classroom-Video-
Analysis Assessment to Scale. Educational and Psychological Measurement,
Ball, P. (2013, 21 March). Text mining uncovers British reserve and US emotion.
74(6), 950–974.
Nature. Retrieved from: http://www.nature.com/news/text-mining-uncovers-
british-reserve-and-us-emotion-1.12642. Landauer, T. K. (2003). Automatic Essay Assessment, Assessment. Education:
Principles, Policy & Practice, 10(3), 295–308.
Beck, J., & Mostow, J. (2008). How Who Should Practice: Using Learning
Decomposition to Evaluate the Efficacy of Different Types of Practice for Lazer, D., Kennedy, R., King, G. & Vespignani, A. (2014, 14 March). The Parable of
Different Types of Students. In B. Woolf, E. Aïmeur, R. Nkambou & S. Lajoie Google Flu: Traps in Big Data Analyisis. Science, 343(6176), 1203–1205.
(Eds.), Intelligent Tutoring Systems, (5091), 353–362. Springer Berlin Heidelberg. Retrieved from: http://www.sciencemag.org/content/343/6176/1203.

Bienkowski, M., Feng, M., & Means, B. (2012). Enhancing Teaching and Learning Lohr, S. (2012, 11 February). The Age of Big Data. The New York Times. Retrieved
Through Educational Data Mining and Learning Analytics: An Issue Brief. U.S. from: http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-
Department of Educational, Office of Educational Technology. Retrieved from: in-the-world.html?pagewanted=all&_r=0.
http://tech.ed.gov/wp-content/uploads/2014/03/edm-la-brief.pdf
Manning, C.D., & Schütze, H. (1999). Foundations of Statistical Natural Language
Blikstein, P. (2011). Using learning analytics to assess students' behavior in open- Processing. Boston: MIT press.
ended programming tasks. Paper presented at the Proceedings of the 1st
Monaghan, W., & Bridgeman, B. (2005). E-rater as a Quality Control on Human
international conference on learning analytics and knowledge.
Scorer. ETS RD Connections. Retrieved from: http://www.ets.org/Media/
Centre for Digital Education (CDE) (2013). Big Data, Big Expectations. The Promise Research/pdf/RD_Connections2.pdf.
and Practicability of Big Data for Education. The Centre for Digital Education.
Och, F.J. (2005). Statistical Machine Translation: Foundations and Recent Advances.
Retrieved from: http://www.centerdigitaled.com/paper/259374351.html
Retrieved from: http://www.mt-archive.info/MTS-2005-Och.pdf.
Ceron, A., Curini, L., Iacus, S. M., & Porro, G. (2014). Every tweet counts? How
Perkins, R., & McKnight, M.L. (2005). Teachers' attitudes toward WebQuests as a
sentiment analysis of social media can improve our knowledge of citizens’
method of teaching. Computers in the Schools, 22(1–2), 123–133.
political preferences with an application to Italy and France. New media and
society, 16(2), 340–358. Polgreen, P.M., Chen, Y., Pennock, D. M., Nelson, F. D., & Weinstein, R. A. (2008).
Using Internet Searches for Influenza Surveillance. Clinical infectious diseases,
Choi, H., & Varian, H. (2012). Predicting the Present with Google Trends. Economic
47(11), 1443–1448.
Record, 88(1), 2–9.
Rogers, S. (2011, 28 July). Data journalism at the Guardian: what is it and how do
D’Amuri, F., & Marcucci, J. (2010). “Google it!” Forecasting the US unemployment
we do it? The Guardian Datablog. Retrieved from:
rate with a Google job search index. ISER Working Paper Series 2009–32.
http://www.theguardian.com/news/datablog/2011/jul/28/data-journalism
Institute for Social & Economic Research (ISER).
Rus, V., Lintean, M., & Azevedo, R. (2009). Automatic Detection of Student Mental
Dhawan, V., & Zanini, N. (2014). Big data and social media analytics. Research
Models during Prior Knowledge Activation in MetaTutor. International Working
Matters: A Cambridge Assessment Publication, 18, 36–41.
Group on Educational Data Mining. Paper presented at the International
Dikli, S. (2006). An overview of automated scoring of essays. The Journal of Conference on Educational Data Mining (EDM) (2nd, Cordoba, Spain, July 1–3,
Technology, Learning and Assessment, 5(1). 2009).

Dubner, S. J., & Levitt, S. D. (2005). Freakonomics: A Rogue Economist Explores the Sukkarieh‚ J. Z., Pulman, S. G. & Raikes, N. (2003). Auto-marking: using
Hidden Side of Everything. New York City: William Morrow. computational linguistics to score short‚ free-text responses. Paper presented
at the Proceedings of 29th International Association for Educational
Einav, L., & Levin, J. D. (2014). The Data Revolution and Economic Analysis.
Assessment (IAEA) Annual Conference.
Innovation Policy and the Economy, 14(1), 1–24.
Suhoy, T. (2009). Query indices and a 2008 downturn: Israeli data. Discussion
Feinberg, J. (2010). Wordle, in J. Steele & N. Iliinsky (Eds.) Beautiful visualization,
paper No. 2009.06. Research Department, Bank of Israel.
Sebastopol: O'Reilly Media, Inc.
Theroux, J.M. (2009). Real-time case method: analysis of a second
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L.
implementation. Journal of Education for Business, 84(6), 367–373.
(2009). Detecting influenza epidemics using search engine query data. Nature,
457(7232), 1012–1014. United Nations (UN) (2014). Mining Indonesian Tweets to Understand Food Price
Crises. UN Global Pulse Report. Retrieved from: http://www.unglobalpulse.org/
Guzman, G. (2011). Internet search behavior as an economic forecasting tool: The
sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-
case of inflation expectations. Journal of Economic and Social Measurement,
Crises%20copy.pdf.
36(3), 119–167.
West, D.M. (2012). Big Data for Education: Data Mining, Data Analytics, and Web
Hughes, J.M., Foti, N. J., Krakauer, D. C., & Rockmore D. N. (2012). Quantitative
Dashboards. Retrieved from: http://www.brookings.edu/research/
patterns of stylistic influence in the evolution of literature. Proceedings of the
papers/2012/09/04-education-technology-west.
National Academy of Sciences, 109(20), 7682–7686.
Worsley, M., & Blikstein, P. (2011). Using machine learning to examine learner's
Huijnen, P., Laan, F., de Rijke, M., & Pieters, T. (2014). A Digital Humanities
engineering expertise using speech, text, and sketch analysis, in Paper
Approach to the History of Science. In A. Nadamoto, A. Jatowt, A. Wierzbicki &
presented at the 41st Annual Meeting of the Jean Piaget Society (JPS).
J. Leidner (Eds.), Social Informatics, (8359), 71–85. Springer Berlin Heidelberg.
University of California, Berkeley.
Imbos, T., & Ambergen, T. (2010). Text analytic tools for the cognitive diagnosis of
Wu, L., & Brynjolfsson, E. (2013). The future of prediction: How Google searches
student writings. Paper presented at the Proceedings of the ICOTS8,
foreshadow housing prices and sales, in S. M. Greenstein, A. Goldfarb and C.
International Conference on Teaching Statistics.
Tucker (Eds.) Economics of Digitization, Chicago: University of Chicago Press.
JISC (2008). Text Mining Briefing Paper. Joint Information Systems Committee.
Retrieved from: http://jisc.ac.uk/media/documents/publications/
bptextminingv2.pdf.

44 | R E S E A R C H M AT T E R S : I S S U E 1 9 / W I N T E R 2 0 1 5

Text Data Management and Analysis PDF
100% (3)
Text Data Management and Analysis PDF
531 pages
G23 Parts Manual
No ratings yet
G23 Parts Manual
43 pages
Computer Architecture and Organisation Notes
100% (1)
Computer Architecture and Organisation Notes
18 pages
Digital Portable X-Ray Systems: Manual Ver1.7
100% (1)
Digital Portable X-Ray Systems: Manual Ver1.7
47 pages
TEXT ANALYTICS With Python
No ratings yet
TEXT ANALYTICS With Python
37 pages
Jo (2019) - Text Mining
No ratings yet
Jo (2019) - Text Mining
376 pages
FDS-Content Beyond Syllabus
No ratings yet
FDS-Content Beyond Syllabus
15 pages
Uploadsh 046 005807 00 Passport 8 12 Service Manual (FDA) 2 0
No ratings yet
Uploadsh 046 005807 00 Passport 8 12 Service Manual (FDA) 2 0
106 pages
Operating Systems: Assignment 1: System Calls and Scheduling
No ratings yet
Operating Systems: Assignment 1: System Calls and Scheduling
6 pages
React Front End Developer Resume
No ratings yet
React Front End Developer Resume
2 pages
Text Analytics
No ratings yet
Text Analytics
9 pages
Text Mining
No ratings yet
Text Mining
10 pages
Simad University: Chapter 7: Text and Web Mining
No ratings yet
Simad University: Chapter 7: Text and Web Mining
6 pages
Text Mining: A Burgeoning Technology For Knowledge Extraction
100% (1)
Text Mining: A Burgeoning Technology For Knowledge Extraction
5 pages
Effective Classification of Text
No ratings yet
Effective Classification of Text
6 pages
UNIT - 1 Text Mining
No ratings yet
UNIT - 1 Text Mining
18 pages
Submitted To: Submitted By:: Text Mining
No ratings yet
Submitted To: Submitted By:: Text Mining
15 pages
Text Analytics and Text Mining Overview
No ratings yet
Text Analytics and Text Mining Overview
16 pages
Cracking More Password Hashes With Patterns
No ratings yet
Cracking More Password Hashes With Patterns
28 pages
Text Mining: Techniques and Its Application: December 2014
100% (1)
Text Mining: Techniques and Its Application: December 2014
5 pages
Apply Quality Standards
No ratings yet
Apply Quality Standards
71 pages
Case Study On Text Mining
No ratings yet
Case Study On Text Mining
8 pages
Stata Finite Mixture Models Reference Manual: Release 16
No ratings yet
Stata Finite Mixture Models Reference Manual: Release 16
138 pages
Module 4
No ratings yet
Module 4
63 pages
Business Intelligence, Analytics, and Data Science: A Managerial Perspective
No ratings yet
Business Intelligence, Analytics, and Data Science: A Managerial Perspective
73 pages
Leca 102
No ratings yet
Leca 102
70 pages
Text and Sentiment Analysis
No ratings yet
Text and Sentiment Analysis
41 pages
Jeffrey Roy Auth. From Machinery To Mobility Government and Democracy in A Participative Age
No ratings yet
Jeffrey Roy Auth. From Machinery To Mobility Government and Democracy in A Participative Age
137 pages
CVPR 2022 MainConference ProgramGuide Final
No ratings yet
CVPR 2022 MainConference ProgramGuide Final
70 pages
Business Intelligence and Data Mining: by Dr. Atanu Rakshit Email: Atanu - Rakshit@iimrohtak - Ac.in
No ratings yet
Business Intelligence and Data Mining: by Dr. Atanu Rakshit Email: Atanu - Rakshit@iimrohtak - Ac.in
122 pages
Survey Data Analysis
No ratings yet
Survey Data Analysis
17 pages
Data Mining in Business Intelligence
No ratings yet
Data Mining in Business Intelligence
63 pages
SEPM QB (2) Merged
No ratings yet
SEPM QB (2) Merged
92 pages
Text Data Management
No ratings yet
Text Data Management
39 pages
Text Mining in Big Data Analytics
No ratings yet
Text Mining in Big Data Analytics
34 pages
Bda Module 5
No ratings yet
Bda Module 5
39 pages
WINSEM2023-24 BCSE206L TH VL2023240501787 2024-02-19 Reference-Material-I
No ratings yet
WINSEM2023-24 BCSE206L TH VL2023240501787 2024-02-19 Reference-Material-I
42 pages
10 - Session 10 - Text Analytics, Text Mining and Sentiment Analysis
No ratings yet
10 - Session 10 - Text Analytics, Text Mining and Sentiment Analysis
36 pages
CH 06 PPTaccessible
No ratings yet
CH 06 PPTaccessible
71 pages
AFM - Module 4
No ratings yet
AFM - Module 4
48 pages
Lec 11 - DW
No ratings yet
Lec 11 - DW
32 pages
Karaoke Tutorial: Part 3: Filters and Tags
No ratings yet
Karaoke Tutorial: Part 3: Filters and Tags
38 pages
Lec 10 Descriptive Analysis
No ratings yet
Lec 10 Descriptive Analysis
18 pages
Single-Chip Microcontrollers (AMCU) : in Brief - .
No ratings yet
Single-Chip Microcontrollers (AMCU) : in Brief - .
31 pages
Lec 13-ETL
No ratings yet
Lec 13-ETL
18 pages
PYTHON
No ratings yet
PYTHON
33 pages
Text and Web Mining
No ratings yet
Text and Web Mining
44 pages
3510-6510 Ch5
No ratings yet
3510-6510 Ch5
73 pages
Lecture Adversarial Searches
No ratings yet
Lecture Adversarial Searches
25 pages
BI-LEc 3
No ratings yet
BI-LEc 3
24 pages
Text Mining and Its Business Applications
No ratings yet
Text Mining and Its Business Applications
17 pages
Text Mining
No ratings yet
Text Mining
16 pages
BI Lec 6 - Hypothesis Testing
No ratings yet
BI Lec 6 - Hypothesis Testing
22 pages
Chapter 1: Text Mining: Big Data Analytics (15CS82)
No ratings yet
Chapter 1: Text Mining: Big Data Analytics (15CS82)
12 pages
Module 1 Part1
No ratings yet
Module 1 Part1
54 pages
Text Mining
No ratings yet
Text Mining
13 pages
Text Mining
No ratings yet
Text Mining
25 pages
BI-Lec 8-9 Decision Making
No ratings yet
BI-Lec 8-9 Decision Making
35 pages
Dept. of ISE, Acit 1
No ratings yet
Dept. of ISE, Acit 1
12 pages
43.IJCSCN PreprocessingTechniquesforTextMining Ilamathi Nithya
No ratings yet
43.IJCSCN PreprocessingTechniquesforTextMining Ilamathi Nithya
11 pages
IMTC634 - Data Science - Chapter 7
No ratings yet
IMTC634 - Data Science - Chapter 7
24 pages
Text Mining
No ratings yet
Text Mining
12 pages
Introduction To ROC Analysis: Pattern Recognition Letters June 2006
No ratings yet
Introduction To ROC Analysis: Pattern Recognition Letters June 2006
16 pages
SPM Lecture 20
No ratings yet
SPM Lecture 20
21 pages
Comparative Analysis of Text Mining Techniques For
No ratings yet
Comparative Analysis of Text Mining Techniques For
12 pages
Bcse206l FDS Module-4 Smsatapathy
No ratings yet
Bcse206l FDS Module-4 Smsatapathy
50 pages
Business Intelligence and Anlytics UNIT 2
No ratings yet
Business Intelligence and Anlytics UNIT 2
35 pages
Rekabentuk Dan Analisis Produk
No ratings yet
Rekabentuk Dan Analisis Produk
7 pages
6:12 Volt Lead Acid Battery Charger - Power Supply Circuits
No ratings yet
6:12 Volt Lead Acid Battery Charger - Power Supply Circuits
3 pages
Screenshot 2024-06-04 at 12.02.17 AM
No ratings yet
Screenshot 2024-06-04 at 12.02.17 AM
23 pages
What Is Text Mining
No ratings yet
What Is Text Mining
9 pages
BI - Lec 7 - Case Study 2
No ratings yet
BI - Lec 7 - Case Study 2
11 pages
TextAnalyticsApplicationofTextMining2021 31122023 071845am 1 10122024 061001pm
No ratings yet
TextAnalyticsApplicationofTextMining2021 31122023 071845am 1 10122024 061001pm
7 pages
Lab Manual 01 (Introduction)
No ratings yet
Lab Manual 01 (Introduction)
5 pages
(IJCST-V6I4P5) :S.Sheela, T.Bharathi
No ratings yet
(IJCST-V6I4P5) :S.Sheela, T.Bharathi
7 pages
Module 5.4 LOGIC
No ratings yet
Module 5.4 LOGIC
11 pages
Text Mining
No ratings yet
Text Mining
8 pages
1-What Is Text Mining - IBM
No ratings yet
1-What Is Text Mining - IBM
5 pages
SPM Lecture 19
No ratings yet
SPM Lecture 19
27 pages
Tapo C310 2.0&2.20&2.26&2.28 - Datasheet
No ratings yet
Tapo C310 2.0&2.20&2.26&2.28 - Datasheet
8 pages
Text Mining Introduction
No ratings yet
Text Mining Introduction
6 pages
Information Retrieval
No ratings yet
Information Retrieval
3 pages
Welcome To The Azure Hands On Lab
No ratings yet
Welcome To The Azure Hands On Lab
8 pages
Chapter 5 Predictive Analytics II Text J Web J and Social Media Analytics
No ratings yet
Chapter 5 Predictive Analytics II Text J Web J and Social Media Analytics
5 pages
Lab11 WebEngineering
No ratings yet
Lab11 WebEngineering
5 pages
DMTerm Paper
No ratings yet
DMTerm Paper
4 pages
Text Mining: 2 History
No ratings yet
Text Mining: 2 History
8 pages
Text Mining Techniques Applications and Issues2
No ratings yet
Text Mining Techniques Applications and Issues2
5 pages
Text Mining 2
No ratings yet
Text Mining 2
4 pages
A Detailed Study On Text Mining Techniques
No ratings yet
A Detailed Study On Text Mining Techniques
4 pages
ICT 7 2nd PT Wanswer
No ratings yet
ICT 7 2nd PT Wanswer
2 pages
Bisf 2204 - Computer Forensics
No ratings yet
Bisf 2204 - Computer Forensics
2 pages
Formatted Cloud Resume
No ratings yet
Formatted Cloud Resume
3 pages
Ansbk Status
No ratings yet
Ansbk Status
2 pages
Ac & DC Ammeters: Fixed Range & Selectable Range (16 Ranges in 1 Meter)
No ratings yet
Ac & DC Ammeters: Fixed Range & Selectable Range (16 Ranges in 1 Meter)
2 pages
Heuristic
No ratings yet
Heuristic
2 pages
Automation Assignment
No ratings yet
Automation Assignment
2 pages
Sonica Eswar Resume
No ratings yet
Sonica Eswar Resume
1 page
Main Features Your Camera Should Have
No ratings yet
Main Features Your Camera Should Have
1 page
Chandigarh Administration Chandigarh Police: JAN SAMPARK: Information Gateway of Chandigarh Administration: 1 of 1
No ratings yet
Chandigarh Administration Chandigarh Police: JAN SAMPARK: Information Gateway of Chandigarh Administration: 1 of 1
1 page
Hall Ticket
No ratings yet
Hall Ticket
1 page
Information Extraction: Fundamentals and Applications
From Everand
Information Extraction: Fundamentals and Applications
Fouad Sabry
No ratings yet
Data Mining: Concepts, Fundamentals And Applications
From Everand
Data Mining: Concepts, Fundamentals And Applications
Enrico Guardelli
No ratings yet
Concept Mining: Fundamentals and Applications
From Everand
Concept Mining: Fundamentals and Applications
Fouad Sabry
No ratings yet
Text Mining: Fundamentals and Applications
From Everand
Text Mining: Fundamentals and Applications
Fouad Sabry
No ratings yet

Research Paper Text Mining

Uploaded by

Research Paper Text Mining

Uploaded by

This is a single article from Research Matters: A Cambridge Assessment publication. http://www.cambridgeassessment.org.

Text Mining: An introduction to theory and some

Introduction academic community and to provide leadership in its use in learning,

Information extraction also allows us to link names and entities (e.g.,

Data Mining 0.0100%

finally ready to be mined. In this context ‘mining’ is a synonym of Years

Another innovative piece of research, carried out by Matthew Jockers

Mardi Figure 3: Historical periods of positive and negative moods

information from text documents such as social media posts. Sentiment

However, we need to be careful in drawing interpretations from text 80

data. For instance, it has been reported in a post written in a language

may not be meaningful or not always straightforward to interpret and the

life. Automated systems (known as eCRM – Customer Relationship

Askitas, N., & Zimmerman, K. F. (2009). Google Econometrics and

You might also like