Method Section-Seminar Paper

Method

Text mining or text analytics (TM/TA) examines large volumes of unstructured text (corpus),
aiming to extract new information, discover context, identify linguistic motifs, or transform the
text into a structured data format leading to derived quantitative data that can be further
analyzed.

(Dinov, 2018, p. 660)

Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine
learning, statistics, and computational linguistics. A substantial portion of information is stored as
text such as news articles, technical papers, books, digital libraries, email messages, blogs, and web
pages.

Jiawei Han, Jian Pei, Micheline Kamber, page 596

Data Mining: Concepts and Techniques

Text mining (also known as text data mining and knowledge discovery in textual databases) is the
process of deriving novel information from a collection of texts (also known as a corpus). By “novel
information,” we mean associations, hypotheses, or trends that are not explicitly present in the text
sources being...

Robert Nisbet, John Elder, Gary D. Miner 2009, page 174 Handbook of Statistical Analysis and Data
Mining Applications

...and machine learning systems could be used to automate parts of the screening process. Text mining
uses natural language processing to discover knowledge and structure from unstructured data (i.e.,
text) (O’Mara-Eves et al. 2015). O’Mara-Eves et al. found that text mining methods for screening
have modeled human...

Margaret J. Foster, Sarah T. Jewell, 2022, page 177 Piecing Together Systematic Reviews and Other
Evidence Syntheses: A Guide for Librarians

Rowman & Littlefield Publishers

Text mining refers to mining the content of unstructured data, in the sense that this data source
may not reside in a structured database but is more likely in an unstructured file. In this respect,
text mining refers to discovering new insights by...

Rajiv Sabherwal, Irma Becerra-Fernandez, 2013, page 87

Business Intelligence: Practices, Technologies, and Management, Wiley

...(metadata) from text files such as consumer comments, e-mail conversations, physician or
technician notes, work orders, etc. Basically, text mining creates structured data out of
unstructured data. Text mining is a very powerful technique to show during an envisioning process,
as many business stakeholders have struggled... (Schmarzo, 2016, p. 122)

...techniques such as latent semantic analysis (§ 14.3) have roots in information retrieval. Text
mining is sometimes used to refer to the application of data mining techniques, especially
classification and clustering, to text. While there is no clear distinction between text mining and
natural language processing (nor between data mining and machine learning), text mining is typically
less concerned with linguistic structure, and more interested in fast, scalable algorithms.
Jacob Eisenstein, 2019, page 5

Introduction to Natural Language Processing MIT Press

‘Text mining’ generally refers to the process of deriving information and discovering associations
from unstructured textual data, whereas ‘data mining’ searches for patterns in structured data
(e.g., database). Utilizing both approaches, referred to as ‘dual mining,’ one can seek to extract
concepts from the discovered...

...and may not be available for purchase worldwide. Sophisticated text mining packages look likely
to offer opportunities for information specialists to develop new approaches to gathering,
grouping and interrogating large numbers of bibliographic records and other documents. Sensitive
bibliographic database searches from a...

Paul Levay, Jenny Craven, 2019, page 165.

Systematic Searching: Practical ideas for improving results

American Library Association

...more necessary to properly utilize this vast source of knowledge. Text mining, therefore,
corresponds to the extension of the data mining approach to textual data and is concerned with
various tasks, such as extraction of information implicitly contained in collection of documents, or
similarity-based structuring.

Arun K. Pujari, 2001, page

Universities Press,

Text mining or text analytics “is an artificial intelligence technology that uses natural language
processing (NLP) to [normalize unstructured] data for analysis or to drive machine learning (ML)
algorithms” (“What Is Text Mining, Text Analytics and Natural Language Processing?”  2019).

Lynn Silipigni Connaway, Marie L. Radford, 2021, page 216

ABC-CLIO, Research Methods in Library and Information Science, 7th Edition

Text mining is the use of machine learning and data mining techniques on textual data. This data
consists of natural language documents that can be more or less structured, ranging from completely
unstructured plain text to documents with various kinds of tags containing machine-readable
semantic information.

Claude Sammut, Geoffrey I. Webb, 2011, page 398

Springer, Encyclopedia of Machine Learning

Text mining involves three steps: information retrieval (i.e., finding relevant materials to be
mined), information extraction (identifying specific information from the materials found), and
data mining (finding meaningful associations among the units of information extracted) (Meystre
et al., 2008). Meystre...

Kenneth Bordens, Bruce Barrington Abbott, 2014, page 246

McGraw-Hill Education, Ebook: Research Design and Methods: A Process Approach


Text mining is also used to discover hidden information (‘descriptive method’), for example new
research fields (in filed patents), or information to be added to marketing databases on customers'
areas of interest and plans. It can even be used by a business wishing to communicate with its
customers in the vocabulary... (Tufféry, 2011, p. 629)

This general knowledge can then be applied in newly encountered situations. When the data consist
of texts, then data mining is often called text mining. The treatment of intelligent data analysis here
follows the line of reasoning in Ref. (2).

...importance of a given word relative to other words in the document and in the corpus. It’s a
commonly used representation scheme for information-retrieval systems, for extracting relevant
documents from a corpus for a given text query. The intuition behind TF-IDF is as follows: if a word
w appears many times in a...

Sowmya Vajjala, Bodhisattwa Majumder, et al., 2020, page

O'Reilly Media, Practical Natural Language Processing: A Comprehensive Guide to Building …
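
The TF-IDF intuition quoted above can be made concrete with a small sketch. This is only one common variant (relative term frequency multiplied by log(N/df)); the function name `tf_idf` and the toy corpus are assumptions for illustration and are not taken from the cited book.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: weight} dict per tokenized document.

    weight = (count of word in doc / doc length) * log(N / df),
    where df is the number of documents that contain the word.
    """
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(corpus)
```

A word that occurs in every document ("the") receives weight zero, while a word confined to one document ("mat") outweighs a word shared by several ("cat").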

Text mining is a specialized form of data mining. While data mining primarily focuses on analyzing
structured numerical data, text mining interprets words and concepts in context.

Efraim Turban, Carol Pollard, Gregory Wood, 2021, page

Wiley, Information Technology for Management: Driving Digital Transformation to …

In general, the text mining process converts unstructured text into numerical data and applies data
mining techniques. For the Triad example, we converted the text documents into a document-term
matrix and then applied hierarchical clustering to gain insight on the different types of comments.

Jeffrey D. Camm, James J. Cochran, et al., 2020, page

Cengage Learning, Business Analytics
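
The two steps described in the quotation (building a document-term matrix, then clustering hierarchically) might look roughly like the sketch below. The Triad data is not available here, so the comments are invented; single linkage with Euclidean distance is an assumption as well, since the source does not state which linkage was used.

```python
from collections import Counter
from itertools import combinations
import math

def document_term_matrix(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[w] for w in vocab])
    return vocab, rows

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(rows, n_clusters):
    """Single-linkage agglomerative clustering over row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: min(
                euclidean(rows[i], rows[j])
                for i in clusters[p[0]] for j in clusters[p[1]]
            ),
        )
        clusters[a] += clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Hypothetical customer comments (not the Triad data).
comments = [
    "seat was cramped",
    "seat was small and cramped",
    "crew was friendly",
    "friendly helpful crew",
]
vocab, dtm = document_term_matrix(comments)
clusters = agglomerative(dtm, n_clusters=2)
```

Stopping at two clusters separates the seat comments from the crew comments; stopping earlier or later corresponds to cutting the hierarchy at a different height.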

...and can be integrated into the data architecture with ease. The primary focus of the text mining
algorithms are to process text based on user-defined business rules and extract data that can be
used in classifying the text for further data exploration purposes. Semantic technologies play a vital
role in integrating the...

Krish Krishnan, 2013, page 202.

Data Warehousing in the Age of Big Data, Elsevier Science

• Text Mining: The process of extracting high-quality information from text is known as Text
Mining. It is also termed as text data mining. Text recognition, customer care, personalized bots, and
sentiment analysis are examples of NLP applications in text mining.

V.D. Ambeth Kumar, S. Malathi, V.E. Balas, 2021, page

IOS Press, Smart Intelligent Computing and Communication Technology

Techniques and methods of text data mining

In text clustering, documents are represented using a bag-of-words representation [8].

Several tasks approached by using text mining techniques, like text categorization,
document clustering, or information retrieval, operate on the document level, making use of
the so-called bag-of-words model. Other tasks, like document summarization, information
extraction, or question answering, have to operate on the... (Kao & Poteet, 2007, p. 4).
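
As a minimal illustration of the bag-of-words model mentioned above, each document can be reduced to word counts with the word order thrown away. The tokenizer here (lowercasing, alphabetic tokens only) is a simplifying assumption; real pipelines usually also handle stop words, stemming and similar steps.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Map each word to its count; word order is discarded."""
    return Counter(tokenize(text))

# Hypothetical example documents.
docs = [
    "Text mining extracts information from text.",
    "Data mining finds patterns in structured data.",
]
bags = [bag_of_words(d) for d in docs]
```

Two texts containing the same words in a different order produce identical bags, which is exactly the information this representation gives up.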

Experiments have shown that a relatively small number of words is sufficient to capture the
topic of a cluster, and that properties of these terms, such as inverse document frequency,
remain reasonably stable as new documents are added to a cluster. Best clustering results,
in terms of misses versus false alarms, were...

from Natural Language Processing for Online Applications: Text Retrieval, Extraction …
by Peter Jackson, Isabelle Moulinier
John Benjamins Pub., 2007

Traditional document clustering based on bag-of-words leads often to excessively high
dimensionality and data sparsity. Topic modeling methods such as latent semantic analysis
(LSA) and latent Dirichlet allocation (LDA) can be applied to document clustering, but either
ignore word co-occurrence or suffer from... (Kamath, Liu, & Whitaker, 2019, p. 239)

...of similarity and difference of content throughout the collection. Two fundamental
techniques applicable here and provided by most text-mining software packages are text
classification (see Classification of Text, Automatic) and document clustering. Text
classification involves mapping each text to one or more...

from Encyclopedia of Language and Linguistics
by Keith Brown
Elsevier Science, 2005

...it means there is no need of human interference for clustering of documents. In text
document clustering, a group of words (texts) is used on a set of text documents for
discovering such text documents having with the given set of words (texts). Further such
discovered text documents for the given set of words (texts)... (Rao, 2021, p. 828)

Word clusters are called topics. Latent semantic indexing [32] is a spectral clustering
method, where each document is represented by a histogram of word counts over a
vocabulary of fixed size. The problems of polysemy and synonymy are considered.

from Neural Networks and Statistical Learning
by Ke-Lin Du, M. N. S. Swamy, page
Springer London, 2019

...algorithm to derive word senses from corpora. The clustering was carried out in word
space, a real-valued vector space in which each word in the training corpus represents a
dimension and the vectors represent co-occurrences with those words in the text. Sense
discovery is carried out by using one of two...

from The Oxford Handbook of Computational Linguistics
by Ruslan Mitkov, page 636
Oxford University Press, 2022
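
The "word space" described in the quotation can be sketched as co-occurrence counting: each word gets a vector whose dimensions are the other words of the corpus. The window size of two tokens and the toy sentences are assumptions for illustration; the quoted algorithm's actual settings are not given in the excerpt, and the sense-clustering step itself is not shown.

```python
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

# Hypothetical sentences in which "bank" appears in two senses.
sentences = [
    "the bank approved the loan".split(),
    "the river bank was muddy".split(),
]
vectors = cooccurrence_vectors(sentences)
```

The vector for "bank" mixes contexts from both senses ("approved", "river"); sense discovery would then cluster the individual occurrences rather than this pooled vector.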

TFIDF is a very common value representation for terms, but it is not necessarily
optimal. If someone describes mining a text corpus using bag of words it just
means they’re treating each word individually as a feature. Their values could be
binary, term frequency, or TFIDF, with normalization or without. (Provost & Fawcett,
2013, p. 256)
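
To illustrate the point that bag-of-words features can carry different values, the sketch below computes binary values, raw term frequency, and an L2-normalized variant for the same tokens. The function names are hypothetical and the example is not taken from Provost and Fawcett.

```python
import math
from collections import Counter

def binary_values(tokens):
    """1 for every word that occurs at least once."""
    return {w: 1 for w in set(tokens)}

def term_frequency(tokens):
    """Raw count of each word."""
    return dict(Counter(tokens))

def l2_normalize(values):
    """Scale the values so that the vector has unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in values.values()))
    return {w: v / norm for w, v in values.items()} if norm else dict(values)

tokens = "to be or not to be".split()
binary = binary_values(tokens)
tf = term_frequency(tokens)
tf_unit = l2_normalize(tf)
```

The feature set (the words) is identical in all three cases; only the value attached to each feature changes.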

...which means we identify parts of speech for each word. We retain nouns for this analysis
as they have shown to effectively summarize the content of text data—although this is
likely dependent on length and the context of the corpus (see Hoffman et al., 2018). As is
common in natural language processing, we weight...

from The Oxford Handbook of Social Networks
by Ryan Light, James Moody
Oxford University Press, Incorporated, 2020

...telling of conversational or writing topic than of linguistic style. Our meaning extraction
method begins by using a word-counting program that ranks all the content words in a
corpus by frequency of use. The most frequently occurring content words across all texts
are compiled into a LIWC dictionary, and their patterns...

from The Content Analysis Reader
by Klaus Krippendorff, Mary Angela Bock
SAGE Publications, 2009
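
The first step of the meaning extraction method quoted above (ranking content words by frequency) can be approximated as follows. This is not the LIWC program; the small function-word list and the example texts are assumptions made only to keep the sketch self-contained.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny list of function words to exclude.
FUNCTION_WORDS = {"the", "a", "an", "and", "of", "to", "in",
                  "is", "was", "i", "it", "my"}

def rank_content_words(texts, top_n=5):
    """Count content words across all texts and return the most frequent."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in FUNCTION_WORDS:
                counts[word] += 1
    return counts.most_common(top_n)

# Hypothetical example texts.
texts = [
    "My family is happy and my work is busy.",
    "Work was busy, and the family was happy.",
    "I love my family.",
]
ranked = rank_content_words(texts, top_n=3)
```

The highest-ranked words would then be the candidates for the dictionary whose usage patterns are analysed in the later steps.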

The set Vi therefore defines a cluster digest of Ri. In the context of text data, word clusters
are just as important as document clusters because they provide insights about the topics
of the underlying collection. Most of the methods discussed in this book for document
clustering, such as the scatter/gather method,... (Aggarwal, 2015, p. 438)

...of reviews in a given domain. This approach first identifies nouns and noun phrases using
a POS tagger, and then counts their occurrence frequencies using a data mining algorithm,
keeping only the frequent nouns and noun phrases using an experimentally determined
frequency threshold. Hu and Liu (2004) used...

from Sentiment Analysis: Mining Opinions, Sentiments, and Emotions
by Bing Liu
Cambridge University Press, 2020

...of keywords that we may be able to use in our later aggregation and analysis. Examples of
the use of text processing within OSINT include the use of word counting models that
produce word clouds for quick overviews of documents or corpora. Meanwhile the use of
TF-IDF is demonstrated by Federica Fragapane in her...

from Open Source Intelligence Investigation: From Strategy to Implementation
by Babak Akhgar, P. Saskia Bayerl, Fraser Sampson
Springer International Publishing, 2017

The software can then use this understanding of the parsed text. The words are then
processed using a data mining clustering algorithm to get the relationships. My point here
is that a text analytics application that is developed specifically for the recruitment industry
understands the context of the industry and is...

from Building a Data Warehouse: With Examples in SQL Server
by Vincent Rainardi
Apress, 2008

...functions served in speech and writing, such as content elaboration and interaction of
communicative participants. The frequency of each feature is counted in each of the texts,
and statistical techniques are used to empirically group the linguistic features into clusters
that co-occur with a high frequency in texts.

from Variation Across Speech and Writing
by Douglas Biber
Cambridge University Press, 1991

Just like with clustering methods, we can identify words that are indicative of a particular
topic, as we did in Table 13.1. The most straightforward method for obtaining these words is
to select the highest probability words in each topic. (Grimmer, Roberts, & Brandon, 2022,
p. 152)
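
Selecting the highest-probability words per topic, as described in the quotation, takes only a few lines once topic-word probabilities are available. The probabilities below are invented for illustration; in practice they would come from a fitted topic model.

```python
def top_words(topic_word_probs, k=3):
    """Return the k highest-probability words for each topic."""
    return {
        topic: [
            word
            for word, _ in sorted(
                probs.items(), key=lambda kv: kv[1], reverse=True
            )[:k]
        ]
        for topic, probs in topic_word_probs.items()
    }

# Hypothetical topic-word distributions (truncated, so rows do not sum to 1).
topics = {
    "topic_0": {"tax": 0.30, "budget": 0.25, "vote": 0.05, "school": 0.02},
    "topic_1": {"school": 0.28, "teacher": 0.22, "tax": 0.03, "vote": 0.04},
}
labels = top_words(topics, k=2)
```

Here `labels` maps each topic to its two most probable words, which can serve as a rough label for that topic.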
