9 Digital Technologies and Data Systems
Data science is the process by which the power of data is realised, it is how we
find actionable insights from among the swathes of data that are available.
David Stuart (2020, xvi)
Introduction
Technology, from the Greek techné, meaning art, skill or craft, is usually
taken to mean the understanding of how to use tools, in the broadest sense
of the word. The term information technology was first used in the 1950s to
describe the application of mechanised documentation and then-new digital
computers, and became widely used in the 1980s to describe the wider-
spread use of digital technology (Zorkoczy, 1982).
Information technology is usually associated with computers and
networks. But, in a wider sense stemming from the original meaning of the
word, the technologies of information include all the tools and machines
which have been used to assist the creation and dissemination of information throughout history.
Digital technologies
We will describe these aspects only in outline; see Ince (2011) and Dasgupta
(2016) for more detailed but accessible accounts, and Primiero (2020) for a
more advanced summary; for accounts of the historical development of the
computer, see Ceruzzi (2012) and Haigh and Ceruzzi (2021).
Any digital device represents data in the form of binary digits or bits.
Patterns of bits, bit strings with bits conventionally represented by 0 and 1,
may represent data or instructions. A collection of eight bits is known as a
byte. Quantities of data are represented as multiples of bytes, for example: a
kilobyte, defined as 2¹⁰ bytes, i.e. 1024 bytes.
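A minimal Python sketch illustrates these quantities; the value shown is arbitrary.

    # Any integer can be written out as a bit string, and the standard
    # multiples of the byte are simple powers of two.
    value = 77
    print(format(value, '08b'))   # '01001101' - eight bits, one byte

    kilobyte = 2 ** 10            # 1024 bytes
    megabyte = 2 ** 20            # 1,048,576 bytes
    print(kilobyte, megabyte)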
Any character or symbol may be represented by a bit string, but this
requires an agreed coding. The most widely used code since the beginning
of the computer age was the American National Standards Institute’s ASCII
(American Standard Code for Information Interchange) 7-bit code, but it is
limited to the Latin alphabet, Arabic numerals and a few other symbols. It has largely been superseded by the Unicode standard, whose encodings, such as UTF-8, can represent the characters of virtually all writing systems.
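A short Python sketch shows such an agreed coding at work; the sample text is arbitrary.

    # Each character maps to an agreed code; ASCII uses seven bits per character.
    for ch in 'Hi!':
        code = ord(ch)
        print(ch, code, format(code, '07b'))

    # Encoding a string yields the corresponding sequence of bytes.
    print('Hi!'.encode('ascii'))      # b'Hi!'
    print('café'.encode('utf-8'))     # the é needs two bytes in UTF-8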
The basic architecture of the modern digital computer was set out by John von Neumann in the 1940s (Figure 9.1). His design, which gave the first formal description of a single-memory stored-program computer, is shown in Figure 9.2.
Figure 9.1 John von Neumann (Alan Richards photographer. From the Shelby White and
Leon Levy Archives Center, Institute for Advanced Study, Princeton, NJ, USA)
This architecture is general purpose, in the sense that it can run a variety
of programs. This distinguishes it from special-purpose digital computers
which carry out only one task, as in digital cameras, kitchen appliances, cars,
etc. A von Neumann machine loads and runs programs as necessary, to
accomplish very different tasks.
The heart of the computer, the processor, often referred to as the central
processing unit (CPU), carries out a set of very basic arithmetical and logical
operations with instructions and data pulled in from the memory, also
referred to as main or working memory. This sequence is referred to as the
fetch–execute cycle. Two components of the processor are sometimes
distinguished: an arithmetic and logic unit, which carries out the operations,
and a control unit, which governs the operations of the cycle.
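The cycle can be sketched in a few lines of Python; the three-instruction 'machine' below is invented purely for illustration.

    # Toy fetch-execute loop: 'memory' holds instructions; the control
    # loop fetches each one in turn, and the arithmetic is done in the
    # branches. Real processors repeat this billions of times per second.
    memory = [('LOAD', 5), ('ADD', 3), ('STORE', None)]
    accumulator = 0
    result = None

    for opcode, operand in memory:       # fetch
        if opcode == 'LOAD':             # execute
            accumulator = operand
        elif opcode == 'ADD':
            accumulator += operand
        elif opcode == 'STORE':
            result = accumulator

    print(result)                        # 8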
While programs and data are being used they are kept in the memory.
While not being used they are stored long-term in file storage. Items in
memory are accessible much more rapidly than items in file storage, but
memory is more expensive, so computers have much more file storage than
memory.
Data comes into the computer through its input devices and is sent out to the outside world through its output devices. The components are linked
together through circuits usually denoted as a bus, sometimes referred to
more specifically as a data bus or address bus.
All of these components have undergone considerable change since the
first computers were designed. Processor design has gone through three
main technological stages. The first generation of computers used valves as
processor components, and the second generation used transistors.
Computers of the third generation, including all present-day computers, use
circuits on ‘silicon chips’, created by very-large-scale integration (VLSI), which packs many millions of components onto a single chip.
Networks
Virtually all computers are connected via some form of network to others, to
enable communication and information access and sharing. Since the 1990s
the internet and the World Wide Web have become ubiquitous, such that it
has become difficult to think about digital technologies without these at
centre stage.
The growth of networked computing has been driven by three factors:
communications technology, software and standards. In terms of
technology, older forms of wired network, originating in the twisted copper-
pair cables used for telegraph systems from the mid-19th century, have been
succeeded by fibre-optic cables and various forms of wireless transmission.
Software
Software, in its various forms, provides the instructions for digital
computers to carry out their tasks and can be categorised into systems
software (firmware and operating systems) and application software; for short
overviews, see Ince (2011), White and Downs (2015) and Dasgupta (2016).
Application software carries out the tasks that users ask of it: open the file with this name; calculate the mean of this column of figures; insert this image into this blog post; and so on.
All software must be written in a programming language, though this will
be invisible to the user, who will have no reason to know what language any
particular application is written in. At the risk of over-simplification, we can
say there are four kinds of software language: low-level and high-level
programming languages, scripting languages and mark-up languages.
Low-level programming languages, also referred to as assembly languages
or machine code, encode instructions at the very detailed level of processor
operations; these languages are therefore necessarily specific to a particular
type of computer. This kind of programming is complex and difficult, but
results in very efficient operation; it is reserved for situations where reliably
fast processing is essential.
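Python is a high-level language, but its standard dis module exposes the lower-level instructions that the interpreter actually executes, giving a flavour of what instruction-level code looks like (bytecode rather than true machine code).

    import dis

    def add_one(x):
        return x + 1

    # Show the instruction-level (bytecode) form of a one-line function.
    dis.dis(add_one)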
To create a piece of software, the task to be performed is analysed and its steps are converted into statements in an appropriate language; this
is the process of coding or programming. For small-scale software
development, perhaps for a piece of code to be used temporarily or just by
one person, writing the code and testing that it works correctly is sufficient.
For a major piece of software, other stages will include design, testing and
maintenance; these stages, together with the coding stages, make up system
development or software engineering. For accessible introductions to these
topics, see Louridas (2020) and Montfort (2021).
One influence on the way software is created is the availability of libraries
of subroutines, packages of code in a particular language to undertake
specific tasks which can be included in programs by a single call. If a
subroutine library (also called a code library) is available for a particular topic,
it means that a coder can create a program very quickly and economically by
a series of calls to subroutines, without having to write the code in detail; see
Wintjen (2020) for detailed examples. An increasing number of subroutine libraries are available for topics of relevance to the information
disciplines. For example, the Python language has libraries of subroutines
for text processing, for extracting data from URLs, for extracting data from
mark-up languages such as HTML and XML and for processing data from
Wikipedia.
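A small sketch using only Python's standard library shows the economy of working this way: fetching a web page and parsing its HTML are each a single call or a short subclass. The URL is an illustrative placeholder.

    # Fetching a URL and extracting data from its HTML mark-up, using
    # only standard-library subroutines.
    from urllib.request import urlopen
    from html.parser import HTMLParser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                self.links.extend(value for name, value in attrs if name == 'href')

    html = urlopen('https://example.org').read().decode('utf-8')
    collector = LinkCollector()
    collector.feed(html)
    print(collector.links)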
An innovation in the way software is provided has come from the open
source movement. Commercial software is usually provided as a ‘black box’;
the user has no access to the program code and therefore cannot modify the
system at all, nor even know exactly how it works. With open source software,
the user is given the full source code – the original programs – which they are
free to modify. This makes it easy to customise software to meet local needs
and preferences. It also allows users to collaborate in extending and
improving the systems, correcting errors, etc. Most such software is free,
leading it to be known as FOSS (Free and Open Source Software). Well-known
examples of full FOSS systems relevant to library and information
management are the Koha and FOLIO-ERM library management systems, the
EPrints repository system, the Drupal content management system and the
Open Journal Systems (OJS) e-journal publishing system; see, for example,
Breeding (2017) and Choo and Pruett (2019). Many smaller-scale FOSS
software products are available; the GitHub repository is the most commonly
used means for making these available (Beer, 2017). GitHub is commonly
integrated with the Jupyter notebook environment, which combines simple
word processing with the ability to create and run programs; this integration
allows for testing, sharing and distribution of open source software and is
widely used for data-wrangling functionality, including library/information applications.
Artificial intelligence
Artificial intelligence (AI) encompasses a complex array of concepts and
processes, directed to making it possible for digital computers to do the kind
of things that minds can do; see Boden (2016), Cantwell Smith (2019) and
Mitchell (2019) for accessible overviews, Floridi (2019) for ideas of its future
and Mitchell (2021) for caveats about how much we can expect from AI,
especially with regard to endowing machines with common knowledge and
common sense. It is ‘a growing resource of interactive, autonomous, self-
learning agency, which enables computational artefacts to perform tasks
that otherwise would require human intelligence to be carried out
successfully’ (Taddeo and Floridi, 2018, 751).
Although AI as such has been possible only since the middle of the 20th
century, it has been preceded by several centuries of interest in mechanical
automata, game-playing machines, calculating machines, automated
reasoning and the like; see Pickover (2019) for an informal history. Although
Ada Lovelace mused on a ‘calculus of the nervous system’, and the
possibility of mechanical devices composing music, AI became a prospect
only with the development of the digital computer.
Alan Turing’s 1950 article ‘Computing Machinery and Intelligence’ was
the first serious philosophical investigation of the question ‘can machines
think?’ In order to avoid the problem that a machine might be doing what
might be regarded as thinking but in a very different way from a human,
Turing proposed that the question be replaced by the question as to whether
machines could pass what is now known as the Turing Test. Turing proposed
an ‘Imitation Game’, in which a judge would interact remotely with a human
and a computer by typed questions and answers, and would decide which
was the computer; if the judge could not reliably tell them apart, then the machine passed the test, as Turing believed that digital computers would eventually be able to do. Turing’s article also proposed for the first
time how a machine might learn, and noted a series of objections to the idea
of AI which are still raised today.
Data systems
The growth of data science, and the increased emphasis on managing and
curating data, as distinct from information, has been a major influence on
information science since 2010. For an accessible general introduction, see
Kelleher and Tierney (2018), for an overview aimed at library/information
applications, see Stuart (2020), and for a more detailed and practical
treatment, see Shah (2020) and Wintjen (2020). For an analysis of the idea of
data itself, see Hjørland (2020), and for examples of data practices in science,
including the issues discussed later in this chapter, see Leonelli and Tempini
(2020). We discuss this under two main headings: data wrangling, the
processes of acquiring, cleaning and storing data; and techniques for finding
meaning in data.
Data wrangling
The term data wrangling is sometimes used to refer to all activities in the
handling of data. We will consider them, and the digital tools which support
them, under four headings: collecting; storing; cleaning; and combining and
reusing. A useful overview from an information science perspective is given
in Stuart (2020, chapter 4).
Collecting
Collecting data is most simply achieved by typing it in from scratch. If it
already exists in printed form, with a recognisable structure, then optical
character recognition (OCR) may be used, though this is rarely error-free,
and entry must be followed by cleaning. If the data exists in some digital
form, then it may be selected and copied; a tool such as OpenRefine may be
useful in collecting and parsing data, and in converting between formats.
If the required data is to be found on one or more web pages, then a
process of web scraping can be used to obtain it. This is the automatic
extraction of information from websites, taking elements of unstructured
website data and creating structured data from it. Web scraping can be done
with code written for the purpose, or by using a browser extension such as
Crawly or ParseHub. This is easier if an Application Programming Interface (API) can be accessed. Described as ‘an interface for software rather than
people’, an API in general is a part of a software system designed to be
accessed and manipulated by other programs rather than by human users.
Web APIs allow programs to extract structured data from a web page using
a formal syntax for queries and a standard structure for output; see Lynch,
Gibson and Han (2020) for a metadata aggregation example.
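A brief Python sketch shows the difference an API makes: the server returns structured JSON rather than a page to be scraped. The Wikipedia summary endpoint is used here purely as an illustration and may change over time.

    import json
    from urllib.request import urlopen

    # Ask the API for a structured summary of a Wikipedia article.
    url = 'https://en.wikipedia.org/api/rest_v1/page/summary/Information_science'
    with urlopen(url) as response:
        record = json.load(response)

    print(record.get('title'))
    print(record.get('extract'))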
Storing
Data may be stored in a wide variety of formats and systems. These include
database systems such as MySQL, Oracle and Microsoft Access, and data-
analysis packages such as SAS or SPSS. Most common is the spreadsheet,
such as Microsoft Excel and Google Sheets, widely used for data entry and
storage, with some facilities for analysis and visualisation of data. All kinds
of structured data and text are stored in a two-dimensional row–column
arrangement of cells, each cell containing one data element. Consistency is
of great importance, in the codes and names used for variables, the format
for dates, etc., and can be supported by a data dictionary, a separate file
stating the name and type of each data element (Broman and Woo, 2018).
Exchange formats are relatively simple, used to move data between more
complex formats or systems. One widely used example is the CSV (Comma
Separated Values) format. It holds data in plain text form as a series of values
separated by commas in a series of lines, the values and lines being
equivalent to the cells and rows of a spreadsheet. JSON (JavaScript Object Notation) is a richer format used for the same purposes; like XML, it can represent diverse, nested data structures and relationships, rather than CSV’s two-dimensional flat files.
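A short Python sketch, using invented sample data, shows the two exchange formats side by side.

    import csv, io, json

    # A small CSV file and its JSON equivalent.
    csv_text = 'title,year\nInformation science,2022\nData science,2020\n'
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    print(json.dumps(rows, indent=2))
    # [{"title": "Information science", "year": "2022"}, ...]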
Data collections are typically bedevilled by factors such as missing data,
duplicate data records, multiple values in a cell, meaningless values and
inconsistent data presentation.
Cleaning
Data-cleaning processes are applied to create tidy data as opposed to messy
data. For a data sheet, tidy data requires that: each item is in a row; each
variable is in a column; each value has its own cell; there are no empty cells;
and the values of all variables are consistent.
To achieve this, data may be scanned by eye, and errors corrected
manually. An automated process is more efficient and less error prone.
Examples are the use of Python scripts for identification of place names in
collections of historical documents (Won, Murrieta-Flores and Martins,
2018) and for the normalisation of Latin place names in a catalogue of rare
books (Davis, 2020); see Walsh (2021) for a variety of examples. Data-
handling packages, such as SAS and SPSS, have their own data-cleaning
functions.
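A minimal cleaning sketch, assuming the third-party pandas library is installed; the column names and values are invented for illustration.

    import pandas as pd

    df = pd.DataFrame({
        'place': [' London', 'london', 'Paris', None, 'Paris'],
        'year':  ['2020', '2020', 'n/a', '2021', '2021'],
    })

    df['place'] = df['place'].str.strip().str.title()        # normalise names
    df['year'] = pd.to_numeric(df['year'], errors='coerce')  # 'n/a' becomes missing
    df = df.dropna().drop_duplicates()                       # drop empty and duplicate rows
    print(df)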
One specific data-cleaning tool has been much used by library/infor-
mation practitioners, as well as by scholars in the digital humanities.
OpenRefine, a tool designed for improving messy data, provides a
consistent approach to automatically identifying and correcting errors,
grouping similar items so inconsistencies can be readily spotted by eye and
identifying outliers, and hence possible errors, in numeric values. Examples
of its use include the cleaning of metadata for museum collections (van
Hooland et al., 2013) and extraction in a consistent form of proper names
from descriptive metadata (van Hooland et al., 2015).
The most precise means of algorithmic data cleaning uses Regular Expressions (RegEx), a formal notation for describing patterns in text, which ensures consistency. They were devised originally within theoretical computer science by the American mathematician Stephen Kleene in the 1950s, using the mathematics of set theory as a way of formally describing sets of character strings.
These expressions are used primarily for searching, parsing, cleaning and
processing text, and also for the formal modelling of documents and
databases; see applications to inconsistencies in bibliographic records
(Monaco, 2020), errors in DOIs (Xu et al., 2019), metadata aggregation
(Lynch, Gibson and Han, 2020) and identification of locations in the text of
journal articles (Karl, 2019).
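As a small illustration, the following Python sketch uses a simplified regular expression to pull DOIs out of free text; the pattern will not catch every valid DOI.

    import re

    text = 'See doi:10.1234/abc.567 and https://doi.org/10.5555/xyz-890 for details.'
    doi_pattern = re.compile(r'10\.\d{4,9}/[^\s"<>]+')

    print(doi_pattern.findall(text))
    # ['10.1234/abc.567', '10.5555/xyz-890']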
Finding meaning in data
Statistical analysis of data is typically carried out using specialist subroutine libraries for the Python and R programming languages. A variety of techniques may be used, including:
association, correlation and regression; classification and clustering; analysis
of variance and model building; and time series. When dealing with large
sets of data, and typically investigating many possible correlations and
interactions, it is essential to have a thorough understanding of statistical significance and of ways to guard against misleading and chance findings.
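A minimal sketch of correlation and simple regression using only Python's standard library (the functions shown need Python 3.10 or later); the figures are invented.

    import statistics

    loans  = [120, 150, 90, 200, 170]
    visits = [300, 360, 250, 480, 400]

    r = statistics.correlation(loans, visits)                 # strength of association
    slope, intercept = statistics.linear_regression(loans, visits)
    print(round(r, 3), round(slope, 3), round(intercept, 1))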
Data visualisation
Data visualisation is now easier than ever before, but the results are not always pleasing or informative. There are many ways of producing visualisations, including routines in coding libraries, functions in spreadsheet and database packages, and dedicated visualisation tools.
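A minimal charting sketch, assuming the third-party matplotlib library; the figures are invented for illustration.

    import matplotlib.pyplot as plt

    years  = ['2019', '2020', '2021', '2022']
    visits = [5200, 3100, 4300, 6000]

    plt.bar(years, visits)
    plt.xlabel('Year')
    plt.ylabel('Library visits')
    plt.title('Visits per year')
    plt.savefig('visits.png')     # or plt.show() for interactive use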
Text visualisation
There are many ways of visualising textual data; a survey by Linnaeus
University identified 440 distinct techniques (Kucher and Kerren, 2015).
Few, however, have gained wide use.
The most common way of visualising text documents or collections of
documents is the word cloud, a relatively simple tool for producing images
comprising the words of the text, with greater prominence given to the
words that appear most frequently. They are popular for giving quickly
assimilated summaries and comparisons of the contents of lengthy
documents, but have been criticised for giving overly simplistic views of
complex data. The same is true of other simple text-visualisation methods,
such as the termsberry and the streamgraph. Some suggestions for
improvement by manipulating the layout of word clouds to make them
more meaningful are given by Hearst et al. (2020).
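Word frequencies are the raw material of a word cloud; a minimal Python sketch using only the standard library is shown below, after which a package such as the third-party wordcloud library can draw the image. The sample text and stopword list are invented.

    import re
    from collections import Counter

    text = 'Information science studies information: its creation, its use, its curation.'
    words = re.findall(r'[a-z]+', text.lower())
    stopwords = {'its', 'the', 'of', 'and', 'a'}

    freq = Counter(w for w in words if w not in stopwords)
    print(freq.most_common(3))
    # [('information', 2), ('science', 1), ('studies', 1)]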
Text mining
Text mining involves analysing and displaying the structure of texts. There
are a variety of means to achieve this. Coding with specialised libraries, such
as the Python language’s TextBlob, gives greatest flexibility. General-
purpose qualitative analysis packages, such as NVivo and MAXQDA, can be
used for text mining, but are complex solutions to what may be only a
simple need. Web-based environments for analysis, visualisation and distant
reading of texts, such as Voyant Tools, provide a variety of analyses,
including word clouds, word-frequency lists, common words shown in
context, and occurrence of terms in different sections of a web document.
Network analysis software such as VOSviewer can be used with bodies of text to map and visualise the relationships between the terms and documents they contain.
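A small text-mining sketch, assuming the third-party TextBlob library is installed and its corpora downloaded (python -m textblob.download_corpora); the sample sentences are invented.

    from textblob import TextBlob

    blob = TextBlob('The new library catalogue is excellent. Searching it is slow.')

    print(blob.noun_phrases)      # noun phrases such as 'new library catalogue'
    print(blob.sentiment)         # overall polarity and subjectivity
    for sentence in blob.sentences:
        print(sentence, sentence.sentiment.polarity)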
Summary
Digital technologies and data systems form the bedrock of the infosphere.
An understanding of their nature and significance, and an ability to make
practical use of relevant aspects, is essential for all information professionals.
• The technologies of information include all the tools and machines which have
been used to assist the creation and dissemination of information throughout
history.
• An understanding of the basics of computer and network architecture, coding, AI
and HCI and the principles of data science is essential for all information
professionals.
Key readings
Dasgupta, S. (2016) Computer science: a very short introduction, Oxford: Oxford
University Press.
Ince, D. (2011) The computer: a very short introduction, Oxford: Oxford University Press.
Shah, C. (2020) A hands-on introduction to data science, Cambridge: Cambridge
University Press.
Stuart, D. (2020) Practical data science for information professionals, London: Facet
Publishing.
References
Ahmed, W. and Lugovic, S. (2019) Social media analytics: analysis and visualisation
of news diffusion using NodeXL, Online Information Review, 43 (1), 149–60.
Alpaydin, E. (2016) Machine learning, Cambridge MA: MIT Press.
Arango, A. (2018) Living in information: responsible design for digital places, New York:
Two Waves Books.
Beer, B. (2017) Introducing GitHub: a non-technical guide (2nd edn), Sebastopol, CA:
O’Reilly.
Boden, M. (2016) AI: its nature and future, Oxford: Oxford University Press.
Breeding, M. (2017) Koha: the original open source ILS, Library Technology Reports,
53 (6), 9–17.
Broman, K. W. and Woo, K. H. (2018) Data organization in spreadsheets, American
Statistician, 72 (1), 2–10.
Cantwell Smith, B. (2019) The promise of artificial intelligence: reckoning and judgement,
Cambridge MA: MIT Press.
Ceruzzi, P. E. (2012) Computing: a concise history, Cambridge MA: MIT Press.
Choo, N. and Pruett, J. A. (2019) The context and state of open source software
adoption in US academic libraries, Library Hi Tech, 37 (4), 641–59.
Coeckelbergh, M. (2020) AI ethics, Cambridge MA: MIT Press.
Dahya, N., King, W. E., Lee, K. J. and Lee, J. H. (2021) Perceptions and experiences of
virtual reality in public libraries, Journal of Documentation, 77 (3), 617–37.
Dasgupta, S. (2016) Computer science: a very short introduction, Oxford: Oxford
University Press.
Davis, K. D. (2020) Leveraging the RBMS/BSC Latin places names file with Python,
Code4Lib Journal, issue 48. Available at https://journal.code4lib.org/articles/15143.
Dick, M. (2020) The infographic: A history of data graphics in news and communications,
Boston MA: MIT Press.
Eck, N. and Waltman, L. (2017) Citation-based clustering of publications using
CitNetExplorer and VOSviewer, Scientometrics, 111 (2), 1053–70.
Engard, N. C. (2014) More library mashups: exploring new ways to deliver library data,
London: Facet Publishing.
Floridi, L. (2019) What the near future of artificial intelligence could be, Philosophy and
Technology, 32 (1), 1–15.
McCarthy, J., Minsky, M. L., Rochester, N. and Shannon, C. E. (1955) A proposal for
the Dartmouth summer research project on artificial intelligence. Available at
https://chsasank.github.io/classic_papers/darthmouth-artifical-intelligence-
summer-resarch-proposal.html.
Miller, A. (2018) Text mining digital humanities projects: assessing content analysis
capabilities of Voyant Tools, Journal of Web Librarianship, 12 (3), 169–97.
Minsky, M. (1987) The society of mind, London: Heinemann.
Mitchell, M. (2019) Artificial intelligence. A guide for thinking humans, London:
Pelican.
Mitchell, M. (2021) Why AI is harder than we think. arXiv: 2104.12871. Available at
https://arxiv.org/abs/2104.12871.
Monaco, M. (2020) Methods for in-sourcing authority control with MarcEdit, SQL,
and Regular Expressions, Journal of Library Metadata, 20 (1), 1–27.
Montfort, N. (2021) Exploratory programming for the arts and humanities (2nd edn),
Cambridge MA: MIT Press.
Nilsson, N. J. (2012) John McCarthy 1927–2011: a biographical memoir, Washington
DC: National Academy of Sciences. Available at
www.nasonline.org/publications/biographical-memoirs/memoir-pdfs/mccarthy-john.pdf.
Pangilinan, E., Lukos, S. and Mohan, V. (eds) (2019) Creating augmented and virtual
realities, Sebastopol CA: O’Reilly.
Pickover, C. A. (2019) Artificial intelligence: an illustrated history, New York: Stirling.
Primiero, G. (2020) On the foundations of computing, Oxford: Oxford University
Press.
Robinson, L. (2015) Immersive information behaviour: using the documents of the
future, New Library World, 116 (3–4), 112–21.
Rolan, G., Humphries, G., Jeffrey, L., Samaras, E., Antsoupova, T. and Stuart, K.
(2019) More human than human? Artificial intelligence in the archive, Archives
and Manuscripts, 47 (2), 179–203.
Rosenfeld, L., Morville, P. and Arango, J. (2015) Information architecture: for the web
and beyond, Boston MA: O’Reilly.
Shah, C. (2020) A hands-on introduction to data science, Cambridge: Cambridge
University Press.
Sharp, H., Preece, J. and Rogers, Y. (2019) Interaction design: beyond human–computer
interaction (5th edn), Indianapolis: Wiley.
Stuart, D. (2020) Practical data science for information professionals, London: Facet
Publishing.
Taddeo, M. and Floridi, L. (2018) How AI can be a force for good, Science, 361 (6404),
751–2.
Tokarz, R. E. (2017) Identifying criminal justice faculty research interests using
Voyant and NVivo, Behavioral and Social Sciences Librarian, 36 (3), 113–21.
White, R. and Downs, T. E. (2015) How computers work (10th edn), Indianapolis IN:
Que.
Williams, B. (2020) Dimensions and VOSviewer bibliometrics in the reference
interview, Code4Lib Journal, issue 27. Available at
https://journal.code4lib.org/articles/14964.
Wintjen, M. (2020) Practical data analysis using Jupyter Notebook, Birmingham: Packt.
Won, M., Murrieta-Flores, P. and Martins, B. (2018) Ensemble Named Entity
Recognition (NER): evaluating NER tools in the identification of place names in
historical corpora, Frontiers in Digital Humanities, 5.2.
DOI:10.3389/fdigh.2018.00002.
Xu, S., Hao, L., Zhai, D. and Peng, H. (2019) Types of DOI errors of cited references
in Web of Science with a cleaning method, Scientometrics, 120 (3), 1427–37.
Zorkoczy, P. (1982) Information technology: an introduction, London: Pitman.