Big Data Driven Investigation Into The Maturity of - 2023 - The Journal of Acade
ARTICLE INFO

Original content: Research data set for publication "Big data-driven investigation of the maturity of library research data services (RDS)" (Original data)

Keywords:
Research data management (RDM)
Research data services (RDS)
RDS maturity
Academic libraries
Datafication
Data librarianship

ABSTRACT

Research data management (RDM) poses a significant challenge for academic organizations. The creation of library research data services (RDS) requires assessment of their maturity, which is the primary objective of this study. Its authors have set out to probe the nationwide level of library RDS maturity, based on the RDS maturity model proposed by Cox et al. (2019), while making use of natural language processing (NLP) tools typical for big data analysis. The secondary objective consisted in determining the actual suitability of the above-referenced tools for this particular type of assessment. Web scraping, based on 72 keywords and completed twice, allowed the authors to select, from the list of 320 libraries, those that run RDS, i.e., 38 (2021) and 42 (2022), respectively. The content of the websites run by the academic libraries offering a scope of RDM services was then appraised in some depth. The findings allowed the authors to identify the geographical distribution of RDS (academic centers of various sizes), the scope of activities undertaken in the area of research data (divided into three clusters, i.e., compliance, stewardship, and transformation), and the overall potential for their prospective enhancement. Although the present study was carried out within a single country only (Poland), its protocol may easily be adapted for use in other countries, with a view to making a viable comparison of pertinent findings.

* Corresponding author.
E-mail address: [email protected] (M. Nahotko).
https://doi.org/10.1016/j.acalib.2022.102646
Received 11 August 2022; Received in revised form 5 November 2022; Accepted 16 November 2022
Available online 29 November 2022
0099-1333/© 2022 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
M. Nahotko et al. The Journal of Academic Librarianship 49 (2023) 102646
in quite surprising ways, in a different place and time, potentially by anyone with access to the new communication technologies infrastructure. Bibliographic and catalogue data, assembled and made available in the libraries, have long been the focus of academicians looking for resources for quantitative research that makes use of big data techniques. In recent years, bibliographic data science has been proposed as a new approach to bibliographic data, particularly catalogue metadata (Lahti et al., 2019).

It has been noted that a systematic analysis of library catalogues yields a wealth of information on the historical dynamics of knowledge production and dissemination (Tolonen et al., 2019). In bibliographic data science, library metadata is treated as a research object based on more general open science and data science paradigms, also treated as open data science (Uhlir & Schröder, 2007). Data mining in relation to library data is also construed as bibliomining (Tu et al., 2021, 3), i.e., the study and mapping of user behavior and resource use. This practice facilitates collecting unstructured data, as they are being generated by the librarians during routine library management and service delivery processes, or by the library users themselves (Liu & Shen, 2018), unlike the bibliographic (structured) data generated through cataloguing.

This study addresses three closely related research areas. Firstly, we focus on these unstructured library data, which are less frequently analyzed but, when analyzed, may offer some valuable insights as to how the overall user satisfaction level, and the value of information services at large, might be boosted (Ahmed & Ameen, 2017). The overall aim consists in presenting the potential for making use of unstructured data and information resources collected on the websites of academic libraries as the data sources for quantitative scientific research. By way of gaining insights into the RDM processes of academic libraries, as available on their websites, the maturity of the RDS provided by these libraries was assessed. Academic libraries usually offer on their websites a variety of information about managing research data at different stages of their lifecycle (Lewis, 2010). Assessment of the actual content of the library websites pertaining specifically to research data allowed the article authors to determine the overall maturity level of their services.

Secondly, the analyses of unstructured data were carried out using techniques typical for big data and natural language processing (NLP), which allow data and information resources created to support library users to be used for other purposes, previously not anticipated by the creators of these data and information resources. The big data techniques applied, e.g., web scraping, make it possible to use the information resources in similar research not only with regard to the libraries, but also to other GLAM sector institutions (i.e., galleries, libraries, archives, and museums), as well as other institutions providing a scope of cultural/intellectual services. Our research demonstrated that library websites contain resources that may be used as an indicator of their datafication level. These data can be managed systematically, taking due account of local specifics.

Thirdly, the study addresses the maturity of library services related to research data (RDS) in the area through the assessment of the library resources related to research data management (RDM). This means that the research data are treated in the article on a multi-level basis (Fig. 1). On the one hand, data librarians use tools such as the library websites to post their information resources relevant to the information needs of their users. In this case, these are the academic investigators who create and subsequently use, as well as reuse, the research data in the research cycle. In this way, the librarians run data management and a scope of data curation activities which are collectively termed data librarianship (Semeler et al., 2019). On the other hand, analysis of these resources, aided by big data tools, may identify their specific properties which are in fact indicative of the level of RDS maturity. It is therefore possible to determine the current status and development directions of the library's RDS, its benefits for the users, and key conditions for further development, as this actually pertains to the individual level of professional expertise to be commanded by the librarians.

The maturity of RDS is construed as being in continuous evolution, i.e., evolving from a primary to a highly advanced level (Tiwari & Madalli, 2021), with principal focus resting on professional excellence. While maturity might imply the completeness of these services, this is not the desired status to be actually achieved, but rather to be pursued as an evolutionary progress towards a specific objective. The maturity assessment is one of the conventional approaches used for determining the level of sophistication in the range of the services rendered (Kouper et al., 2017). The maturity model is a tool used for assessing the current status and any future prospects for any attendant process, person, or a group of individuals involved in RDS. It is also used to identify the specific factors that might prove instrumental in a prospective enhancement of this service.

Even though the actual scope of the present study is confined to the library environment within a single country (Poland), the study protocol is easily adaptable to pursue similar, comparative studies in other countries. Any internationally pursued studies stand a good chance of providing much broader insights into the select processes, prevalent trends, and the actual impact of a specific geographical location on RDS. This in turn might well offer valuable pointers as to what might be anticipated in that regard in the foreseeable future.

Theoretical background: related literature

Currently, intensive research is carried out in the areas referenced in the Introduction. A number of earlier published studies are reviewed further below, divided into four key areas.

Datafication, data science and data scientists

The rapid development of data recording, processing, and storage technologies resulted in a dynamic increase in the overall volume of assembled research data. Such growth dynamics pose a challenge in their routine use and reuse. Extracting a specific body of knowledge from large collections of research data requires their prior reduction and aggregation, i.e., a process termed datafication (Mayer-Schönberger & Cukier, 2013). This entails presenting the phenomena under study in a quantified form, and in an aggregated manner, thus enabling their further processing.

The greater frequency of large data set processing, along with rapid development of the methods and techniques of such processing, have brought about manifest changes in the perception of the body of knowledge being extracted in this way, as well as of the extraction process itself. Such trends are well exemplified by the data-centric way of perceiving reality, effectively shaping one's mindset, as well as setting one's mental processing algorithms into the “higher authority” mode (Osika, 2021). Entrenching oneself in such an assertion may well lead to “data-ism”, that is, an actual belief in the limitless potential of algorithms as effective tools for reading the “world paradigms”, as well as deeming the information inflow the absolute value within its own right (Brooks, 2013; Harari, 2017, 75).

As the key definitions in this field have not been standardized as yet, this has given rise to numerous terminological considerations regarding the processes of transforming the pertinent resources into a digital form, as well as the tools for the “objectification of knowledge-forming processes” (Goczyła, 2021). For the purpose of this study, the authors use the term ‘datafication’, construed as the process in which the social activities, their agents, objects, and routine practices are transformed into quantified digital data (Mayer-Schönberger & Cukier, 2013). In this process, components of reality, from physical quantities through socio-demographic properties, to a body of human experience and emotions, are subject to the processes of registration, quantification, aggregation, and algorithmisation, with a view to finding their equivalent in the data stream (Iwasiński, 2020).

The value of research data, especially those generated in the course of publicly financed research, stems from the need and potential for their systematic sharing and reuse (Feger et al., 2020; Karasti et al., 2006). Open access to these resources and their dissemination boasts numerous advantages, as compared to a closed system which significantly hinders access and any subsequent reuse of such data. It aids maximizing the research potential of new digital technologies and communication networks, thus boosting the revenues generated from public research investment (Arzberger et al., 2004).

An increased volume of data in digital format has triggered the need for pursuing formal research into the area. A diversity of phenomena occurring throughout the “research data lifecycle” have become the subject of academic inquiries, investigations, and debates, and have posed numerous research questions and hypotheses. This has ultimately prompted the emergence of the field of science known as data science (also called e-science), defined as “an interdisciplinary field using scientific methods, processes, algorithms, and systems to extract insights from many structured and unstructured data” or, shorter still, “the study of the generalizable extraction of knowledge from data” (Dhar, 2013). Data science is related to data mining, machine learning, and big data analysis, and it uses statistics, data analysis, machine learning, expertise in specific domains, and related methods, in order to comprehend, appreciate, and analyze data (Hayashi, 1998).

The authorship of the term data science is attributed to Peter Naur, a Danish IT pioneer and Turing Award winner. The researcher was probably the first one to use this term in 1960. He proposed that data science was the study of the generalizable extraction of knowledge from data (Naur, 1974; Wainer, 2015, 2; Virkus & Garoufallou, 2019). Further uses of the term emerged during the 1960s and 70s, construing it as the science of dealing with data, once they have been established, whereas the actual relation of data to what they represent is delegated to other fields of knowledge and academic domains (Naur, 1974).

The review of literature on the subject attests to the fact that the term data science is defined in many ways, even though two distinct approaches may readily be identified:

• a general, chronologically older approach, in line with which data science, apart from dealing with the processing of large data sets (in order to extract knowledge), also covers the issues related to the collection, development, management, and storage of large data sets (Ratner, 2017; Stanton et al., 2012),
• a narrower, currently dominant approach, which proposes reducing data science down to the techniques, methods, and tools used to obtain specific values and insights from the data sets (Provost & Fawcett, 2013; Song & Zhu, 2016; Amirian et al., 2017; Kelleher & Tierney, 2018).

Along with defining the meaning of the term data science, there is an ongoing debate on the legitimacy of establishing it as a fully autonomous, brand-new academic discipline, instead of regarding it as a mere extension of statistical methods (Cleveland, 2001; Diggle, 2015). Some researchers also associate data science with such terms as business analytics, operations research, business intelligence, competitive intelligence, data analysis and modelling, and knowledge extraction from big data (Foreman, 2013; Kelleher & Tierney, 2018).

Regardless of the arguments related to data science terminology, the literature on the subject highlights the importance of specialist (professional/domain) expertise, which tangibly aids understanding and appreciation of the true problems and realities of specific fields of knowledge. This way the actual application of data science in resolving true-to-life challenges in particular academic domains is put into its proper context (Sanchez-Pinto et al., 2018). Recent years have brought more widespread application of data science in many fields - from business and economics, through public policy, to science and education (Voulgaris, 2014). The acquired body of knowledge can help describe what happened, explain why something happened, and predict what might happen (Baškarada & Koronios, 2017), contributing to the optimization and increase in the efficiency of the processes (Granville, 2014).

Indubitably, there is a growing demand for staff boasting sufficient expertise in broadly construed data resource management, focused predominantly on data analytical skills. Data analytics has even been described as the most attractive (sexiest) job of the 21st century (Davenport & Patil, 2012, 70). No clear definition of either a data scientist or a data manager has as yet been established and adopted. At the moment, these professionals are known by a variety of titles, e.g., data steward, data analyst, data engineer, data curator, depending on the specific stages of data management and handling within the data lifecycle they happen to be involved in. The overall expertise, combined with the multitude and sheer diversity of tasks that actually make up the profession of an academic researcher, is downright impossible to find in a single individual. No one can be an expert in so many areas at the same time. This merely goes to show there is a need to build teams of individuals well adept in the many different areas of expertise required for RDM tasks (Schutt & O'Neil, 2014, 10).

Datafication, a single aspect of this broad area of expert knowledge, is being investigated in terms of RDM and the attendant services which support the processes (RDS). It is construed as an area of interest and an essential practical activity of academic libraries which have taken up overall responsibility for building the research data repositories and supporting both the scientists pursuing their research, and the university administrators in their management of research processes (Laskowski, 2021).

Data librarianship and research data management (RDM)

Data science may well be deemed a new area of research for IT professionals and librarians, otherwise called data librarians (Semeler et al., 2019), aiming to address the issues related to data management and analysis. Data librarians focus on disseminating important research findings in the form of relevant information by collecting, organizing, and managing data from multiple sources. They function as the facilitators of research at all stages of the scholarly research cycle, by providing all kinds of data, information, and services required in the data management and curation process. Hence, their scope of activities is all the more closely tailored to the requirements and tasks pursued by the researchers within the scope of RDM (Al-Jaradat, 2021). At the same time, the data librarian becomes a data scientist who uses his expertise on data and computing to create brand-new, data-driven products and services that support new forms of data-intensive scholarship (Tang & Hu, 2019). The field of data librarianship constantly and systematically probes the competencies and responsibilities of its professionals in terms of current and future e-science trends. To take this even further, some professionals believe that a data librarian should possess the same standard of skills as a data scientist (Xiu & Wang, 2014).

Currently, data librarianship is focused on creating new library services based on new approaches to supporting RDM and curating digital data created during scientific research (Koltay, 2017; Singh et al., 2022). One of the stages of RDM maturity development is the application of the principles, practices, and resources of traditional librarianship to research data. It is, therefore, posited that data librarians should have the technical skills and knowledge necessary for data management, data curation, and competences to support RDM as the necessary basis for the provision of RDS supporting data science and scientific communication (Tenopir et al., 2014). The expectation is that data librarians should be able to collaborate with scientists carrying out research during all stages of their work in the documentation processes, ensuring that these operate effectively (Xu et al., 2022). McCaffrey and Giesbrecht (2016) presented several skills and competencies of data librarians, identifying three main areas of activity for this profession: a) data management and curation; b) data visualization and geospatial representation; and c) advanced information services.

The qualifications and competencies of the data librarian are likely to affect the quality of the RDM services, the immaturity of which researchers have pointed out. For example, researchers have observed the immaturity of RDM services in libraries, such as the lack of appropriate union catalogues (Yoon, 2017). A 2019 study indicated that 35 % of libraries in the US offered data curation services, and a further 15 % were preparing to implement them (Yoon & Donaldson, 2019). Data management planning (DMP) and data organization/description were the two most frequently proposed services, while later studies indicated training and data storage in repositories as being a priority (Yidavalapati et al., 2021).

According to Whyte and Tedds (2011), RDM is concerned with the organization of data from its entry into the research cycle to the dissemination and archiving of valuable results. It aims to effectively verify the results and allows for new and innovative research based on existing information. RDM is usually associated with the preservation and curation processes, the former dealing with the provision of long-term retention of data in repositories, or at a data journal publisher for reuse. The latter ensures that the results of a research project can be archived and reused many times. Issues of data quality are also important (Al-Jaradat, 2021).

Another definition states that library RDM encompasses many activities and processes related to the research data lifecycle, including data design and creation, storage, protection, preservation, retrieval, dissemination, and reuse, considering technical feasibility, ethical and legal considerations, and organizational framework (Cox & Pinfield, 2014). Similarly, the activities of RDM are presented by the OCLC, proposing three categories of these activities: 1) educational services to highlight the importance of RDM, the training of researchers in basic RDM skills, and guidance to RDM tools and resources; 2) expert services, offering decision support and customized solutions to RDM problems indicated by the researchers; 3) curation services for technical infrastructure and related services that support RDM throughout the research process (Bryant et al., 2017).

Kim and Syn (2021) distinguished six stages in the data lifecycle: 1) creating new data or reusing existing data; 2) description (metadata); 3) collecting data in an appropriate format; 4) data analysis to achieve results; 5) dissemination in data journals, repositories, etc.; 6) long-term archiving. The stages were mapped into three types of library services: education, expertise, and curation. Yoon and Schultz (2017) surveyed 185 websites of US academic libraries and categorized their research data resources into four main areas of interest, i.e., service, information, education, and network. In the conclusions, they asserted that libraries should be more involved in the delivery of RDS, the provision of online information, and the creation of educational services. Of particular note was the fact that there were no services related to the iterative use of data.

Kouper et al. (2017) proposed a typology of RDS implemented in libraries. Based on the literature and their own research, they distinguished three groups of services, which allows for uniform comparison of services between institutions and determining their maturity. This breakdown is as follows:

• core group: data management plan (DMP), assistance and mandate support, consultations, and instructions, best practices,
• intermediate group: data deposit and repositories, archiving and preservation, metadata, storage, sharing and reuse,
• advanced group: data processing and analysis, data curation, acquisition and citation, copyright, software and hardware, policies, data reference.

The participants in the RDM processes comprise scientists, data managers, IT professionals, librarians, and archivists, among others (Whyte, 2014). For this reason, satisfying the diverse needs of all the participants involved in the data lifecycle may not be assigned to a single department within any organization. RDM, as an interdisciplinary specialty, operates under the conditions of mixing knowledge and skills (Ran et al., 2021). Therefore, collaboration and partnership between the library, the IT department, and other internal and external units of the organization are essential for the development and success of RDM services (Delserone, 2008; Pinfield et al., 2014; Verbaan & Cox, 2014).

Cox et al. (2019) believe that RDM services could bring about a fundamental change in the role played by academic libraries in the institution due to their increasing involvement in supporting scientific research through deeper involvement in the research process, even to the point of librarians participating directly as members of the research team. Another important aspect is the decline in the role of the library as a provider of literature from outside the organization and a shift towards the internal production of data, information, and knowledge, which brings more focus on curation, preservation, and reuse. This, in turn, demands new competencies among library professionals, especially with regard to activities such as data analysis, visualization, and research integration, thus making them data librarians (Berman, 2017).

Despite discrepant concepts as to what exactly constitutes library RDS, there are some commonly identified features of these services. One is the consulting services, seen as having a distinct advantage over the many practical or general technological activities carried out by the libraries (Tenopir et al., 2017). This may be related to the practice of matching RDS to the scope of services traditionally provided by the libraries. Another common phenomenon is the lack of RDM competency among the library staff (Xu et al., 2022). In general, librarians show a strong commitment to RDM activities, although to a different extent, depending on the region and the actual type of library.
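
As a rough illustration of how the three-group typology of Kouper et al. (2017), listed earlier in this section, might support a uniform comparison between institutions, the Python sketch below computes the share of each group's services that a library offers. The names `RDS_GROUPS` and `group_coverage` are hypothetical, the service labels are normalized paraphrases of the list above, and the coverage ratio is an assumed scoring rule chosen for illustration; it is not the method used in the study or by Kouper et al.

```python
# Service groups paraphrased from the typology of Kouper et al. (2017).
# Label wording and the split into individual services are assumptions.
RDS_GROUPS = {
    "core": {
        "data management plan",
        "assistance and mandate support",
        "consultations and instructions",
        "best practices",
    },
    "intermediate": {
        "data deposit and repositories",
        "archiving and preservation",
        "metadata",
        "storage",
        "sharing and reuse",
    },
    "advanced": {
        "data processing and analysis",
        "data curation",
        "acquisition and citation",
        "copyright",
        "software and hardware",
        "policies",
        "data reference",
    },
}


def group_coverage(offered_services):
    """Return, for each group, the fraction of its services that are offered."""
    offered = {service.strip().lower() for service in offered_services}
    return {
        group: len(services & offered) / len(services)
        for group, services in RDS_GROUPS.items()
    }


if __name__ == "__main__":
    # Hypothetical library offering three services.
    example = ["Data management plan", "Metadata", "Storage"]
    for group, share in group_coverage(example).items():
        print(f"{group}: {share:.0%}")
```

Because every library is scored against the same fixed lists, the resulting shares are directly comparable across institutions, which is the property the typology is meant to provide.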

Big data in libraries

Understanding big data is often intuitive (Garoufallou & Gaitanou, 2021, 411; Li et al., 2017). Laney (2001) presented the characteristics of big data as data that cannot be processed by traditional data management tools. Others (Dumbill, 2013) suggested a definition of big data as data that exceed the processing capacity of conventional database systems, which calls for finding alternative ways of processing them. Big data are data generated constantly, automatically, and rapidly (Reinhalter & Wittman, 2014). Business leaders from around the world have for years recognized the importance and value of big data analytics due to its enormous operational and strategic potential, while generating a demand for data specialists (Manyika et al., 2011; Provost & Fawcett, 2013).

In academic libraries there is a large amount of data, both structured and unstructured (Tella & Kadri, 2021), covering the entire spectrum, from structured metadata sets, e.g., Online Public Access Catalog (OPAC) data, through electronic content in various forms, to large amounts of unstructured data of all kinds, including those made available on the Internet. These data are supplemented with constantly growing resources of information from various sources available on the Web, commercial and free. Collectively, academic libraries are now having to deal with a variety of data resources in the form of not only traditional books, journals, and bibliographic data but also all other forms of data such as textual documents, metadata, images, audio and video files, research data, software, and 3D collections. This means library involvement with big data (Garoufallou & Gaitanou, 2021).

The source of unstructured data is mainly the content of library websites and social media sites, such as Facebook and Twitter. As reported by Gardner et al. (2008), websites are increasingly used by academic libraries to promote their services and resources to users-researchers. These websites contain unstructured content on matters such as library collections (collection management, scientific communication, collection policy, links to commercial databases) and library services (e.g., service for unregistered users, information about ILL, circulation, reference, and reservations). Gardner et al. (2008, 19) distinguished such categories of content posted on library websites as those about the library, current problems (including scientific communication, open access, digital library, institutional repository), collections (electronic collections, catalogues/databases, citation management software, new acquisitions, special and multimedia collections), services (loans, acquisition, research support, teaching support), and contact details. Pareek and Gupta (2012) indicated such information about the services of Indian libraries as loans, Internet access, information services, reprographic services, and bibliographic services. Manjunatha (2016), while examining the websites of eight libraries, distinguished such categories of information as general information about the library, information about the library's collections, information about library services, the availability of electronic resources, and links.

According to Ball (2019), there are three ways to apply big data in academic libraries, namely: data sources, data analysis (library analysis), and data visualization. These uses of big data by librarians can lead to overall improvement of library services based on a better understanding of the needs of users and their information seeking behavior (Provost & Fawcett, 2013). Planning services based on this kind of knowledge should result in increased efficiency of library services, greater resistance to hacker attacks, and improved search capabilities (Hoy, 2014; Tella & Kadri, 2021; Zhan & Widén, 2019). In this approach, the library is a place where it is possible to use, store, and organize data, including big data.

Our idea is to use these (unstructured) data for analysis that goes beyond traditional library activities. It concerns the study of phenomena that are more general than those directly related to the information activity of libraries. In this case, library data are used to detect phenomena occurring outside the library, as in the above-referenced bibliographic data science, where linguistic and literary phenomena are studied using metadata analysis. Further on in the study, we will present an example of such research dealing with the state of datafication of science in Poland on the basis of library data.

Library RDS maturity

In this study, unstructured big data from the libraries were used to study the maturity of information systems with particular reference to RDS. Maturity assessment is a frequently implemented approach to determine the level of sophistication of a process, service, or a product (Kouper et al., 2017, 159). The concept of maturity is usually construed very broadly as something fully developed or perfect (Cooke-Davies, 2004, 1234). Maturity models have been developed that describe the improvement process and set the directions for development, including the criteria and indicators defining the existing and desired state. The ideas of Total Quality Management (TQM) are often indicated as their source, which is confirmed by the early work of Crosby (1979), who presented the Quality Management Maturity Grid, containing five levels of maturity moving from ad hoc tasks to a fully implementable and systematic approach.

Since the end of the last century, maturity models have been applied to evaluate the development of software systems. One of the early, better-known examples is the Capability Maturity Model for Software (CMM-SW), which was developed in the 1990s at Carnegie Mellon University to facilitate the US government's procurement of appropriate software (Yang et al., 2016). The purpose of this model was to evaluate software processes in order to move organizations away from a chaotic and unplanned development process towards a more disciplined and optimized one. The creators of the model sought to distinguish between immature and mature programming organizations by arguing that the former was focused on reacting and solving currently emerging problems on an ongoing basis, while the latter was based on sound management techniques.

This model has been used in work aimed at assessing and improving RDM activities in research projects (Qin et al., 2015; Qin et al., 2017). Following the earlier model of Humphrey's (1989) CMM (Capability Maturity Model), they presented five levels of maturity in relation to RDM:

• RDM entry level, based on the competences of individuals and their individual efforts, making the work unreliable;
• The management level, based on procedures and policies individually developed for each project, which makes it difficult to adopt solutions between projects;
• Defined level, characterized by accepted and repeatable procedures that can be used in several projects;
• Quantitatively manageable level that allows the use of measures that facilitate the evaluation of processes and progress;
• Optimal level at which weaknesses and inefficiencies are identified and then actively removed.

The maturity levels are determined in the model for five main processes and areas of practice: 1) data management in general; 2) data collection, processing, and quality assurance; 3) description and representation of data; 4) data dissemination; 5) repository services and preservation. For each area of practice, there are also pertinent reference values for its evaluation.

A more empirical approach was proposed by Kouper et al. (2017), who presented only three levels of maturity: basic (creating foundations), intermediate (organization and standardization) and advanced (monitoring and optimization), but as many as eight areas of research: 1) leadership (vision, strategy, culture); 2) services; 3) users and stakeholders; 4) supporting the research life cycle; 5) management; 6) costs and budgeting; 7) cooperation between units; 8) human capital. The model creates a matrix with areas in the rows and maturity levels in the
M. Nahotko et al. The Journal of Academic Librarianship 49 (2023) 102646
columns, and it has two goals: identifying weaknesses and setting priorities, and functioning as a communication tool with the library administration. According to these authors, mature RDS programs offer activities best suited to the mission of the library and its parent institution.
Cox et al. in 2017 proposed a four-tier RDM maturity model, which was modified in 2019. In this new version of the model, more attention was paid to activities such as data analysis and visualization, as well as their integration (Table 1).
These levels are categorized according to the existence or absence of services and support, compliance, skills, roles and structures, practices, and cultural acceptance.
Methodology of RDS maturity investigations
Previous studies of library electronic resources conducted outside of libraries were concerned with collected structured data, mainly descriptive metadata (Lahti et al., 2019) or unstructured data, such as digitized texts of publications from library collections (Leetaru, 2015). The study of the content of library websites was of an auxiliary nature, as a supplement to methods such as surveys and interviews (Singh et al., 2022, 3). So far, there has been no research on the websites of academic libraries (unstructured data) as a source of information on trends in RDS. This approach is represented in the studies presented in this section and is based on the hypothesis that many phenomena that function in science (including research data processes, such as RDM) are reflected in the activities of academic libraries via their websites.
Due to the relationship between data librarianship and data science, it is possible to study the level of datafication of scientific organizations with the use of big data methods. Data librarianship focuses on creating new library services (RDS) and transforming existing scientific consulting services based on new ways of managing and curating digital research data. Therefore, given the big data methods, especially in combination with NLP tasks (Gudivada et al., 2015) used in data librarianship, it is possible to determine the degree of maturity of these activities, which may help improve the practical effects of data librarianship. Therefore, this is an area that deals with the results of the work of data librarians and big data scientists.
Applying data librarianship to academic library data allows the conduct of science mapping analysis, which is a set of methods and techniques designed for creating science maps (Petrovich, 2021). These methods help to find answers to questions such as:
• What is the level of library RDS maturity and how is it different in terms of the type of library?
• In which disciplines is RDM better and in which less developed?
• In which research centers are research data issues of interest?
• What are the relationships between respective research data problems?
The answers to these questions help in understanding the ways in which the structural units of science are interrelated at different levels (Leydesdorff, 1987).
Data sources: web pages of academic libraries (unstructured data)
The investigative team opted to include the websites of all academic libraries in Poland in the first phase of the study. For this purpose, the existing list of web addresses of these libraries available on the librarians' portal called EBIB (http://www.ebib.pl/biblioteki/baza/) was used. The list is constantly updated, which increases its pertinence. It was necessary to analyze the source code of the EBIB web page, as it was a frame structure. Thanks to this, access to the html source code was obtained, which made it possible to save the list of libraries as an html file. The elements that originally facilitated the use of the list on the EBIB website, but made it difficult to perform web scraping, have been removed from this list.
From this list, the web addresses of 320 academic libraries were taken. Then, those that offer RDS and provide information on this topic on their websites had to be selected.
Data analysis: web scraping
In the next step, based on the list of addresses taken from EBIB, websites were selected based on their content using the web scraping technique. Web scraping is automated data extraction from the Internet whereby it is possible to obtain structured data as a result of unstructured textual data transformation (see Arbia & Nardelli, 2020; Lunn et al., 2020; Regueira et al., 2020; Uzun, 2020, 61726). As explained by Yan (2020) and Dongo et al. (2020), as well as others, web scraping is particularly suitable for processing large data sets because of the automation of repetitive activities, which allows for quick and accurate data acquisition from the Internet.
The web scraping method was used in the study to obtain a list of web addresses of Polish academic libraries from the publicly available EBIB list and to search the obtained web addresses for the presence of keywords in the content of their source code. A list of 72 keywords containing terms related to RDM issues was prepared (see Annex 1). This list is based on the expertise of the members of the research team. These keywords have been prepared in such a way as to take into account the inflection of the Polish language. This list enabled the selection of web addresses for further evaluation using big data tools for data analysis and visualization.
The adopted web scraping procedure consisted of several stages and included analysis of the source code of the web pages, searching the pages with their internal subpages according to the list of keywords, and downloading and saving the results to files in CSV format. The script code for web scraping was written in the Google Colab environment using the Python programming language and its selected libraries, i.e., BeautifulSoup,¹ requests² and urllib.³ In addition, libraries for data analysis and processing were also used: CSV⁴ and pandas⁵ (for saving data to result files) and re⁶ (for natural language processing in terms of the inflection of keywords in accordance with the rules applied in the Polish language). The resulting CSV files were then processed into MS Excel spreadsheets for later analysis.
¹ https://pypi.org/project/beautifulsoup4/
² https://docs.python-requests.org/en/latest/
³ https://docs.python.org/pl/3.8/library/urllib.html
⁴ https://docs.python.org/3/library/csv.html
⁵ https://pandas.pydata.org/docs/
⁶ https://docs.python.org/3/library/re.html
The web scraping procedure with the use of a prepared list of keywords was performed twice, in mid-2021 and in 2022, which made it possible to determine the dynamics of the changes in the number and content of selected websites. The need to repeat the research resulted from the belief that the situation in the library RDS area in Poland is very dynamic. This way of operating allowed us to capture these dynamics. In 2021, 53 library websites were found where keywords from the list were detected, but further analysis resulted in the rejection of a part of them, so only 38 libraries remained. This resulted from the unsuitability of the content of some of the library websites, where, for example, there was only a link from the branch library page to the main library's RDM website. In this case, only the main library website content was included, and the subpage content was discarded as unsuitable for further analysis. In 2022, there were 42 such libraries; six new libraries appeared, including libraries from two large classical universities and two medical universities. All websites were extracted as a result of them being full-text searched (including their internal subpages) in terms of the incidence of words from the original list of keywords in their source code. Of the 72 keywords on the list, returns
Table 1
Cox et al. (2019) model of evolving maturity of the RDM
Levels Skills Activities
were obtained for 26 keywords, which means that 46 keywords from the list were not found on the library websites. The text content of the library websites selected in both years was then analyzed using big data tools.
Only RDM-related web content was submitted for further analysis. For this purpose, the content of the websites was reformatted to plain text. The scraped websites varied greatly in size. After being translated into English, which was necessary due to the requirements of the analytical tools applied (see the part entitled Data visualization), they ranged from 48 to almost 22,000 characters without spaces (5010 characters on average). The total number of characters in the sample was 185,371. The total number of words was 33,289.
The translation of the content of the selected websites into English was carried out in two steps. First, the text was translated using Google Translate. Then the result was analyzed by researchers in order to remove any errors that hindered further analysis using NLP tools. There were not too many translation errors, although the principal focus rested on correcting the actual mapping of terms (what Google Translate is useful for), whereas any stylistic deficiencies were of considerably lesser significance for the task at hand.
Data visualization
The previous stages of work allowed the isolation of selected keywords and marking their popularity on the websites of respective libraries, considering changes in the degree of maturity of their RDS over time. Before doing a deeper analysis, however, we tried to find answers to what these keywords could indicate about the geographical and institutional distribution of the phenomena they represented. The presentation of all locations obtained with the web scraping technique on the map allowed us to go beyond single keywords, websites, and services, and to view all this information resource and discover what it said about the part of the world under study.
Fig. 2 shows, on a map of Poland, the spatial distribution of academic libraries selected by web scraping. The Tableau Desktop tool was used, which allows for the creation of interactive visualizations. A bar graph was used for each academic center, indicating the ID count values. The ID count takes into account the diversity of the intensity and maturity of RDS, which indicates the “density” of research data issues on the websites of the selected libraries. The ID count is the indicator of the “saturation” of the website with relevant keywords, taking into account the depth of the hierarchical structure of the websites. For each library,
in a given year, all combinations of pages and keywords in the results are counted. The ID count can then be represented as the product of two values: the keyword hits and the number of pages where the keywords were found. More precisely: ID count ∈ <MAX(keywords hit, pages hit); keywords hit × (pages hit + 1)>. At a minimum, the ID count takes the greater of the two values, keywords hit and pages hit.
Fig. 2 indicates that the distribution of libraries offering RDS and the intensity of the materials they offered did not change rapidly. The places where these services were offered were the largest academic centers in Poland (Warsaw, Kraków, Wrocław, Poznań, Gdańsk, Lublin), as evidenced by the comparison of the bar graphs in Fig. 2, showing the ID count for each academic center in the two years, with the pie charts showing the size of the academic center. This size is determined by the number of universities operating in a specific geographic area. A dot instead of a circle means the number of universities below four. Subsequent libraries that offered new websites for research data were usually located in the same places (academic centers), strengthening them in terms of RDS offered to users. It is similar in other countries, e.g., in the USA, as indicated by research by Tenopir et al. (2015). However, in Poland, in smaller academic centers, there were also library RDS, with a correlation coefficient between the size of the academic center and that of the RDS being 0.52 (moderate correlation).
Fig. 3 indicates the distribution of libraries according to their ID count. We have organized library websites along the axis from large to small in terms of their volume and keywords retrieved. The large services, containing tens of thousands of words and many keywords, are on the left-hand side of the axis, with smaller web sites sorted by decreasing size trailing off to the right. The main area under the right-hand side of the curve is the long tail of library services. These services are more difficult to find and therefore less frequently used. The long tail metaphor comes from commercial activities and serves to represent the distribution of popularity of electronic information objects (Anderson, 2004). This power law distribution is also used to describe scientific activity by size of project and scale of data production (Borgman et al., 2016). This is because the long tail can be generalized into all technology-induced areas, and science certainly belongs in there.
The situation described in Fig. 3 is also in line with the Pareto 20/80 principle, as approximately 20 % of libraries offer users 80 % of resources on research data. However, these small local sites (located in the tail) can have a significant impact on the maturity level of RDS as they offer services that target the local needs of the researchers they serve. These services may, for example, include the provision of so-called dark data (Heidorn, 2008). On the other hand, attention should be paid to the frequent need to improve the maturity of the services in the tail. In general, it can be said that the development trend of the library RDS is correct: apart from a few large services that support big science, many smaller ones for small science are created.
Fig. 4 indicates averaged ID count values with a breakdown into libraries of various types of universities. They show the average size of the research data websites, which directly indicated their maturity and their usefulness for users. Lower values in 2022 for universities (classical - 14/16 libraries) and technical universities (13 libraries in both years) might have resulted from changes in these services. In 2021, this was content scattered across many different library websites, where RDM pages were often combined with open access pages or other materials. From 2022, libraries created more compact websites, separate from other content, usually consisting of a subpage devoted only to research data management services. The very high results for economic/business universities were due to their small number, with only two libraries from this group offering websites for research data, and one of them was very extensive.
Using analytical tools, attempts were also made to analyze the content of library sites in the direction of their maturity and usefulness to library patrons. We used free tools available on the Internet, the use of which turned out to be very intuitive. The results of these analyses, aided by the VOSviewer program, are presented in Figs. 5 and 6. The first visualizes the co-occurrence matrix for the selected term (“data management”), which allows for an intuitive presentation of terms related to research data. To achieve this effect, it was necessary to prepare the so-called corpus file. A corpus file is a text file that contains on each line the text of a document. The text of a document must be in English, since the natural language processing algorithms used by VOSviewer do not support other languages. The size of the nodes in the network presented in Fig. 5 indicates the frequency of phrases. The thickness of the lines is proportional to the proximity of the phrases.
Using the VOSviewer program, three clusters (color-coded in Fig. 5) were selected from the content of the research data websites, containing a total of 70 phrases, the so-called “items”, connected together within a cluster and between clusters. Items are represented by their labels and by default also by a bullet. The size of the label and the bullet of an item is determined by the weight of the item. The first of these clusters
[Fig. 4: bar chart of the average ID count value (vertical axis, 0 to 80) by type of university (classical universities, technical universities, economic/business universities, agriculture universities, medical universities, universities of sport), for 2021 and 2022.]
marked in red refers to the creation of research data and their use in research (knowledge-generating activities). The second, in green, indicates the materialization of data in various forms, and the last, in blue, represents the tools used in RDM and research data applications.
To obtain more information about selected phrases, a graph of visualization of the density of phrases (keywords) was created (Fig. 6). In this visualization, each point on the map has a color that depends on the phrase density at that point. By default, this color goes from blue to green to yellow. The greater the number of phrases next to each other in the point and the higher the weight of these phrases, the yellower it
becomes. Conversely, the smaller the number of phrases in the vicinity of the point and the lower their weight, the closer to blue is the color. The map in Fig. 6 shows that the phrases “data”, “research data”, “repository”, and “plan” have a higher density than others, which means that these phrases have strong relationships with others (Chen et al., 2016). The others have a similar density of occurrences.
Another way to present the most important issues raised on the library websites is to present their content in the form of a data cloud (Fig. 7), which was obtained with the help of the Voyant Tools program that provides multiple text mining functions (Gregory et al., 2022). This is a web tool that can analyze a single text document or several documents combined into one body. It allowed us to extract words repeated in the text with different frequencies and display them in different visualizations. For clarity of the image, a stop-word list was used that contains words that should be excluded from the results. Stop-word lists contain so-called function words that do not carry as much meaning, such as determiners and prepositions. The word cloud positions the words in the text so that the terms that occur the most frequently are positioned centrally and sized the largest. The color of words and their absolute position are not significant. Automatic text analysis allows for semi-automatic determination of its subject area, which in this case was the use of RDS in the libraries under study. Voyant Tools can also help validate the results of quantitative text analysis. The word cloud distribution indicated that words such as “research”, “data”, “management”, “open”, “metadata”, “project”, “use”, “sharing”, “access” appeared frequently, thus pointing to their importance.
Visualization is the greatest advantage of Voyant Tools, with over 25 different visualization formats to understand and explain the data contained in the text corpus. The most useful tools for the authors of this article were Cirrus, Summary, Trends, Reader, and Context.
The TermsBerry tool is also useful in the analysis, as it allows us to group the most frequent words in the form of a cluster, as well as to observe the co-occurrence relationship between them. This tool provides a way of exploring high-frequency terms and their collocates (words that occur in proximity). The highest-frequency words appear in the middle and in larger bubbles. The user can hover the cursor over the selected word, which turns green (Fig. 8) while all words in relation to it turn pink. Each bubble indicates the collocate frequency for that word. The darker the pink, the more often the word is used. The number of occurrences common to the selected word is indicated below the related word. As shown in Fig. 8, the words “management”, “repository/repositories” as well as “plan”, “open”, and “sharing” were in the closest relation to the keyword “data”, the most frequent in the set.
The most important issues presented in the tag cloud confirmed the topics previously identified in the clustering process. After using a stop-word list, removing punctuation, and stemming, the five most common words in the corpus were “data” (1902), “research” (915), “repository/repositories” (381), “management” (293), and “open” (263). These are general terms but allowed us to define the subject area of the web resource. There were also less frequent, but important, terms such as “plan” (193), “scientific” (170), “metadata” (163), “sharing” (145), “access” (130), and “DMP” (91). The main issues were RDM and DMP as well as repositories and metadata for data present in the educational and training materials on the library websites.
Using the same tool, an attempt was made to estimate the maturity level of Polish academic libraries. For this purpose, the RDS maturity model by Cox et al. (2019) was used (Table 1). This model provides three clusters of expressions that characterize successive levels of RDS
maturity. Compliance for cluster 1 means library responses to main drivers, such as funders and the continuation of existing practices. It is generally characterized by a formal policy coupled with advisory services. Stewardship for cluster 2 means continuation of traditional roles of libraries, but in relation to data repositories and journals and not to books and other publications. It is associated with the creation of a repository and associated services. Transformation (cluster 3) means deep transformation of library services to support high-level analytical activities. For each cluster, marked by a different color, the frequency of expressions (words or phrases) on the library web pages and the average of the occurrences were determined. Expressions related to the scope of the activities for each cluster, referenced in the model, were searched. The results obtained are visualized in Fig. 9.
The expressions in cluster two had the highest frequency on the library websites, whereas those in cluster three appeared with the lowest (minimum) frequency. Therefore, our task was to determine the expressions appearing on the tested websites of libraries. Their distribution at individual maturity level was determined by the adopted RDS maturity model (Cox et al., 2019). The level of maturity of the RDS surveyed (in this case, Polish academic libraries) is determined by the frequency of occurrence of a particular expression on library websites. Higher frequency of expressions in a given cluster indicates that a specific level of maturity has been reached. In this research process, co-occurrence and other relations (visualized earlier) also were taken under consideration. In this way, the data obtained with the aid of web scraping and other NLP techniques were linked to the maturity model created by other methods.
Results and discussion
Libraries and other similar institutions offer a very large volume of unstructured data (Zeng, 2019) and datafication means processing them not only into machine-readable materials (as in digitization), but also into machine-processable materials. They are the most varied in type, nature, and quality and also pose the most problems during processing. This study set out to analyze text documents available on library websites related to RDM. Under the conditions of textual processing of information, text matching was the main means of full text search for a long time. Recently, content-based search has been used in relation to text semantics. Instead of relying on full text searching, methods such as semantic-based analysis, extraction, mining, and tagging are used to facilitate the discovery of information from unstructured data.
The data collected from the websites of academic libraries using the web scraping technique and then analyzed with the aid of several NLP tools allowed for the determination of the RDS maturity level of these libraries and, thus, their ability to meet the information needs of their users in the field of research data services. The analysis of the website content of Polish academic libraries showed that the maturity of the RDS offered by these libraries was at a good or average level, in line with the Cox et al. (2019) maturity model. The main issues raised on the websites concerned research data repositories, DMP creation, data storage, and metadata. On the other hand, there was almost no interest in new skills, placed in the third cluster of the model, such as visualization, integration, or data analysis, which, in the opinion of Cox et al. (2019), demonstrate the high maturity level of RDS. These services were best developed in
large academic centers with the highest number of large universities. However, this is not a rule of thumb, as RDM initiatives are also being launched in smaller academic centers. It is also worth noting that, using web scraping, it was possible to select only about 14 % of libraries from the total number of these institutions taken into account. Other libraries do not provide any information on their websites regarding RDM services issues or at least web scraping was unable to locate them.
The level of RDM maturity is also related to the type of institution. Research has shown that RDM processes are most developed in university (classical) and technical university libraries. This confirms the hypothesis adopted about the correlation of the activity of libraries (visible on their web sites) with the activity of home universities in the field of research data, as RDM seems to play the most important role in research conducted at these types of universities. This is because the former include the largest universities with large libraries and extensive faculty, while the latter are the most technologically advanced. Numerous experimental studies are conducted on both. On the other hand, the lesser interest in RDM in libraries of medical and agriculture universities is disturbing, where the services on the library websites were often limited to placing a link to a national subject repository for medical sciences called the Polish Platform of Medical Research (PPM), which, incidentally, is an example of a well-functioning community cooperation.
Various types of data are used in the library RDS. Bibliographic data science uses metadata that should be regarded as structured data. In the study, we indicated the potential for analyzing the content of Web documents placed on library websites, which is closer to what is described as bibliomining, or data mining for libraries (Nicholson, 2003). Therefore, it can be concluded that, unlike structured data used in bibliographic data science, in this article we deal with unstructured data. Based on the big data methods described in the article, the integration process occurred, allowing the transition from unstructured data to structured data, whereby such data become ‘smarter’ (Simović, 2018; Zeng, 2019) and at the same time, data can be affected by the phenomenon of datafication. Big data technology has shown unique advantages in processing and analyzing unstructured data (Kitchen & McArdle, 2016). Information made available by the libraries may be regarded as ‘small data’, but following their aggregation they become big data. During these processes, the volume of big data is of lesser importance when compared to their value, as there is the possibility of arriving at important findings from such data on any scale, large or small (Schramm & Shafaghi, 2020). Such methodology differs significantly from the methods used so far in research on the maturity of RDS (e.g., Cox et al., 2017; Cox et al., 2019; Kim, 2021), which mainly used sociological methods, such as interviews, surveys, and/or observations. If web analysis was applied (e.g., Kouper et al., 2017; Singh et al., 2022), it was cast merely in an adjunct role.
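The transition from unstructured page text to structured records discussed above can be illustrated with a minimal sketch. It is not the authors' script: the three keyword stems are hypothetical stand-ins for the 72-keyword list in Annex 1, the page texts are assumed to be already downloaded and converted to plain text, and the reading of the ID count as the number of distinct page-keyword combinations is an assumption drawn from the description in the Data visualization part.

```python
import re

# Hypothetical keyword stems (not the study's list from Annex 1).
# The trailing \w* lets one pattern cover Polish inflected forms, e.g.
# "dane badawcze", "danych badawczych", "danymi badawczymi".
KEYWORD_PATTERNS = {
    "dane badawcze": re.compile(r"\bdan\w*\s+badawcz\w*", re.IGNORECASE),
    "repozytorium": re.compile(r"\brepozytori\w*", re.IGNORECASE),
    "metadane": re.compile(r"\bmetadan\w*", re.IGNORECASE),
}


def find_keyword_hits(pages):
    """Turn {url: plain text} into sorted, structured (url, keyword) records."""
    records = set()
    for url, text in pages.items():
        for keyword, pattern in KEYWORD_PATTERNS.items():
            if pattern.search(text):
                records.add((url, keyword))
    return sorted(records)


def summarize(records):
    """Count distinct keywords, distinct pages, and page-keyword combinations."""
    return {
        "keywords_hit": len({keyword for _, keyword in records}),
        "pages_hit": len({url for url, _ in records}),
        # Assumption: the ID count is the number of distinct combinations.
        "id_count": len(records),
    }
```

Each record could then be written out as one CSV row per library, page, and keyword, which is the kind of structured result the study describes. Matching on stems with a regular expression is one simple way to tolerate inflection; a morphological analyzer would be more precise, at the cost of an extra dependency.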
Fig. 9. Frequency of RDS expression clusters. Yellow: cluster 1 - Compliance: Translation of existing skills. Red: cluster 2 - Stewardship: Reskilling of existing staff.
Blue: cluster 3 - Transformation: Acquisition of new skills. Vertical lines: average frequency in the cluster. (For interpretation of the references to color in this figure
legend, the reader is referred to the web version of this article.)
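The per-cluster frequencies and averages summarized in Fig. 9 can be approximated with a short sketch. It is only an illustration under stated assumptions: the expression lists below are hypothetical examples taken from terms mentioned in the text, not the full lists of the Cox et al. (2019) model, and the authors obtained their counts with Voyant Tools rather than with code of this kind.

```python
import re
from statistics import mean

# Illustrative expressions only (assumption); the model's full lists are in Table 1.
CLUSTERS = {
    "compliance": ["policy", "advisory"],
    "stewardship": ["repository", "metadata"],
    "transformation": ["visualization", "data analysis"],
}


def cluster_frequencies(text, clusters=CLUSTERS):
    """Count whole-word occurrences of each expression and average them per cluster."""
    lowered = text.lower()
    result = {}
    for name, expressions in clusters.items():
        counts = {
            expression: len(re.findall(r"\b" + re.escape(expression) + r"\b", lowered))
            for expression in expressions
        }
        result[name] = {"counts": counts, "average": mean(list(counts.values()))}
    return result
```

Comparing the `average` values across clusters corresponds to reading the vertical lines in the figure: the cluster with the highest average is the maturity level best represented in the corpus.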
Conclusions

The findings from this research indicate the value of using big data techniques in RDS maturity testing, such as web scraping and analytical tools supporting NLP processes, including tools available as open source. Continuing the research with these methods would make it possible to determine the actual dynamics and geography of changes in library RDS. Extending the research to other countries would allow benchmarking of the level of maturity of the service and, thus, of its suitability to meet the needs of information users. At the same time, the overall usefulness of the Cox et al. (2019) maturity model in such studies was corroborated.
The study findings indicated the lack of the most mature type of library activities, thereby pointing to the necessity of implementing significant changes in the organization of work. This shows that libraries tend to treat activities in the RDS area conservatively, trying to adapt them to their traditional functions. The libraries were inclined to create the types of services that corresponded to their existing activities (training, open access) and to the existing skill set of the librarians (repository, selection, access, curation, metadata). As a result, RDM was often interpreted through the “traditional” roles of libraries. To bring about truly meaningful change to this organizational paradigm, librarians will be required to boost their analytical expertise in big data techniques as part of their current educational programs.
The study was limited by being confined to a single country and only one type of library (academic), so this factor should be taken into account in any subsequent studies by extending the geographic scope and the type of libraries put under research scrutiny, as well as by examining the actual dynamics of change over time. Other GLAM institutions may also be considered. However, despite these shortcomings, the present study points to the value of using big data tools and procedures in this type of analysis. Thanks to their application, a clear picture of RDS in Polish libraries was obtained, in particular, the level of their maturity and that of the management of research data in science. Understanding the geographic location and organizational diversity of RDS should help in learning more about the advantages and disadvantages of considering each one of these variables. It also assisted in identifying the problems faced by data librarians in attempting to offer their users mature information services.

CRediT authorship contribution statement

Marek Nahotko: Conceptualization, Methodology, Writing – Original draft, Writing – Review & editing, Funding acquisition.
Magdalena Zych: Software, Formal analysis, Writing.
Aneta Januszko-Szakiel: Writing, Validation.
Małgorzata Jaskowska: Investigation, Verification.

Funder

Jagiellonian University, grant no: ID.UJ DigiWorld 2021/1

Data availability

Data will be made available on request.
Research data set for publication "Big data-driven investigation of the maturity of library research data services (RDS)" (Original data) (Jagiellonian University Repository)

In color are the terms for which hits were obtained during web scraping
References

Ahmed, W., & Ameen, K. (2017). Defining big data and measuring its associated trends in the field of information and library management. Library Hi Tech News, 34(9), 21–24.
Al-Jaradat, O. (2021). Research data management (RDM) in Jordanian public university libraries: Present status, challenges and future perspectives. Journal of Academic Librarianship, 47(5), Article 102378.
Amirian, P., van Loggerenberg, F., & Lang, T. (2017). Data science and analytics. In P. Amirian, T. Lang, & F. van Loggerenberg (Eds.), Big data in healthcare (pp. 15–37). Cham: Springer. Springer Briefs in Pharmaceutical Science & Drug Development.
Anderson, C. (2004). The long tail. Wired Magazine, 12(10). https://www.wired.com/2004/10/tail/
Arbia, G., & Nardelli, V. (2020). On spatial lag models estimated using crowdsourcing, web-scraping or other unconventionally collected data. Retrieved from https://arxiv.org/abs/2010.05287 Accessed September 12, 2022.
Arzberger, P., et al. (2004). Promoting access to public research data for science, economic, and social development. Data Science Journal, 3, 135–152.
Ball, R. (2019). Big data and their impact on libraries. American Journal of Information Science and Technology, 3(1), 1–9.
Baškarada, S., & Koronios, A. (2017). Unicorn data scientist: The rarest of breeds. Program, 51(1), 65–74.
Berman, E. (2017). An exploratory sequential mixed methods approach to understanding researchers' data management practices at UVM: Integrated findings to develop research data services. Journal of eScience Librarianship, 6(1), Article e1104.
Big data. (2022). Oxford English Dictionary. Retrieved from https://www.oed.com/view/Entry/18833#eid301162177 Accessed September 6, 2022.
Borgman, C., et al. (2016). Data management in the long tail: Science, software and service. International Journal of Digital Curation, 11(1), 128–149.
Brooks, D. (2013). The philosophy of data. The New York Times. Retrieved from https://www.nytimes.com/2013/02/05/opinion/brooks-the-philosophy-of-data.html Accessed September 8, 2022.
Bryant, R., Lavoie, B., & Malpas, C. (2017). A tour of the research data management (RDM) service space. The realities of research data management. Dublin, OH: OCLC Research.
Chen, X., et al. (2016). Mapping the research trends by co-word analysis based on keywords from funded project. Procedia Computer Science, 91, 547–555.
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21–26.
Cooke-Davies, T. (2004). Project management maturity models. In P. Morris, & J. Pinto (Eds.), The Wiley guide to managing projects (pp. 1234–1255). Hoboken: Wiley.
Cox, A., Kennan, M., Lyon, L., et al. (2019). Maturing research data services and the transformation of academic libraries. Journal of Documentation, 75(6), 1432–1462.
Cox, A., et al. (2017). Development in research data management in academic libraries: Towards an understanding of research data service maturity. Journal of the Association for Information Science and Technology, 68(9), 2182–2200.
Cox, A., & Pinfield, S. (2014). Research data management and libraries: Current activities and future priorities. Journal of Librarianship and Information Science, 46(4), 299–316.
Crosby, P. (1979). Quality is free. New York: McGraw-Hill.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(5), 70–76.
Delserone, L. (2008). At the watershed: Preparing for research data management and stewardship at the University of Minnesota Libraries. Library Trends, 57(2), 202–210.
Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.
Diggle, P. J. (2015). Statistics: A data science for the 21st century. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(4), 793–813.
Van Dijk, J. (2020). The network society. London: SAGE Publ.
Dongo, I., et al. (2020). Web scraping versus Twitter API: A comparison for a credibility analysis. In Proceedings of the 22nd international conference on information integration and web-based applications & services (pp. 263–273). https://doi.org/10.1145/3428757.3429104
Dumbill, E. (2013). Making sense of big data. Big Data, 1(1), 1–2.
Feger, S., et al. (2020). ‘Yes, I comply!’: Motivations and practices around research data management and reuse across scientific fields. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), Article 141. New York: ACM.
Foreman, J. W. (2013). Data smart: Using data science to transform information into insight. Hoboken, NJ: John Wiley & Sons.
Gardner, S., Juricek, J., & Xu, G. (2008). An analysis of academic library web pages for faculty. The Journal of Academic Librarianship, 34(1), 16–24.
Garoufallou, E., & Gaitanou, P. (2021). Big data: Opportunities and challenges in libraries, a systematic literature review. College & Research Libraries, 82(3), 410–435.
Goczyła, K. (2021). Udanawianie wszystkiego. Pismo PG, 28(3), 79–80. Retrieved from https://pg.edu.pl/documents/1152961/104233222/202103.pdf Accessed October 12, 2022.
Granville, V. (2014). Developing analytic talent: Becoming a data scientist. Hoboken, NJ: John Wiley and Sons.
Gregory, K., Geiger, L., & Salisbury, P. (2022). Voyant tools and descriptive metadata: A case study in how automation can compliment expertise knowledge. Journal of Library Metadata, 22(1–2), 1–16.
Gudivada, V., Rao, D., & Raghavan, V. (2015). Big data driven natural language processing research and applications. In V. Govindaraju, V. Raghavan, & C. Rao (Eds.), Handbook of statistics (Vol. 33, pp. 203–238). Amsterdam, Oxford: Elsevier.
Harari, Y. N. (2017). Homo deus: A brief history of tomorrow. London: Vintage.
Hayashi, C. (1998). Preface. In C. Hayashi (Ed.), Data science, classification, and related methods (pp. V–VII). Tokyo: Springer Verl. Proceedings of the Fifth Conference of the International Federation of Classification Societies (IFCS-96), Kobe, Japan, March 27–30, 1996.
Heidorn, B. (2008). Shedding light on the dark data in the long tail of science. Library Trends, 57(2), 280–299.
Hoy, M. (2014). Big data: An introduction for librarians. Medical Reference Services Quarterly, 33(3), 320–326.
Humphrey, W. (1989). Managing the software process. Reading, MA: Addison-Wesley.
Iwasiński, Ł. (2020). Theoretical bases of critical data studies. ZIN Information Studies, 58(1A), 96–109.
Karasti, H., Baker, K., & Halkola, E. (2006). Enriching the notion of data curation in e-science: Data managing and information infrastructuring in the long term ecological research (LTER) network. Computer Supported Cooperative Work, 15(4), 321–358.
Kelleher, J. D., & Tierney, B. (2018). Data science. Cambridge, MA: MIT Press.
Kim, J. (2021). Determining research data services maturity: The role of library leadership and stakeholder involvement. Library and Information Science Research, 43(2), Article 101092.
Kim, S., & Syn, S. (2021). Practical considerations for a library’s research data management services: The case of the National Institutes of Health library. Journal of the Medical Library Association, 109(3), 450–458.
Kitchin, R., & McArdle, G. (2016). What makes big data, big data? Exploring the ontological characteristics of 26 datasets. Big Data and Society, 3(1), 1–10.
Koltay, T. (2017). Data literacy for researchers and data librarians. Journal of Librarianship and Information Science, 49(1), 3–14.
Kouper, I., Fear, K., Ishida, M., et al. (2017). Research data services maturity in academic libraries. In L. Johnston (Ed.), Curating research data: Vol. 1. Practical strategies for your digital repository (pp. 153–170). Chicago: ACRL.
Lahti, L., Marjanen, J., Roivainen, H., & Tolonen, M. (2019). Bibliographic data science and the history of the book (c. 1500–1800). Cataloging & Classification Quarterly, 57(1), 5–23.
Laney, D. (2001). 3-D data management: Controlling data volume, velocity and variety. META Group Research Note. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf Accessed September 22, 2022.
Laskowski, C. (2021). Structuring better services for unstructured data: Academic libraries are key to an ethical research data future with big data. Journal of Academic Librarianship, 47(4), Article 102335.
Leetaru, K. (2015). Mining libraries: Lessons learned from 20 years of massive computing on the world’s information. Information Services & Use, 35(1–2), 31–50.
Lewis, M. (2010). Libraries and the management of research data. In S. McKnight (Ed.), Envisioning future academic library services (pp. 145–168). London: Facet Publ.
Leydesdorff, L. (1987). Various methods for the mapping of science. Scientometrics, 11(5/6), 295–324.
Li, J., Lu, M., Dou, G., & Wang, S. (2017). Big data application framework and its feasibility analysis in library. Information Discovery and Delivery, 45(4), 161–168.
Liu, S., & Shen, X. (2018). Library management and innovation in the big data era. Library Hi Tech, 36(3), 374–377.
Lunn, S., Zhu, J., & Ross, M. (2020). Utilizing web scraping and natural language processing to better inform pedagogical practice. In 2020 IEEE Frontiers in Education Conference (FIE), Uppsala, 21-24 Oct. 2020 (pp. 1–9). IEEE. https://doi.org/10.1109/FIE44824.2020.9274270 Accessed August 15, 2022.
Manjunatha, K. (2016). Content analysis of special library websites: An analytical study. International Journal of Next Generation Library and Technologies, 2(2), 1–9.
Manyika, J., et al. (2011). Big data: The next frontier for innovation, competition, and productivity. San Francisco, CA: McKinsey Global Institute.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data. The essential guide to work, life and learning in the age of insight. London: John Murray.
McCaffrey, M., & Giesbrecht, W. (2016). Teaching data librarianship to LIS students. In L. Kellam, & K. Thompson (Eds.), Databrarianship: The academic data librarian in theory and practice (pp. 355–373). Chicago: ACRL.
Naur, P. (1974). Concise survey of computer methods. Lund, Sweden: Petrocelli Books.
Nicholson, S. (2003). Bibliomining for automated collection development in a digital library setting: Using data mining to discover web-based scholarly research works. Journal of the American Society for Information Science and Technology, 54(12), 1081–1090.
Osika, G. (2021). Dilemmas of social life algorithmization – Technological proof of equity. Scientific papers of Silesian University of Technology. Organization and Management Series, 151, 525–538.
Pareek, S., & Gupta, D. (2012). Information about services and information resources on websites of selected libraries in Rajasthan: A study. DESIDOC Journal of Library & Information Technology, 32(6), 499–508.
Petrovich, E. (2021). Science mapping and science maps. Knowledge Organization, 48(7/8), 535–562.
Pinfield, S., Cox, A., & Smith, J. (2014). Research data management and libraries: Relationships, activities, drivers and influences. PLoS ONE, 9(12), Article e114734.
Provost, F., & Fawcett, T. (2013). Data science and its relationship to big data and data-driven decision making. Big Data, 1(1), 51–59.
Qin, J., Crowston, K., Flynn, C., et al. (2015). A capability maturity model for research data management. Syracuse, NY: School of Information Studies, Syracuse University.
Qin, J., Crowston, K., Flynn, C., & Kirkland, A. (2017). Pursuing best performance in research data management by using the capability maturity model and rubrics. Journal of eScience Librarianship, 6(2), Article e1113.
Ran, C., Yang, L., & Hu, L. (2021). Revisit the implementation status of research data management in Chinese academia. Journal of Academic Librarianship, 47(3), Article 102350.
Ratner, B. (2017). Statistical and machine-learning data mining: Techniques for better predictive modelling and analysis of big data. Boca Raton: Chapman and Hall/CRC Press.
Regueira, U., Alonso-Ferreiro, A., & Da-Vila, S. (2020). Women on YouTube: Representation and participation through the web scraping technique. Comunicar, 28(63), 31–40.
Reinhalter, L., & Wittman, R. (2014). The library: Big data's boomtown. Serials Librarian, 67(4), 363–372.
Sanchez-Pinto, L. N., Luo, Y., & Churpek, M. M. (2018). Big data and data science in critical care. Chest, 154(5), 1239–1248.
Schramm, M., & Shafaghi, M. (2020). Moving from big data to smart data for enhanced performance, business efficiency, and new business models. Journal of International Business and Management, 3(2), 1–17.
Schutt, R., & O’Neil, C. (2014). Doing data science: Straight talk from the frontline. Sebastopol, CA: O’Reilly Media.
Semeler, A., Pinto, A., & Rozados, H. (2019). Data science in data librarianship: Core competencies of a data librarian. Journal of Librarianship and Information Science, 51(3), 771–780.
Simović, A. (2018). A big data smart library recommender system for an educational institution. Library Hi Tech, 36(3), 498–523.
Singh, R., Bharti, S., & Madalli, D. (2022). Evaluation of research data management (RDM) services in academic libraries of India: A triangulation approach. Journal of Academic Librarianship, 48(6), Article 102586.
Song, I. Y., & Zhu, Y. (2016). Big data and data science: What should we teach? Expert Systems, 33(4), 364–373.
Stanton, J. M., et al. (2012). Interdisciplinary data science education. In Special issues in data management (pp. 97–113). Washington, DC: American Chemical Society (ACS Symposium Series, 1110).
Tang, R., & Hu, Z. (2019). Providing research data management (RDM) services in libraries: Preparedness, roles, challenges, and training for RDM practice. Data and Information Management, 3(2), 84–101.
Tella, A., & Kadri, K. (2021). Big data and academic libraries: Is it big for something or big for nothing? Library Hi Tech News, 38(2), 15–23.
Tenopir, C., Sandusky, R., Allard, S., & Birch, B. (2014). Research data management services in academic research libraries and perceptions of librarians. Library & Information Science Research, 36(2), 84–90.
Tenopir, C., et al. (2017). Research data services in European academic research libraries. LIBER Quarterly, 27(1), 23–44.
Tenopir, C., et al. (2015). Research data services in academic libraries: Data intensive roles for the future? Journal of eScience Librarianship, 4(2), Article e1085.
Tiwari, A., & Madalli, D. (2021). Maturity models in LIS study and practice. Library and Information Science Research, 43, Article 101069.
Tolonen, M., Marjanen, J., Roivainen, H., & Lahti, L. (2019). Scaling up bibliographic data science. In C. Navaretta, M. Agirrezabal, & B. Maegaard (Eds.), Digital humanities in the Nordic countries (pp. 450–456). Copenhagen: Univ. of Copenhagen. Proc. of the digital humanities in nordic countries 4th conference.
Tu, Y., Chang, S., & Hwang, G. (2021). Analysing reader behaviours in self-service library stations using a bibliomining approach. The Electronic Library, 39(1), 1–16.
Uhlir, P., & Schröder, P. (2007). Open data for global science. Data Science Journal, 6(special issue), 36–53.
Uzun, E. (2020). A novel web scraping approach using the additional information obtained from web pages. IEEE Access, 8, 61726–61740.
Verbaan, E., & Cox, A. (2014). Occupational sub-cultures, jurisdictional struggle and third space: Theorising professional service responses to research data management. Journal of Academic Librarianship, 40(3–4), 211–219.
Virkus, S., & Garoufallou, E. (2019). Data science from a perspective of computer science. In E. Garoufallou, F. Fallucchi, & W. De Luca (Eds.), Metadata and semantic research. 13th international conference, MTSR 2019. Cham: Springer Verl.
Voulgaris, Z. (2014). Data scientist: The definitive guide to becoming a data scientist. Westfield, NJ: Technics Publications.
Wainer, H. (2015). Truth or truthiness: Distinguishing fact from fiction by learning to think like a data scientist. Cambridge, MA: Cambridge University Press.
Whyte, A., & Tedds, J. (2011). Making the case for research data management. DCC briefing papers. Edinburgh: Digital Curation Centre.
Whyte, A. (2014). A pathway to sustainable research data services: From scoping to sustainability. In G. Pryor, S. Jones, & A. Whyte (Eds.), Delivering research data management services (pp. 59–88). London: Facet.
Xiu, J., & Wang, M. (2014). Competencies and responsibilities of social science data librarians: An analysis of job descriptions. College & Research Libraries, 75(3), 362–388.
Xu, Z., et al. (2022). A scoping review: Synthesizing evidence on data management instruction on academic libraries. Journal of Academic Librarianship, 48(3), Article 102508.
Yan, Y. (2020). Industry requirements for translators across China before COVID-19: Analyzing 51job listings through web scraping. Revista Argentina de Clínica Psicológica, 29(4), 768–779.
Yang, Z., Zhu, R., & Zhang, L. (2016). Research on the capability maturity model of digital library knowledge. In B. Xu (Ed.), Proceedings of the 2nd International Technology and Mechatronics Engineering Conference (ITOEC 2016) (pp. 333–337). Zhengzhou: Atlantis Press. Chongqing, China, May 21–22, 2016.
Yidavalapati, J., Sinha, P., & A, S. (2021). Research data management and services in South Asian academic libraries. Library Philosophy and Practice, 6457.
Yoon, A. (2017). Role of communication in data reuse. Proceedings of the Association for Information Science and Technology, 54(1), 463–471.
Yoon, A., & Donaldson, D. (2019). Library capacity for data curation services: A US national survey. Library Hi Tech, 37(1), 811–828.
Yoon, A., & Schultz, T. (2017). Research data management services in academic libraries in the US: A content analysis of libraries' websites. College & Research Libraries, 78(7), 920–933.
Zeng, M. (2019). Semantic enrichment for enhancing LAM data and supporting digital humanities. Review article. El Profesional de la Información, 28(1), Article e280103.
Zhan, M., & Widén, G. (2019). Understanding big data in librarianship. Journal of Librarianship and Information Science, 51(2), 561–576.