Developing a Theoretically Founded Data Literacy Competency Model
Andreas Grillenberger, Ralf Romeike
Friedrich-Alexander-Universität Erlangen-Nürnberg
Computing Education Research Group
Erlangen, Germany
ABSTRACT
Today, data is everywhere: Our digitalized world depends on enormous amounts of data that are captured by and about everyone and considered a valuable resource. Not only in everyday life, but also in science, the relevance of data has clearly increased in recent years: Nowadays, data-driven research is often considered a new research paradigm. Thus, there is general agreement that basic competencies regarding gathering, storing, processing and visualizing data, often summarized under the term data literacy, are necessary for every scientist today. Moreover, data literacy is generally important for everyone, as it is essential for understanding how the modern world works. Yet, at the moment data literacy is hardly considered in CS teaching at schools. To allow deeper insight into this field and to structure related competencies, in this work we develop a competency model of data literacy by theoretically deriving central content and process areas of data literacy from existing empirical work, keeping a school education perspective in mind. The resulting competency model is contrasted with other approaches describing data literacy competencies from different perspectives. The practical value of this work is emphasized by giving insight into an exemplary lesson sequence fostering data literacy competencies.
KEYWORDS
data, data literacy, data science, data management, competency model, CS education
ACM Reference Format:
Andreas Grillenberger and Ralf Romeike. 2018. Developing a Theoretically Founded Data Literacy Competency Model. In Proceedings of the 13th Workshop in Primary and Secondary Computing Education (WiPSCE ’18), October 4–6, 2018, Potsdam, Germany. ACM, New York, NY, USA, 10 pages.
1 INTRODUCTION
In recent years, the perception and use of data has changed considerably: While in the past, data was a topic for computer scientists only, nowadays it becomes increasingly relevant in all scientific fields. Based on tremendous advances, especially in the emerging fields of data management and data science, the awareness for data-driven technologies and methods has strongly increased. In particular, data-intense scientific discovery is nowadays even considered a new research paradigm (cf. [14]) alongside empirical, theoretical and computational/simulation approaches. But not only scientists come into contact with data regularly; data-driven technologies, results of data analyses and the task of storing data appropriately can also be recognized in various situations throughout daily life. In recent years, data has become a topic of societal discourse, in particular focused on rather problematic aspects, such as the unauthorized disclosure and analysis of data (as recently done by Facebook and Cambridge Analytica) or the influencing of elections with the help of data analyses. Thus, knowing about the possibilities offered by data and data analysis plays an increasing role for developing an understanding of the world. As a result of an ACM workshop, Frank and Walker [11] summarize: “As data, open, big, personal or in any other guise, becomes increasingly important, power will flow to those who are able to create, control and understand data. Those who cannot, will become powerless. Further, their ability to participate in society will be severely challenged as they lack the tools to engage with an important raw material of society.” Hence, to be able to cope with the new opportunities and challenges that arise, researchers, practitioners and generally everyone has to acquire some competencies and understand phenomena in the context of data, for instance how large volumes of data can lead to unexpectedly accurate predictions of information not obviously included in a data set. To foster basic knowledge and competencies in this area, for example Wolff and Koertuem [22] discussed simple aspects of data analysis and visualization with seventh and ninth grade students in the context of energy usage. Such competencies are more recently summarized under the term data literacy: In higher education, several approaches for teaching data literacy in interdisciplinary settings and/or from a practical perspective have already been proposed and evaluated, but only few reports set a focus on competencies. Yet, the relevance of data literacy is not restricted to higher education: As a sound understanding of phenomena occurring in everyday life is based on knowledge about data analyses, predictions and their limits, data literacy is also a
WiPSCE ’18, October 4–6, 2018, Potsdam, Germany Andreas Grillenberger and Ralf Romeike
and secondary education cannot be anchored in these existing approaches, but needs to be built up systematically and anchored in school education, keeping didactic aspects in mind.
Thus, in this paper, after having created a theoretical foundation, we describe the development of a theoretically founded competency model of data literacy: In contrast to other work, we set our focus on a general knowledge perspective on the field. After developing the competency model, we discuss the resulting model, characterize it with exemplary competencies and contrast it with other approaches. To give an impression of its practical relevance, we conclude by outlining a data-literacy-oriented lesson sequence for secondary CS education.
2 DATA MANAGEMENT AND DATA SCIENCE AS FOUNDATION OF DATA LITERACY
Data literacy can be defined as the “ability to collect, manage, evaluate, and apply data, in a critical manner” [16] or, more extensively, as “the knowledge of what data are, how they are collected, analyzed, visualized and shared, and [. . . ] the understanding of how data are applied for benefit or detriment, within the cultural context of security and privacy.” [3] These are not the only approaches to defining this relatively new topical field; however, all definitions share an incorporation of various aspects related to handling data. From a CS perspective, most of the tasks described by the two definitions mentioned before originate from data management and data science: Data management, as well as the original field databases, focuses on rather static aspects related to data, in particular on how they are stored and accessed appropriately, while data science sets its focus on the rather dynamic aspects, such as data analysis and visualization. Hence, we assume that both fields give a clear impression of data literacy from a CS perspective and are a suitable basis for investigating this field in depth. Thus, central ideas of both fields need to be taken into account when developing a data literacy competency model.
Data management has already been thoroughly investigated from a CS education perspective; in particular, its long-lasting key concepts were identified and structured in the model of key concepts of data management [13]. The core technologies, practices, design principles and mechanics of data management, as summarized in this model (cf. fig. 1), were derived empirically based on a qualitative content analysis of established textbooks from this field and structured by adopting the model of the Great Principles of Computing [10]. As data management and data literacy exhibit strong overlaps, we assume that this model is suitable for getting first insights into data literacy. Additional insights from a practical perspective are gained by investigating data life cycle models (e. g. fig. 2). Later, by comparing our resulting competency model with existing data literacy competency descriptions, this assumption is evaluated.
The second foundational field of data literacy, data science, was investigated in-depth particularly in the EDISON project². As part of this project, a competency framework [9] and a body of knowledge [8] have been developed, along with a model curriculum and a professional framework. Especially the first two documents give important insight into this field: They were created based on the requirements set out, for example, in job advertisements for data analysts, which give a clear impression of what others expect from data scientists, but hardly consider a scientific perspective, which might set different focus points. Hence, as a basis for developing this competency model, in previous work [12] we also conducted a qualitative content analysis with the goal to describe the contents of data science with a focus on the scientific perspective: Documents describing several data science study programs were investigated with the goal to determine the content knowledge expected of the graduates. As this specific analysis is not the focus of this paper, but will function as an important basis for the competency model, key aspects are summarized below:
• From all study programs related to data science in Germany, we selected those which set a clear focus on data science (N = 11 in June 2018).
• By analyzing the respective module descriptions of mandatory courses, we identified central contents that every graduate should get to know.
• As a result, four central content areas of data science with various specific contents were identified: data analysis and machine learning, big data, data privacy and data ethics, and data storage.
² EDISON is a research project trying to build a data science profession.
Core Technologies: file stores, databases, data stream systems, data analyses, data mining, semantic web, document stores
Practices: acquisition, cleansing, modeling, implementation, optimization, analysis, visualization, evaluation, sharing, archiving, erasure
Design Principles: data independence, integrity, consistency, isolation, durability, availability, partition tolerance, concurrency, redundancy
Mechanics: structurization, representation, replication, synchronization, partitioning, transportation, transaction
Figure 1: Model of key concepts of data management [13].
Life-Cycle: acquisition, cleansing, modeling, implementation, optimization, analysis, visualization, evaluation, sharing, archiving, erasing
Figure 2: Data life cycle [13].
3 DATA LITERACY IN (CS) EDUCATION
Although data literacy is not limited to higher education, it has hardly been considered from a general (CS) education perspective. For several years, even data in general was rarely discussed in CS education research, which instead focused on other aspects of CS. Despite not being in focus, there are indications that data literacy can be considered a central part of general knowledge. For example, when adapting the concept of computational thinking (cf. [20]) for mathematics and science classrooms, Weintrop et al. [19] introduced data practice as one of four central aspects, described as collecting, creating, manipulating, analyzing and visualizing data: “Data lie at the heart of scientific and mathematical pursuits. They serve many purposes, take many forms, and play a variety of roles in the conduct of scientific inquiry.” [19] Despite using a term other than data literacy, their work considers various aspects of data literacy as topics for high school education. Even from a CS education perspective, aspects of data literacy are not completely out-of-scope: For example, the CSTA/ISTE Computational Thinking Teacher Resources [5] involve several aspects of handling and analyzing data that are clearly related to data literacy, e. g. that grade 9 to 12 students should “develop a survey and collect both qualitative and quantitative data to answer the question: ‘Has global warming changed the quality of life?’” and “Use appropriate statistical methods that will best test the hypothesis: ‘Global warming has not changed the quality of life’.” Another example is the 2017 CSTA K-12 Computer Science Standards [4], which also consider aspects related to data literacy, such as “Identify and describe patterns in data visualizations, such as charts or graphs, to make predictions.” Yet, these approaches in general miss a technical foundation and do not consider data literacy in a systematic way.
Despite this clear relevance of data literacy for general knowledge, most work on this topic focuses on higher education: Inspired by the vision of Jim Gray [14], data-intense scientific discovery (also referred to as eScience) is considered a new research paradigm based on processing and analyzing the immense amounts of observational research data. This new paradigm becomes important in almost every scientific discipline, hence there is a clear need to foster data literacy competencies in higher education. Following this need, in a study on strategies and best-practices for data literacy education, Ridsdale et al. [16] investigated articles about data literacy and related topics, but also gray literature such as reports, white papers and informal literature such as blog posts. Hence, they consider how data literacy is seen from different perspectives and identified 23 competencies (cf. fig. 3) and 64 tasks/skills of data literacy: For example, “data discovery and collection”, “data manipulation” and “data ethics” are considered as data literacy competencies and “identifies useful data”, “cleans data” and “applies and works with data in an ethical manner” as exemplary knowledge/tasks. Yet, because of their different focus, directly adopting these results for school teaching is not possible.
Figure 3: Data literacy competencies determined by Ridsdale et al. [16].
COMPETENCY MODEL
The literature review has shown that, despite the relevance of data literacy, there is no competency model or description of data literacy available that is directly suitable for CS education in schools. Instead of adapting an existing approach, we decided to develop a completely new data literacy competency model based on a theoretical approach, keeping this larger target audience in mind throughout the development process. Therefore, we based our work on the assumption that the aforementioned work on data management and data science is a suitable basis for this purpose, which is evaluated afterwards by contrasting our model against the existing data literacy competency model by Ridsdale et al. [16]. In our approach, we set a focus on the scientific perspectives on the underlying fields. Following our basic assumptions, this allows us to theoretically found and argue for the resulting competencies with high validity.
In accordance with other competency models, in particular the one of the German educational standards for computer science in secondary schools [1] as well as the NCTM principles and standards for school mathematics [15], we decided to divide the model into two parts: Content areas reflect the CS content addressed by the competencies, while process areas emphasize the practical activities. This separation is promising for data literacy, as it considers two different perspectives on each data literacy competency by relating it both to a process area, which includes practices that reflect how people come into contact with data and how they handle and process them, and to a content area, which considers the theoretical background and the underlying scientific concepts that need to be understood. Due to their strong interconnection, these areas cannot be considered completely separate: For example, the potential process area data analysis represents an important practice in this field, as people mostly come into contact with data by reading about data analyses. Yet, for understanding how they work and for assessing their results it is not enough to know how to use software
to conduct data analyses. Instead, several concepts of data management and data science need to be understood, including aspects from the potential content areas analysis methods, data storage, visualization and data ethics. In CS lessons, depending on the desired educational goals, the focus can be shifted between content and process areas, but none of them can be left out completely. Hence, representing the competency model with two intertwined types of areas particularly emphasizes the wide variety of links between practically-oriented and content-oriented aspects.
With respect to the different natures of these areas, we first consider them separately: In the next two sections, we address the content areas and later the process areas and describe their origin, emergence and the resulting aspects. Afterwards, both types of areas are merged into a competency model and the resulting model is discussed and contrasted with existing approaches.
4.1 Deriving the Content Areas of Data Literacy
For deriving the content areas of data literacy, finding a basis that gives appropriate insight into the complete field is necessary. For this purpose, we use the aforementioned results of the study on contents of data science, but also refer to the content-oriented aspects of the model of key concepts of data management. Hence, seven aspects that will be used as the basis for deriving the content areas were identified: Coming from data science, there is data analysis and machine learning, big data, data privacy and data ethics, and data storage; from data management, we get the core technologies, design principles and mechanics (cf. fig. 1). However, on closer examination, it becomes apparent that the aspects described there cannot be used as content areas without further discussion, because some of them have clear overlap (e. g. aspects of big data and data storage with parts of the mechanics and design principles) and/or need to be clarified in detail as they are rather unspecific (such as big data in general). Also, some terms, for example mechanics, describe data management on a rather conceptual level, while others, such as core technologies, reflect a more abstract technological level. Hence, for deriving the content areas of our data literacy model, these candidates were consolidated keeping in mind the target audience of the model, in particular secondary CS teachers. This led to the following criteria, which should be fulfilled by the final content areas:
The content areas. . .
• represent sets of strongly related concepts/ideas.
• focus on topics that are specifically relevant to data literacy, not only to CS in general.
• have as little overlap as possible.
• give clear insight into one part of data literacy.
• emphasize a content-related perspective on data literacy.
To find content areas that fulfill these criteria, we had to clarify the subject areas, in particular by further characterizing these terms based on their respective definitions, but also by investigating their overlaps and differences. For this purpose, they were first narrowed down to a longer list of more specific topics. This allows for a more detailed insight into these areas while emphasizing their overlaps and similarities. Of course, this list cannot be complete, but only further characterizes the terms, which is appropriate for the targeted goal. This led to the following more detailed topics:
• data analysis and machine learning:
  – methods of data analysis, such as classification and clustering
  – predictions based on data
  – learning from data, in particular unsupervised and supervised learning
  – quality of data and analysis results
  – basic ideas of data analysis, such as data vs. information, information entropy, correlations vs. causalities
• big data:
  – correlation-based data analysis
  – techniques for managing large amounts of data
  – systems for storing large amounts of data
• data privacy and data ethics:
  – data ethics
  – basics of data security and safety
  – personal data
  – data privacy
• data storage:
  – systems for storing and managing data
  – function principles of data storage systems
• core technologies:
  – systems for storing and managing data
• mechanics:
  – function principles of data storage systems
  – representing data on a physical level
• design principles:
  – ways for accessing data
  – requirements on data stores and data storage
To eliminate overlaps and to (re-)combine similar aspects into one content area, we reorganized the topics and merged subordinate aspects under (partly new) superordinate terms, which resulted in four content areas described below. The links between the aforementioned topics and the content areas are shown in fig. 4: For example, predictions based on data became part of the content area C3 (data analysis), while the basic ideas of data analysis cover aspects that are related to C3, but also to C1 (data and information).
(C1) Data and information was introduced as additional content area. It covers basic knowledge, such as the difference between information and data, ways for representing information as data, but also the difference between small and large amounts of data regarding their meaningfulness. Hence, this area contains aspects of the topical areas data analysis / machine learning, big data, data storage and mechanics.
(C2) Data storage and access is focused on aspects concerning the storage of and access to data and hence concepts that are particularly related to data management, but also considered relevant to data science. In particular, this area contains aspects such as replication or synchronization of data, representation of data on storage media, but also accessing data.
(C3) Data analysis is particularly focused on methods, algorithms and principles that are central to analyzing data, making predictions based on those and learning from data. With this focus, this content area is almost identical to the subject
[Figure 4: topic boxes, each labeled with its final content area; e.g. methods of data analysis, predictions based on data, learning from data and data quality (C3); ethics, security/safety, personal data and privacy (C4); systems for storing/managing data, function principles of data stores, representing data on physical level, ways for accessing data and requirements on data and data storage (C2).]
Figure 4: Division of the candidates for content areas into more specific parts and assignment to the final content areas (marked by C1-C4).
area data analysis / machine learning. Yet, in the name of the content area, the term machine learning was left out: As only some aspects from this field are covered by data literacy, including this term could be misleading.
(C4) Data ethics and protection is directly derived from the subject area data ethics and privacy, yet by changing privacy to protection, the focus is expanded to also include aspects of data security and protecting data in general.
To summarize the above, these content areas represent various aspects of computer science: While the second area is obviously related to data management and the third to data science, their respective concepts are also related to other parts of computer science. For example, when accessing data, computer networks play an important role. Likewise, when discussing the topic data vs. information, aspects of information theory and hence theoretical CS, such as information entropy, gain importance and algorithmic aspects play a central role. Hence, in addition to their relevance to data literacy, these content areas also emphasize strong roots of data literacy in and its links to various aspects of computer science that need to be taken into account in teaching.
4.2 Identifying Process Areas of Data Literacy based on the Data Life Cycle
As mentioned before, a data life cycle model was used to identify the process areas of data literacy. The processes mentioned in such a model represent important steps and tasks in this context and are reflected in data science and data management alike. In addition, they give a clear impression of how people come into contact with data and how they handle them. As a basis for this step, we use the data life cycle model which was developed based on the model of key concepts of data management [13] (cf. fig. 2). Although this model was created from a data management perspective, it is not specific to this field of CS. Instead, it shows high accordance with similar models, which differ in the terms used and in setting different emphases, but share the same meaning (cf. [13]). From this data life cycle, we derived eleven initial candidates for the process areas: acquisition, cleansing, modeling, implementation, optimization, analysis, visualization, evaluation, sharing, archiving and erasure.
From a scientific perspective, these process areas clearly describe the entire life cycle of data and cover all aspects mentioned in typical definitions of data literacy. Yet, a list of eleven process areas (and several content areas strongly connected to them) raises the legitimate question whether all of these processes are equally relevant for school education. In addition, this large list of terms makes a competency model relatively complex and extensive, and requires detailed knowledge to clearly distinguish the different aspects. Thus, in a next step, our goal was to evaluate these areas from a CS education perspective and in consequence to compress these areas to a more compact and comprehensible list. To achieve this goal, we discussed the list of candidates with teachers and researchers, some with, some without prior knowledge on data science and data management: With these participants, we considered the candidates as guidelines through fictional CS lesson sequences with the overarching goal to convey several aspects of data literacy. Keeping this goal in mind, in two subgroups concepts for lesson sequences with slightly different goals and foci were developed. During this development the participants identified several problems with the candidates for process areas, which were discussed afterwards:
• Some process areas can hardly be considered separately: Especially, implementation and optimization typically go hand in hand, but also archiving and erasing cannot be separated at all, as they are clear opposites to each other: a decision to archive data involves deciding against erasing those and vice versa. Also, data acquisition and cleansing are closely related and typically done at the same time. Hence, during consolidation these areas were merged, as trying to consider them separately raises problems for people using the competency model and is hardly reasonable because of their strong connections.
• The area modeling cannot be considered independently from several other areas: Modeling is already an essential task when deciding which aspects of the physical world to capture as data, but also when storing data, for example in a database, or when planning data analyses. Hence, we consider two different types of modeling: data modeling and process modeling. While the latter, for example, is an inherent part of data analysis, the former needs to be considered separately
and has a more specific value from a CS perspective. To emphasize data modeling instead of other types of modeling, we combined modeling with data gathering and cleansing, as data modeling particularly takes place in this part of the data life cycle.
• Sharing is a form of handling data which is similar to archiving and erasing in several aspects: it needs to consider data privacy aspects and methods for giving others access to data, must ensure that unauthorized access is prevented, and decisions must be made with respect to ethical considerations. As all three processes are applicable not only to the results of the analysis, but to all data throughout the whole process, merging them into one process area is reasonable.
• Finally, one area was considered missing: interpreting data and analysis results. In the prototypical process areas, this aspect was considered an inherent part of analyzing and visualizing. Yet, it is reasonable to emphasize this aspect more strongly, as interpretation is of tremendous importance for handling data and should never be left out. As interpreting typically goes hand in hand with analyzing and visualizing and as these aspects are also strongly related to each other, all three aspects were merged into one process area.
Based on these findings, we were able to condense the list of process areas by combining several areas. Also, with interpretation, an additional process area was introduced. As a result, we determined four process areas of data literacy:
(P1) data gathering, modeling and cleansing
This process area takes into account the early phases of handling data. It combines three aspects that cannot be considered separately: Data always needs to be structured in a way suitable for storage, access and use. This already provides two modeling aspects: Deciding for the part of the real world that should be captured as data and creating a suitable data model. Also, it is essential to detect and eliminate errors when gathering data and mistakes in the resulting data set as early as possible. With these aspects, this first process area addresses four questions: Which attributes do I need to capture as data? How can I capture them? How can these data be stored in a way that I can later use them? Are the captured data
The implementation and optimization takes place on different levels: In particular, it includes the implementation of
classification or clustering, may be used with the goal to extract new information from them. In addition, visualization is often important, as good visualizations support people’s understanding, but even the analysis itself might be supported by visual methods. Hence, this area deals with three questions: Which information can I extract from my data? How can I help people to easily grasp the essential? Which conclusions can I draw from my analysis results?
(P4) sharing, archiving and erasing
The last aspects in the data life cycle, sharing, archiving and erasing, are also essential from a data literacy perspective: In particular, sharing and archiving data include ideas such as consolidation, pseudonymization, and anonymization. Structural meta-data is used to find and organize data, while also being relevant for handling data on a daily basis. On the other hand, erasing data marks the end of the data life cycle and raises the challenge of securely deleting data. Along with sharing, it also clarifies the challenge that deleting data completely is typically not possible anymore if it has been shared with others. Hence, in this process area the following questions are raised: Which data do I want to share with whom? Which data do I want to archive and how? How can I delete data appropriately?
When considering the links of these process areas to the underlying fields of CS, it becomes evident that the second process area is particularly related to data management and its key concepts, while the third instead focuses on aspects of data science. In contrast, the first and last process areas are not related to specific areas of CS, but rather frame the others with generally relevant topics concerning handling and processing data. Hence, the process areas of data literacy emphasize both the static and dynamic aspects of data from a CS perspective, but also consider generally relevant topics concerning this field.
4.3 A Prototypical Data Literacy Competency Model
By combining the process and content areas as argued before, we can construct the competency model of data literacy shown in fig. 5.
[Figure 5: competency model; recoverable labels: data and information (C1), data storage and access (C2), process areas.]
ethics on the content side and by aspects of handling and processing
data on the practical side. This is consistent with popular definitions
of data literacy: For example, Ridsdale et al. [16] define data
literacy focusing on the practical aspects “collect, manage, evaluate,
and apply data”, but also emphasize that these practices need to
be applied “in a critical manner”, which makes a clear technical
[7] or Vahey et al. [17], set similar foci and describe data literacy
mainly from a practical perspective, which is particularly covered
by our process areas.
By design, the model emphasizes the strong link between content
and process areas, as merely considering a single factor is not sufficient
to clearly describe data literacy, to give insight into this field and
to develop appropriate competencies. On the contrary, both areas
have to be considered closely intertwined in order to allow students
to develop practical competencies that are technically sound through
appropriate content knowledge. Hence, considering a process area
without connecting it to any content area or vice versa is not intended
by the model. While there are several obvious connections,
for example between P3 (analyzing, visualizing and interpreting)
and C3 (data analysis), these are not the only links. Indeed, each
process area has connections to all the content areas and vice versa.
In table 1, we illustrate these connections by providing exemplary
competencies for all combinations of process and content areas. As
these competencies have not been evaluated or discussed further,
they are not to be considered a valid or complete list; instead,
they give an impression of the scope of the competency model.
As the exemplary competencies show, various links between
process and content areas are possible. However, neither the model
itself nor the abovementioned competencies distinguish different
competency levels. Thus, in future work this model needs to be
extended from a competency structure model to a competency level
model by introducing a third dimension which considers different
levels based on further research. However, as the model was developed
based on a professional point-of-view on the field, these
missing competency levels also suggest that the model is not restricted
to school education, but may also fit other educational
levels.
In the presented form, the competency model offers many benefits
for both research and practice: It allows lessons and
sequences to be evaluated regarding the acquisition of basic data literacy
competencies. Also, it may be used as the basis for developing lessons and
courses with a focus on fostering data literacy competencies and
helps to technically substantiate them. In particular, the technical
foundation of the model contributes to these possibilities: It was
derived in a theoretically-argumentative way from two existing
empirical studies, which take into account the two fields data science
and data management that form a basis for data literacy. The
origin of the developed model is also visualized in fig. 6.

[Figure 6: Visualization of the origin of the model. Recoverable labels: Data Science; Data Management; Curricula; Literature; Data Science Contents; Data Life Cycle; Key Concepts of Data Management; Content Areas; Process Areas.]

5 COMPARISON WITH OTHER DATA LITERACY (COMPETENCY) MODELS

As mentioned before, the model developed in this work is not the
only approach to characterize data literacy or its competencies.
Starting from a description of the data inquiry process (problem –
plan – data – analysis – conclusions), Wolff et al. [21] derived several
competencies that were summarized into seven foundational
competencies of data literacy (cf. fig. 7). These competencies also
represent a data life cycle model similar to the one used for the
development of our model. However, as our basis considers more
details, it also gives better insight into the process areas, and in
particular adds and explicates content areas that are not reflected
in the model by Wolff et al. Yet, on an argumentative basis, they
set a stronger focus on the area problem – ask questions from data.
In our model, this aspect is not covered explicitly, as it rather provides
the reasons for data analysis and does not cover CS related
concepts/ideas that are not also part of analysis and evaluation.
As we consider data literacy from a CS education perspective, we
regard this aspect as being out-of-scope of the model, but without
neglecting its importance.

[Figure 7: The space of data literacy skills by Wolff et al. [21]. Recoverable labels: Real world problem-solving context; Ethics; Problem (ask questions from data); Plan (develop hypotheses and identify potential sources of data); Data (collect or acquire data); Analysis (analyse and create explanations from data); Conclusions (evaluate the validity of explanations based on data and formulate new ...).]

The most popular study on data literacy, which was conducted
by Ridsdale et al. [16], also results in a set of data literacy competencies,
which are more detailed than the previously mentioned
approach. As part of their work, Ridsdale et al. identified five knowledge
areas with 22 competencies (cf. table 2). However, they use a
different competency term, which in comparison to the commonly
used competency definition by Weinert [18], seems to be on a more
abstract level: For example, the competency basic data analysis
mentioned by Ridsdale et al., following the common understanding,
should rather be considered a competency area as it includes various
different competencies. Yet, this difference does not interfere with
the comparison of the meaning behind both models: Although they
are structured differently and originate from different methodological
approaches, both models contain many similar aspects and
show a strong overlap, in particular related to the practices and
WiPSCE ’18, October 4–6, 2018, Potsdam, Germany Andreas Grillenberger and Ralf Romeike
Table 1: Matrix of exemplary competencies for the different combinations of process (P1–P4) and content areas (C1–C4).

Process areas (columns):
P1: gathering, modeling and cleansing
P2: implementing and optimizing
P3: analyzing, visualizing and interpreting
P4: sharing, archiving and erasing

data and information (C1)
  P1:
  - choose suitable sensors for gathering the desired information as data
  - structure the gathered data in a suitable way for later analysis
  - evaluate if the captured data represents the original information correctly
  P2:
  - implement algorithms for gathering the desired data
  - implement simple algorithms to download data from web APIs
  - discuss optimizations and limits of data gathering
  P3:
  - combine data to gain new information
  - emphasize the desired information in visualizations
  - interpret data and analysis results to get new information
  P4:
  - decide whether to share original data
  - decide which of the original data to store to keep the required information
  - decide on an appropriate way to delete specific data

data storage and access (C2)
  P1:
  - select a suitable data model
  - structure the gathered data in a suitable way for storage
  - visualize data models in a suitable way
  P2:
  - decide on a suitable data storage and store the data
  - use possibilities for enabling efficient access to data
  - increase storage efficiency using compression
  P3:
  - access the data in a suitable way for analysis
  - use suitable data formats for the data to analyze
  - store their analysis results appropriately
  P4:
  - decide whom to give access to the stored data
  - determine access rights for the data
  - discuss issues related to data validity when erasing data

data analysis (C3)
  P1:
  - decide whether specific data influences results of analysis
  - structure data appropriately for analysis
  - connect data from different sources for analysis purposes
  P2:
  - implement simple analysis algorithms
  - determine adjustable parameters for analysis
  - optimize data analyses in order to gain higher quality results
  P3:
  - decide for appropriate analysis methods
  - visualize data and analysis results
  - interpret the results of analyses
  P4:
  - decide which analysis results to share with whom
  - reason whether storing the original data is necessary after analyzing them
  - decide whether it is reasonable to share information about the analysis process

data ethics and protection (C4)
  P1:
  - reflect ethical issues when gathering information
  - decide whether combining different data sources is reasonable in specific contexts
  - discuss impacts on privacy when continuously capturing data
  P2:
  - discuss how to anonymize or pseudonymize data appropriately
  - exclude data from permanent storage based on ethical considerations
  - choose access rights to data based on privacy issues
  P3:
  - discuss the ethical impacts of the conducted data analyses and their results
  - decide whether analysis results are sufficiently anonymized
  - reflect whether analyzing specific data raises privacy issues
  P4:
  - reason whether storing data for further uses should be allowed from an ethical perspective
  - decide on appropriate ways to securely erase original data and analysis results
  - find ways for appropriately removing attributes that lead to privacy issues
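The matrix structure of table 1 can be sketched as a lookup keyed by content and process area. The following Python sketch is only an illustration and not part of the model: the names COMPETENCIES and competencies_for are hypothetical, only a few cells of table 1 are filled in, and the selected combination of areas is just one possible choice.

```python
from itertools import product

# Area labels as used in table 1.
PROCESS_AREAS = {
    "P1": "gathering, modeling and cleansing",
    "P2": "implementing and optimizing",
    "P3": "analyzing, visualizing and interpreting",
    "P4": "sharing, archiving and erasing",
}
CONTENT_AREAS = {
    "C1": "data and information",
    "C2": "data storage and access",
    "C3": "data analysis",
    "C4": "data ethics and protection",
}

# Hypothetical excerpt: a few cells of table 1, keyed by (content, process).
COMPETENCIES = {
    ("C3", "P2"): ["implement simple analysis algorithms"],
    ("C3", "P3"): [
        "visualize data and analysis results",
        "interpret the results of analyses",
    ],
    ("C4", "P2"): ["discuss how to anonymize or pseudonymize data appropriately"],
    ("C4", "P3"): ["decide whether analysis results are sufficiently anonymized"],
}


def competencies_for(contents, processes):
    """Collect the exemplary competencies for every selected combination."""
    selected = {}
    for c, p in product(contents, processes):
        if c not in CONTENT_AREAS or p not in PROCESS_AREAS:
            raise KeyError(f"unknown area: {c}/{p}")
        # Cells that are not part of the excerpt yield an empty list.
        selected[(c, p)] = COMPETENCIES.get((c, p), [])
    return selected


# Example selection: two content areas combined with two process areas.
lesson = competencies_for(["C3", "C4"], ["P2", "P3"])
for (c, p), items in lesson.items():
    print(f"{CONTENT_AREAS[c]} x {PROCESS_AREAS[p]}: {items}")
```

Keying the cells by pairs mirrors the idea that a process area is never considered without a content area: every lookup names both dimensions.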
process areas. The model by Ridsdale et al. contains all aspects that
are also emphasized in our model, which leads to the assumption
that our model does not add invalid aspects. However, Ridsdale
et al. add some aspects not explicitly mentioned in our model, for
instance metadata creation and use or presenting data (verbally).
These aspects are not completely out-of-scope for our model, but,
with its focus on school education, our model merely sets another emphasis
and thus does not cover these areas equally: While metadata is of
course an important topic, it is not considered as being on content
area level; instead, it is considered in data and information (C1).
Verbal presentation meanwhile is not in the focus of our model,
as this is a competency which is not specific to data literacy. In
general, the five knowledge areas presented by Ridsdale et al. are
mostly equivalent to our content areas, yet they broaden the focus
of the last content area from data ethics and protection (as we call
it) to data application in general. However, data application is also
covered in all other parts of our model, as the practices are oriented
towards data application in general.
Table 2: Knowledge areas and competencies of data literacy as identified by Ridsdale et al. [16]

knowledge area          competencies
conceptual framework    introduction to data
data management         data organization; data manipulation; data conversion;
                        metadata creation and use; data curation, security, and
                        re-use; data preservation
data evaluation         data tools; basic data analysis; data interpretation
                        (understanding data); identifying problems using data;
                        data visualization; presenting data (verbally); data
                        driven decision making (DDDM) (making decisions based
                        on data)
data application        critical thinking; data culture; data ethics; data
                        citation; data sharing; evaluating decisions based on data

In summary, the comparison to the two exemplarily chosen models
shows that despite the different approach of our work, the results
are similar to existing models: All three models generally cover
the same parts and show no contradictions with each other. This
also supports our fundamental assumption that existing works on
data management and data science also give clear insight into data
literacy, at least when a CS education perspective on the contents
and practices described is taken. Correspondingly, by emphasizing
the distinction of process and content areas, our model makes a
clear contribution to research in data literacy education, as this
model helps educators to keep both the content-related and practical
perspectives in mind and to consider them appropriately. As a result
of the clearly described and comprehensible approach, it is even
possible for them to reconstruct the competencies with respect to
a specific target audience. Additionally, this work contributes to
theoretically founding discussions on data literacy. Also, the model
becomes more understandable as the principles behind the competencies
are well-described and as all competencies can be traced
back to their origins.

6 USING THE COMPETENCY MODEL FOR PLANNING CS LESSONS

In order to give an example of how to use the developed competency
model for CS education, in the following we will outline
the development of a lesson sequence based on this model. In this
example, the overall lesson goal is to raise students’ awareness
regarding the analysis of large amounts of data and predictions
based on those. For determining more specific goals and targeted
competencies, considering the process areas of data literacy was a
helpful approach: As we do not want to give students strict “rules”
for handling their data, but instead give them insight into the possibilities,
limits and threats of this topic, conducting their own data
analysis on real data in a context that affects them is a more suitable
approach. Hence, at least the process area P3 (analyzing, visualizing
and interpreting) has to be considered in this lesson sequence, but in
order to include insight into the limits of such analyses, process area
P2 (implementing and optimizing) needs to be taken into account
as well. Yet, only giving practical insight without fostering technically
sound knowledge is not appropriate to enable the transfer of
knowledge to new situations and to allow recognition of general
functions of such analyses. Instead, related content areas also need
to be considered: In particular C3 (data analysis) gives the technical
foundation for understanding how data analyses work and how
to conduct them. As considering real-world problems is a central
aspect of the planned lessons, also C4 (data ethics and protection)
raises important concerns.
Regarding these four selected areas, we could for example strive
for the following competencies in CS lessons:
• explain the working principle of a simple data analysis method
• explain how data can be predicted after learning from existing
  data
• interpret resulting/predicted data
• optimize a prediction model, e.g. by modifying the amount
  of training data used
• discuss analysis results from an ethical perspective
• discuss the analysis approach and goals in terms of ethical
  and societal aspects
Although the related topics can become very complex, it is possible
to achieve such competency goals in CS education: In a lesson
sequence (three lessons of 90 minutes each) that was conducted
with ninth grade students (about 15 years old), we were able to
achieve these goals. We started with what most students know, an
online shop intending to analyze purchases in order to predict what
people will buy next. In an unplugged approach, the students were
asked to determine rules in a fictitious data set on purchases in an
online shop that was generated exactly for this task. In this context,
classification trees were introduced and used to visualize the rules
and to predict attributes. On this foundation, we started with the
central task of our lesson: Based on attributes known about a set of
students and the points in two previous examinations, the students
in our class had to predict the points achieved by the other students
in a third examination. For this purpose, real data were used:
Students were given the chance to analyze real anonymized data
about more than 600 Portuguese students that are published in the
UCI Machine Learning Repository. This data set includes various
attributes of the students, their habits and their family situation as
well as the points they scored in three examinations. Based on the
process the students familiarized themselves with and carried out
manually before, this task was performed with software assistance
and automated, in order to show the high potential of such analyses
and to allow students to adjust their analysis flexibly. For this
purpose, we used the tool Orange (cf. fig. 8), which enables data
analysis without any programming knowledge by using a graphical
interface to visualize and model the data flow. Using this tool, the
students were able to conduct analyses and obtain results that were
fascinating for them: In particular, they were able to predict the third
examination grade with a relatively high accuracy and experienced
both the power of such analyses and their limits, e.g. when trying
to optimize their results. With this simple approach, they were
even able to reproduce and comprehend parts of a scientific study
[2] and were able to understand a central principle of data analyses
