ACM Task Force on Data Science Education: Draft Report and Opportunity for Feedback
Initial Draft
January 2019
1.1 Charter
1.2 Prior work on defining data science curricula
1.3 Committee work and processes
1.4 Survey of academic and industry representatives
1.5 Knowledge areas
1.6 Data Science in context
1.7 Competency framework
1.8 Motivating the study of data science
1.9 Overview of this report
References
1.1 Charter
At the August 2017 ACM Education Council meeting, a task force was formed to explore a
process to add to the broad, interdisciplinary conversation on data science, with an articulation of
the role of computing discipline-specific contributions to this emerging field. Specifically, the
task force would seek to define what the computing/computational contributions are to this new
field, and provide guidance on computing-specific competencies in data science for departments
offering such programs of study at the undergraduate level.
There are many stakeholders in the discussion of data science – these include colleges and
universities that (hope to) offer data science programs, employers who hope to hire a workforce
with knowledge and experience in data science, as well as individuals and professional societies
representing the fields of computing, statistics, machine learning, computational biology,
computational social sciences, digital humanities, and others. There is a shared desire to form a
broad interdisciplinary definition of data science and to develop curriculum guidance for degree
programs in data science.
1.2 Prior work on defining data science curricula
This volume builds upon the important work of other groups who have published guidelines for
data science education. There is a need to acknowledge the definition and description of the
individual contributions to this interdisciplinary field. For instance, those interested in the
business context for these concepts generally use the term “analytics”; in some cases, the
abbreviation DSA appears, meaning Data Science and Analytics.
As an inherently interdisciplinary area, data science generates interest within many fields. (See
Figure 1.) Accordingly, there have been a number of Data Science curriculum efforts, each
reflecting the perspective of the organization that created it.
Figure 1
This project looks at data science from the perspective of the computing disciplines, but
recognizes that other views contribute to the full picture. The following examples are especially
important, and have informed the committee’s work.
The EDISON Data Science Framework (2015)
EDISON is a project started in September 2015 “with the purpose of accelerating the creation of
the Data Science profession.” The core EDISON consortium consists of seven partners across
Europe. Since 2015, the group has worked to create the EDISON Data Science Framework. This
collection of documents includes a general introduction, as well as four detailed components:
● the Data Science Competence Framework (CF-DS)
● the Data Science Body of Knowledge (DS-BoK)
● the Data Science Model Curriculum (MC-DS)
● the Data Science Professional Profiles (DSPP)
This comprehensive set of curricular volumes parallels the intended structure of our work.
EDISON was in earlier stages as this project began; at present, it is clear that there are significant
overlaps, and future versions of our work will reconcile our model with the EDISON
curriculum, with the intention of creating a complementary volume rather than a replicated or
competing one.
The National Academies of Science, Engineering, and Medicine Report on Data Science for
Undergraduates (2018)
As the press release announcing the publication of the National Academies report states, “Data
science draws on skills and concepts from a wide array of disciplines that may not always
overlap, making it a truly interdisciplinary field. Students in many fields need to learn about data
collection, storage, integration, analysis, inference, communication, and ethics.” The report
highlights the demand for data scientists and calls for a broad education for students across
programs of study. Identifying many data science roles, including those related to hardware and
software platforms, data storage and access, statistical modelling and machine learning, and
business analytics, among others, the report does not presume that every data scientist will be
expert in all areas, but rather that programs will develop to allow graduates to fulfil specific
roles.
The intent of the National Academies report was to highlight the importance, breadth, and depth
of data science, and to provide high-level guidance for data science programs. It is not a detailed
curricular volume in the sense of the EDISON project or this ACM Data Science effort.
The Park City Math Institute Curriculum Guidelines (2017)
The Park City Math Institute 2016 Summer Undergraduate Faculty Program convened with the
purpose of articulating guidelines for undergraduate programs in data science. The three-week
workshop brought together 25 faculty from computer science, statistics and mathematics. The
base assertion of the report and proposed curriculum is that data is the core: “The recursive data
cycle of obtaining, wrangling, curating, managing and processing data, exploring data, defining
questions, performing analyses, and communicating the results lies at the core of the data science
experience.”
The resulting list of key competencies shows the interdisciplinary nature of data science, with an
understandable focus on mathematics and statistics.
The role of computer science appears in the description of computational thinking: “Data science
graduates should be proficient in many of the foundational software skills and the associated
algorithmic, computational problem solving of the discipline of computer science.” However,
further description relates these skills to understanding the programming and algorithms behind
“professional statistical analysis software tools.”
The Park City report deserves further description. It includes an outline of courses for the Data
Science Major.
The report also includes a description of each of the courses. For the purposes of this report, it is
noted that programming is introduced in Introduction to Data Science I and II, and appears again
as a part of Algorithms and Software Foundations. The course in Data Curation includes
traditional databases as well as newer approaches to data storage and interaction. The course in
Statistical and Machine Learning “blends the algorithmic perspective of machine learning in
computer science and the predictive perspective of statistical thinking.”
Although there certainly are additional aspects of computer science that are relevant to the
preparation of a student of data science, there is clearly an effort to combine the mathematical
and computer science contributions to produce a blended program. This ACM Data Science
report builds on the Park City work with a heavy orientation toward computer science. The
position of the Task Force is that any Data Science program will have to reflect competencies in
mathematics, statistics, and computer science, possibly with different emphases. This is
consistent with the view of the National Academies report. Graduates of programs following the
Park City guidelines will have valuable strengths and graduates of programs following these
ACM guidelines will have different, but equally valuable strengths.
The Business-Higher Education Forum (BHEF) Data Science and Analytics (DSA)
Competency Map (2016)
The work provides a four-level competency map. The base, or Tier 1, level describes personal
effectiveness competencies. These are not considered competencies learned in school, but rather
part of an individual’s personal development. Examples include integrity, initiative,
dependability, adaptability, professionalism, teamwork, interpersonal communication, and
respect.
Tier 2 describes academic competencies to be acquired in higher education. These are most
relevant to this report and include the following:
• Deriving value from data
• Data literacy
• Data Governance and Ethics
• Technology
• Programming and Data Management
• Analytic Planning
• Analytics
• Communication
Tier 3 presents workplace competencies: planning and organizing, problem solving, decision-
making, business fundamentals, customer focus, and working with tools and technology.
Tier 4 is for Industry-Wide Technical Competencies. These are not specified, but represent skills
that are common across sectors of a larger industry context.
Though Tier 2 includes a competency in “Programming and Data Management,” the description
mentions only “Write data analysis code using modern statistical software (e.g., R, Python, and
SAS).” This set of competencies does not address a need for developing new software or
systems in support of data science, but relies on available tools.
Business Analytics Curriculum for Undergraduate Majors (INFORMS, 2015)
This report was produced in 2015 by the Institute for Operations Research and the Management
Sciences (INFORMS). Reflecting the focus of programs in Business, this INFORMS curriculum
assumes basic computer literacy as a starting point. It suggests revising some of the standard
courses in statistics to meet newer needs. The resulting course list includes: Data Management,
Descriptive Analytics, Data Visualization, Predictive Analytics, Prescriptive Analytics, Data
Mining, and Analytics Practicum. It also includes electives.
Like the guidelines from the Business-Higher Education Forum, the focus is on doing
something with data, primarily to serve business needs. There is no mention of programming.
The data management course includes SQL, but has no prerequisites. The emphasis in the data
mining course is on framing a business problem. Data mining techniques are compared, and
large datasets are to be used. The tools to be used for that purpose are not specified.
Initial workshops related to this ACM Data Science Curriculum effort (2015)
In October 2015, the National Science Foundation sponsored a workshop with representatives of
many perspectives on data science. Some attendees represented established programs, others
represented societies with an interest in data science. The final report, “Strengthening Data
Science Education Through Collaboration,” describes the discussions and reflects the diversity of
opinions. Although opinions varied, there were some areas of agreement. Those form the basis
of the list of Knowledge Areas in this current ACM report.
Summary
The review of existing curricular efforts suggests that it would be important to capture in a single
volume the contributions that computing makes to data science. Through developments such as
the Internet of Things, sophisticated sensors, face recognition and voice recognition, automation,
etc., computing opens up many avenues for data collection. Computing can also play a vital role
as a custodian of information, with great attention paid to maintenance and, crucially, to security
and confidentiality. Finally, the analysis of large amounts of information, and its use for machine
learning or augmented intelligence in their various roles, can bring significant benefit.
1.3 Committee Work and Processes
The Data Science Task Force was initiated at a meeting of the ACM Education Council in
August 2017. The Co-Chairs were appointed at the meeting and were charged with developing a
charter for the work, as well as assembling a task force with global representation.
The Co-Chairs drafted a proposal to create the Task Force, which was approved by the ACM
Education Board in January 2018. The initial Task Force – approximately two-thirds of the
members of the current committee – convened for a full-day meeting in February 2018.
In preparation for a second face-to-face meeting in July 2018, the Task Force designed two
surveys to gather input from academia and industry on the computing competencies most central
to Data Science. The results of the survey are presented in this report, with details provided in
Appendix B. During this time, the Co-Chairs invited additional members to join the committee
and began to develop a global advisory group.
At the July 2018 meeting, the ACM Task Force developed the set of computing-focused
Knowledge Areas for Data Science that appear in this report and began to articulate
competencies in each of those areas.
With the release of this first draft report, the ACM Data Science Task Force is calling for
discussion and feedback from all data science constituencies. The Task Force will be presenting
the report and gathering comments at conferences and meetings, including Educational Advances
in Artificial Intelligence (EAAI-9), held at AAAI in January 2019; the SIGCSE Symposium in
February 2019; and the Joint Statistical Meetings in July 2019. The Task Force also welcomes
feedback by email to the Co-Chairs:
1.4 Survey of academic and industry representatives
In order to gain an understanding of the current data science landscape, the ACM Data Science
Task Force conducted a survey of ACM members, representing academic institutions and
industry organizations. Through outreach to ACM members, the Task Force was also able to
reach computing professionals outside of ACM membership. In all cases, the Task Force sought
global participation. There were 672 responses to the academic survey and 297 responses to the
industry survey.
Academic Survey
The academic survey asked academics whether their institution had any sort of data science
program at the undergraduate level, asked what type of program was offered, in what
department(s) it was housed, and what computing areas were required, elective, or not present in
the program. It also allowed respondents to add to the list of computing areas specified in the
survey. Finally, the survey asked participants whether their data science program had a “data
science in context” requirement – i.e., a requirement that students apply data science to another
area.
Nearly half of respondents from academic institutions (47%) reported they did not offer an
undergraduate data science program. However, over half of those who reported offering some
type of program offered a full bachelor’s degree in data science.
Nearly all of the programs offering a bachelor’s degree in data science required courses in
programming skills and statistics. In addition, the majority of programs also required data
management principles, probability, data structures and algorithms, data visualisation, data
mining, and machine learning. Other courses included topics such as ethics, calculus, discrete
mathematics and linear algebra. We note that a majority of programs also required a “data
science in context” course.
Additionally, over half of these programs reported graduating 10 or fewer students annually.
We expect that the number of Data Science programs will increase, as will the number of
students choosing to study it. This, then, is an ideal time to articulate computing-based
competencies for those programs.
Industry Survey
The industry survey roughly mirrored the academic survey; however, the primary question was
whether a company looked for job applicants with data science experience and what computing
experience they required or preferred those applicants to have.
In the survey of industry representatives, nearly half (48%) responded that they look for
candidates specifically with data science or analytics degrees or educational backgrounds.
We found it particularly interesting that the majority of employers reported these employees
work as individual contributors on data science tasks.
Industry respondents reported requiring experience or skills in similar areas to those required by
college or university Data Science programs. One slight difference is that employers reported
requiring more computing skills than statistical or mathematical skills.
Other Observations
The ACM Task Force was somewhat surprised by certain survey results. For instance, industry
respondents did not report data security and privacy as a required competency area for job
applicants. We note that this may reflect employers’ understanding of what Data Science (and
Computer Science) programs are requiring of their majors. That is, it might reflect the reality of
the applicant pool, rather than a “wish list” of competencies.
Similarly, we note that academic institutions reported what they currently require, rather than
what they would require in an ideal world. This might, in some cases, reflect the availability of
courses and faculty at an institution, rather than a “gold standard” for Data Science programs.
1.5 Knowledge areas
Following the work of previous ACM curricular volumes (see [ACM 2013], for instance), this
report is organized around Knowledge Areas (KAs) whose origins are based on survey input (see
Section 1.4) as well as prior work, with special attention being given to the results of the
workshop reported in [CasselTopi 2015].
The core computing discipline-specific Knowledge Areas for Data Science, detailed in
Appendix A, are:
● Computing Fundamentals
● Data Acquirement and Governance
● Data Management
● Data Privacy, Security, and Integrity
● Machine Learning
● Data Mining
● Big Data
● Analysis and Presentation
● Professionalism
Other areas of computing may merit attention: sensors and sensor networks, the Internet of
Things, vision systems, among others.
In addition, for a full curriculum the above need to be augmented with courses covering calculus,
discrete structures, probability theory, elementary statistics, advanced topics in statistics, and
linear algebra.
1.6 Data Science in context
In addition to developing foundational skills in computing and statistics, data science students
should also learn to apply those skills to real applications. It is important for data science
education to incorporate real data used in an appropriate context.
Data Science curricula should include courses that combine data science fundamentals with
applications, exploring why people turn to data to explain contextual phenomena. Such courses
highlight how valuable context is in data analytics: data are viewed alongside narratives, and
questions often arise about ethics and bias. It can be beneficial to teach some courses within a
disciplinary context so that students appreciate that data science is not an abstract set of
approaches. Related application disciplines might include physics, biology, chemistry, the
humanities, or other areas.
1.7 Competency Framework
The Competency Framework structures the description of the various Knowledge Areas. Each
KA is described by some preliminary material followed by a set of topics and a set of associated
competencies; levels of competence vary, with some requiring greater expertise than others.
The details of the Competency Framework are described in Chapter 2. The descriptions of the
Knowledge Areas are then provided in Appendix A.
1.8 Motivating the study of data science
Those who study Data Science have to develop a mindset with a strong focus on data – the
collection of data and its appropriate analysis to bring about beneficial insights and changes. For
instance:
• Obtaining data about the quality of air in a city can result in removing dangerous
pollution or sending warning messages to those who suffer from asthma.
• Collecting data about traffic in real time can result in steps being taken to avoid traffic
congestion.
• Collecting patient data can lead to new insights for disease diagnosis and treatment.
• Recording data about speech in a certain area can assist with speech recognition.
The possibilities are endless, and the contributions that Data Science can make to transforming
businesses, transforming society and, fundamentally, shaping the future for the better are huge.
The possibilities also carry with them potentially negative consequences.
Students of Data Science need to be imbued with the ‘joy of data’, seeing data as the ‘currency
or fuel of our time’. They also need to be imbued with a strong sense of professional and ethical
responsibility. Data Science courses ought to reflect such sentiments; likewise the education of
data scientists.
The topic of careers is of course important from a marketing perspective. Suffice it to say that the
current demand is considerable and growing daily.
1.9 Overview of this report
Having set the scene in this chapter, the second chapter sets out the Competency Framework
used in describing the various Knowledge Areas in some detail. The computing-related KAs are
captured in Appendix A.
References
[ACM 2013] Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate
Degree Programs in Computer Science (ACM/IEEE 2013):
https://www.acm.org/education/CS2013-final-report.pdf
[ASA 2014] Curriculum Guidelines for Undergraduate Programs in Statistical Science (ASA
2014b): http://www.amstat.org/education/pdfs/guidelines2014-11-15.pdf
[BHEF 2016] Data Science and Analytics (DSA) Competency Map, version 1.0, Business-Higher
Education Forum (BHEF), November 2016.
[CasselTopi 2015] Strengthening Data Science Education Through Collaboration, by Lillian
Cassel and Heikki Topi, Technical Report and report of the 2015 NSF Workshop.
http://www.computingportal.org/sites/default/files/Data%20Science%20Education%20Workshop%20Report%20.0_0.pdf
[CUPM 2015] Curriculum Guide to Majors in the Mathematical Sciences (MAA 2015). See
http://www.maa.org/sites/default/files/pdf/CUPM/pdf/CUPMguide_print.pdf
[Edison 2015] Data science professional uncovered: How the EDISON project will
contribute to a widely accepted profile for data scientists, by Manieri, A.; Brewer, S.; Riestra, R.;
Demchenko, Y.; Hemmje, M.; Wiktorski, T.; Ferrari, T.; and Frey, J. Published in IEEE 7th
International Conference on Cloud Computing Technology and Science (CloudCom), 588–593.
[INF 2015] Business Analytics Curriculum for Undergraduate Majors, Coleen R. Wilder,
Ceyhun O. Ozgur (2015) published in INFORMS Transactions on Education 15(2):180-187.
https://doi.org/10.1287/ited.2014.0134
[NatAc 2018] Data Science for Undergraduates: Opportunities and Options, published by the
National Academies of Sciences, Engineering, and Medicine, 2018. Washington, DC: The
National Academies Press. https://doi.org/10.17226/25104
[Park City 2017] Curricular Guidelines for Undergraduate Programs in Data Science by
DeVeaux, R.; Agarwal, M.; Averett, M.; Baumer, B.; Bray, A.; Bressoud, T.; Bryant, L.; Cheng,
L.; Francis, A.; Gould, R.; Kim, A.; Kretchmar, M.; Lu, Q.; Moskol, A.; Nolan, D.; Pelayo, R.;
Raleigh, S.; Sethi, R.; Sondjaja, M.; Tiruviluamala, N.; Uhlig, P.; Washington, T.; Wesley, C.;
White, D.; and Ye, P. 2017. Annual Review of Statistics and Its Application 4:15–30.
Chapter 2: The Competency Framework
Much of the material in this chapter leans very heavily on (i.e., is taken verbatim from) the work
of IT2017 – see [IT2017]. The motivation for this is to maintain consistency across the set of
curriculum documents produced by ACM.
Learning outcomes are written statements of what a learner is expected to know and be able to
demonstrate at the end of a learning unit (or cohesive set of units, course module, entire course,
or full program).
In contrast with the wide agreement on the meaning of learning outcomes, there is extensive
confusion and vagueness around the terms competence and competency. Generally, the term
competence refers to the performance standards associated with a profession or membership in a
licensing organization. Assessing some level of performance in the workplace is frequently used
as a competence measure, which means measuring aspects of the job at which a person is
competent. Competencies are what a person brings to a job, conceptualized as qualities by which
people demonstrate superior job performance [Kli1].
There is general agreement in education that success in college and career readiness requires that
students develop a range of qualities [Ken1, Nas1, Nrc1], typically organized along three
dimensions: knowledge, skills, and dispositions. We utilize a working definition of competency
that connects knowledge, skills, and dispositions. Figure 2.1 (adapted from IT2017, which is, in
turn, adapted from [Ccs1, p. 5]) shows these interrelated dimensions of competency.
Figure 2.1: The interrelated dimensions of competency:
● Knowledge: mastery of content knowledge; transfer of learning
● Skills: capabilities and strategies for higher-order thinking; interactions with others and the
world around
● Dispositions: personal qualities (socio-emotional skills, behaviors, attitudes) associated with
success in college and career
• Knowledge designates a proficiency in core concepts (or topics) and content of Data
Science and the application of learning to new situations. This dimension usually gets most
of the attention: from teachers, when they design their syllabi; from departments, when
they develop program curricula; and from accreditation organizations, when they
articulate accreditation criteria.
• Skills refer to capabilities and strategies that develop over time, with deliberate practice
and through interactions with others and the real world [Nrc1]. Skills also require
engagement in higher-order cognitive activities, meaning that “hands-on” practice of
skills joins with “minds-on” engagement. The inextricable connection between
knowledge and skills is evident in Michael Polanyi’s characterization of explicit versus
tacit knowledge [Pol1]. Explicit knowledge, or “know-that,” reflects core ideas and
principles, and corresponds to the knowledge dimension in our definition. Tacit
knowledge, or “know-how,” is skillful action requiring sustained engagement and
practice. Problem-based assignments, real-world projects, and laboratory activities with
workplace relevance are examples of curriculum elements that focus on developing skills.
Well-designed syllabi and accredited programs are mindful of skill development when
they articulate student outcomes at course and program level.
A transmission theory of teaching, also known as teacher-focused, holds that knowledge emerges
as it transmits from the expert teacher to the inexpert learners with the objective of ‘getting it
across’ or covering all the topics in the material. The opposing theory of active learning is that
students themselves create meaning and develop understandings with the help of appropriately
designed learning activities. In undergraduate education, the active learning model underlies a
shift of the paradigm that has governed higher education institutions. The traditional paradigm of
providing instruction dominated by a passive lecture-based learning environment has shifted to
producing learning and creating experiences in which students are active participants in the
learning process [Bar1].
On a continuum of student learning from passive (attending a standard lecture) to active
(engaged in problem solving with peers), producing a high level of student engagement means
designing learning activities in which students do more than take notes, recall, observe, or
describe. Students learn more effectively when their active participation consists of asking
questions, applying concepts, discovering relationships, or generalizing a solution to new
situations [Big2]. Higher levels of engagement cannot be encouraged if teaching is only about
declarative and procedural knowledge: information, vocabulary, basic concepts, basic know-how,
and discrete skills [Wig1]. Students do need to acquire knowledge and develop basic skills, but
this is just a means to the more important preparation for authentic performance tasks and
transfer of learning to new situations.
Perkins and Blythe formulated a “performance perspective” of learning and offered the view that
“understanding something is a matter of being able to carry out a variety of performances
concerning the topic.” [Bly1, Bly2] A performance perspective of learning requires a “modicum
of transfer, because it asks the learner to go beyond the information given” and seeks to “...
transcend the boundaries of the topic, the discipline, or the classroom.” [Per1]
The conventional way of framing curriculum guidelines for computing programs has, until
recently, been content-driven: a disciplinary body of knowledge decomposes into areas, units,
and topics to track recent developments in a rapidly changing computing field. For this report,
we follow the approach of the IT2017 report, which used the Understanding by Design (UbD)
framework [Wig1] to present a competency-based curricular framework.
The idea of the UbD framework is to treat content mastery as a means, not the end, to long-term
achievement gains that a program of study envisions for its graduates. Learners could know and
do many discrete things, but still not be able to see the bigger picture, put it all together in
context, and apply their learning autonomously in new situations.
In the UbD framework, learning transfer is multi-faceted, as shown in Table 2.1 [Wig2]. We note
that these facets of learning transfer blend skills and dispositions. Explain, interpret, apply and
adjust are skills complemented by dispositions related to showing empathy, perceiving
sensitively, recognizing bias, considering various points of view, or reflecting on the meaning of
new learning and experiences. Dispositions relating to meta-cognitive awareness include being
responsible, adaptable, flexible, self-directed, and self-motivated, and having self-confidence,
integrity, and self-control. They also include how we work with others to achieve a common goal
or solution.
Table 2.1: Six facets of learning transfer adapted from Understanding by Design framework and
reproduced from IT2017.
Table 2.2: Performance verbs to generate ideas for performance goals and professional practice
(Reproduced from IT2017)
Explain: demonstrate, derive, describe how, exhibit, express, induce, instruct, justify, model,
predict, prove, show how, synthesize, teach
Interpret: create analogies, critique, document, evaluate, illustrate, judge, make sense of, make
meaning of, provide metaphors, read between the lines, represent, tell a story of, translate
Apply: adapt, build, create, debug, decide, design, exhibit, invent, perform, produce, propose,
solve, test, use
Demonstrate Perspective: analyze, argue, compare, contrast, criticize, infer
Show Empathy: assume role of, be like, be open to, believe, consider, imagine, relate, role play
Have Self-Knowledge: be aware of, realize, recognize, reflect, self-assess
References
[Bar1] Barr, R.B. and Tagg, J. 1995. From Teaching to Learning: A New Paradigm for
Undergraduate Education. Change, 27(5), 12-25.
[Big2] Biggs. J. 1999. Teaching for Quality Learning at University – What the Student
Does (1st Edition), SRHE / Open University Press, Buckingham.
[Bly1] Blythe, T. 1998. The teaching for understanding guide. San Francisco: Jossey-Bass.
[Bly2] Blythe, T., and Perkins, D. 1998. Understanding understanding. In T. Blythe (Ed.), The
teaching for understanding guide, 9-16. San Francisco: Jossey-Bass.
[Ccs1] Council of Chief State School Officers. 2013. Knowledge, Skills, and
Dispositions: The Innovation Lab Network State Framework for College, Career, and
Citizenship Readiness, and Implications for State Policy.
[Ken1] Kennedy, D., Hyland, Á., & Ryan, N. 2007. Writing and using learning
outcomes: a practical guide. Cork: University College Cork.
[Ken2] Kennedy, D., Hyland, A, and Ryan, N. 2009. Learning Outcomes and
Competences. Bologna Handbook. Introducing Bologna Objectives and Tools, B 2.3-3, 1-18.
[Kli1] Klink M. van der, Boon, J., and Schlusmans, K. 2007. Competences and
vocational higher education: Now and in future. European Journal of Vocational Training No 40
– 2007/1, 67-82.
[Nrc1] National Research Council. 2012. Education for Life and Work: Developing
Transferable Knowledge and Skills in the 21st Century. Washington, DC: The National
Academies Press. https://doi.org/10.17226/13398.
[Per2] Perkins, D., Jay, E., and Tishman, S. 1993. Beyond abilities: A dispositional
theory of thinking. Merrill-Palmer Quarterly, 39(1), 1-21.
[Pol1] Polanyi, M. 1966. The Tacit Dimension. University of Chicago Press: Chicago.
[Sch1] Schussler, D.L. 2006. Defining dispositions: Wading through murky waters. The
Teacher Educator, 41(4).
[Wig1] Wiggins, G.P., McTighe, J., and Ebrary, I. 2005. Understanding by design
(Expanded second edition). Alexandria, VA: Association for Supervision and Curriculum
Development.
[Wig2] Wiggins, G., and McTighe, J. 2011. The Understanding by Design Guide to
Creating High-Quality Units. Alexandria, VA: Association for Supervision and Curriculum
Development.
Appendix A: A Draft of Competencies for Data Science
Computing Fundamentals
Data scientists should be able to implement and understand algorithms for data collection and
analysis. They should understand the time and space considerations of algorithms. They should
follow good design principles when developing software, understanding the importance of those
principles for testability and maintainability.
Programming
Scope Competencies
Data Structures
A data scientist should know a variety of data structures, be able to use them, and understand the
implications of choosing one over another.
Scope:
● Classification of data storage, accessibility and complexities, based on implementation and
operation of domain-cluster problems and/or applications
● Analysis of a proper data structure that suits data formats and constraints
● Choice of an adequate data structure based on preliminary information about the data
Competencies:
● Compare various data structures for a given problem, such as array, list, set, map, stack,
queue, hash table, tree, and graph
● Compare the trade-offs of different representations of a matrix and common operations such
as addition, subtraction, and multiplication
● Recognize the data structures obtained after calling script-based subroutines
● Evaluate how efficient a data structure is for insert, remove, and access operations
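To make such comparisons concrete, here is a minimal illustrative Python sketch (the collection size and lookup count are arbitrary choices, not part of the competency specification) contrasting membership testing in a list, which scans linearly, with a set, which hashes:

```python
import timeit

# Membership testing: a list scans in O(n); a set hashes in O(1) on average.
items_list = list(range(100_000))
items_set = set(items_list)

list_time = timeit.timeit(lambda: 99_999 in items_list, number=1_000)
set_time = timeit.timeit(lambda: 99_999 in items_set, number=1_000)

print(f"list membership: {list_time:.4f}s  set membership: {set_time:.6f}s")
```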
Algorithms
A data scientist should recognize that the choice of algorithm will have an impact on the time
and space required for a problem. A data scientist should be familiar with a range of algorithmic
techniques in order to select the appropriate one in a given situation.
Scope:
● Problem solving through algorithmic, computational and statistical thinking
● Algorithm design, implementation, and analysis
● Comparison of the complexity of various well-known computing algorithms, including
machine learning (ML) and statistical techniques
● Complexity of a given algorithm
● Factors that influence algorithm complexity and performance
● Computational performance of certain algorithms when given different data sets
Competencies:
● Analyze the differences between iterative and recursive algorithms
● Implement an efficient search algorithm to find a target with certain characteristics
● Provide the big-O time and space complexity for a given procedure
● Evaluate best, average, and worst-case behaviors of an algorithm
● Apply an appropriate algorithmic approach to a given problem
● Contrast which technique is more appropriate to use in a given scenario
Software Engineering
Software engineering principles include design, implementation and testing of programs. A data
scientist should understand design principles and their implications for issues such as
modularization, reusability, and security.
Scope:
● Software engineering principles, including design, implementation and testing of programs
● Principles of object-oriented design, such as encapsulation, inheritance and polymorphism, to
address concerns such as modularization, reusability and security
● Principles of functional programming to maintain complex scaling applications and
model/function composition
● Principles of compiled imperative programming for numeric computations and scientific
computing
● Probabilistic computing for testing and the software lifecycle
Competencies:
● Implement a small software project that uses a defined coding standard
● Incorporate statistical models into the software lifecycle
● Evaluate results of a program by utilizing statistical significance testing
● Demonstrate how software interacts with various systems, including information
management, embedded, process control, and communications systems
● Test a given piece of code, including security, unit testing, system testing, integration testing,
and interface usability
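One way to exercise the coding-standard and testing competencies together is a unit-tested helper; the sketch below (the zscore function and its expected behaviour are invented for illustration) uses Python's standard unittest module:

```python
import unittest


def zscore(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    if var == 0:
        raise ValueError("constant input has no z-scores")
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]


class TestZScore(unittest.TestCase):
    def test_standardized_mean_is_zero(self):
        scores = zscore([1.0, 2.0, 3.0, 4.0])
        self.assertAlmostEqual(sum(scores) / len(scores), 0.0)

    def test_constant_input_rejected(self):
        with self.assertRaises(ValueError):
            zscore([5.0, 5.0, 5.0])


if __name__ == "__main__":
    unittest.main()
```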
Data Acquirement and Governance
There can be no analysis of data without the data itself. A data scientist must understand the
source and quality of their data, as well as understand appropriate processes for acquiring and
maintaining high quality data.
Scope:
● Shaping data and their relationships
● Acquiring data from the physical world and extracting data to a form suitable for analysis
● Integrating heterogeneous data sources
● Preprocessing and cleaning data for applications
Competencies:
● Construct and tune the data acquirement and governance process according to the
requirements of applications, including the selection of data sources and data acquirement
equipment, and data preparation algorithms and steps (Process Construction and Tuning)
● Define and write semantics rules for data acquirement and governance, including information
extraction, data integration and data cleaning (Rules Definition)
● Develop scalable and efficient algorithms for data acquirement and governance according to
the properties of the data and the requirements of applications, including data property
discovery, data acquirement, information extraction, data integration, data sampling, data
reduction, data compression, data transformation and data cleaning algorithms (Algorithm
Development)
● Describe and discover the static and dynamic properties of data, the changing mechanisms of
data, and similarity between data (Property Description and Discovery)
Data Management
A data scientist must understand the storage, maintenance, and retrieval of data.
Scope:
● Storing and indexing (structured, semi-structured and unstructured) data
● Data models; query languages based on the data model
● Effective conceptual models and architectures for databases
● Data retrieval: queries, keywords, efficiency
● Processing transactions in database management systems
● Scaling database management systems
Competencies:
● Design the logical and physical structure for effective data management according to data
type, data model and application
● Design index structures for efficient query processing and information retrieval
● Describe the semantic requirements of data access in a declarative language or a keyword set
● Tune and optimize the storage structure and query processing in data management systems
for scalability and efficiency
● Determine a strategy for transaction processing to balance efficiency, scalability and
consistency of data management systems, especially in parallel and distributed environments
● Design scalable and efficient algorithms for query processing, query optimization, and
transaction processing, as well as information retrieval
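To ground the indexing competency, a hedged sketch using Python's built-in sqlite3 module (the readings table and its contents are made up for illustration) shows an index turning a full-table scan into a B-tree search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor_id INTEGER, ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(i % 50, f"2019-01-{(i % 28) + 1:02d}", i * 0.1) for i in range(10_000)],
)

# Without an index the query below scans all rows; with it, a B-tree lookup.
conn.execute("CREATE INDEX idx_sensor ON readings (sensor_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT avg(value) FROM readings WHERE sensor_id = 7"
).fetchall()
print(plan)  # the plan should report a search using idx_sensor, not a scan
```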
Data Privacy, Security, and Integrity
Data Privacy
This is intended to provide students with an understanding of data privacy and its related
challenges. Students are expected to understand the tradeoffs of sharing and protecting sensitive
information; and how domestic and international privacy rights impact a company’s
responsibility for collecting, storing, and handling data. [xref: Professionalism: Privacy and
Confidentiality]
Scope:
● Interdisciplinary tradeoffs of privacy and security
● Individual rights and the needs of society
● Technologies to safeguard data privacy
● Relationships between individual, organizational, and governmental privacy requirements
Competencies:
● Evaluate and understand the concept of privacy, including the impact on the societal
definition of what constitutes personally private information and the tradeoffs between
individual privacy and security
● Summarize the tradeoff between the individual's right to privacy and the needs of society
● Evaluate common practices and technologies, and identify the tools that reduce the risk of
data breaches while safeguarding data privacy
● Thoroughly comprehend how organizations with international engagement must consider
variances in privacy laws, regulations, and standards across the jurisdictions in which they
operate. This topic includes how laws and technology intersect in the context of the judicial
structures that are present – international, national and local – as organizations safeguard
information systems from cyberattacks
Data Security
This focuses on the protection of data at rest, during processing, and in transit. This area requires
the application of mathematical and analytical algorithms.
Scope:
● Basic concepts in cryptography: encryption/decryption, sender authentication, data integrity,
non-repudiation; attack classification (ciphertext-only, known plaintext, chosen plaintext,
chosen ciphertext); secret-key (symmetric) cryptography and public-key (asymmetric)
cryptography
● Role of mathematical techniques for encryption
● Role of symmetric (private-key) ciphers for data security
● Role of asymmetric (public-key) ciphers for data security
● Cross-border privacy and data security laws
● Data security laws and their impact
Competencies:
● Describe the purpose of cryptography and list ways it is used in data communications, and
which cryptographic protocols, tools and techniques are appropriate for a given situation.
Describe the terms cipher, cryptanalysis, cryptographic algorithm, and cryptology, and
describe the two basic methods (ciphers) for transforming plaintext into ciphertext. Explain
how public key infrastructure supports digital signing and encryption, and discuss
limitations/vulnerabilities
● Exhibit a mathematical understanding of encryption algorithms, covering topics such as
modular arithmetic, the Fermat and Euler theorems, primitive roots, the discrete log problem,
primality testing, factoring large integers, elliptic curves, lattices and hard lattice problems,
abstract algebra, finite fields, and information theory
● Describe methods for data security, such as block ciphers and stream ciphers (pseudo-random
permutations, pseudo-random generators), Feistel networks, the Data Encryption Standard
(DES), and the Advanced Encryption Standard (AES)
● Describe how mathematical concepts (such as computational complexity) contribute to
algorithms for data security
● Explain the requirements of the General Data Protection Regulation (GDPR), and the Privacy
Shield agreement between countries, such as the United States and the United Kingdom,
allowing the transfer of personal data
● Describe how certain laws [such as the following in the USA: Section 5 of the Federal Trade
Commission Act, state data security laws, state data-breach notification laws, the Health
Insurance Portability and Accountability Act (HIPAA), the Gramm-Leach-Bliley Act
(GLBA), information sharing through US-CERT, and the Cybersecurity Act of 2015] impact
data security
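As a taste of the mathematics behind these competencies, the sketch below implements the Fermat primality test from modular exponentiation alone. It is probabilistic, and Carmichael numbers can fool it, so real systems use stronger tests such as Miller-Rabin; the test values are chosen only for illustration:

```python
import random


def fermat_probably_prime(n: int, trials: int = 20) -> bool:
    """Fermat test: if n is prime, a**(n-1) == 1 (mod n) for all a coprime to n."""
    if n < 4:
        return n in (2, 3)
    for _ in range(trials):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:  # built-in modular exponentiation
            return False           # witness found: n is definitely composite
    return True                    # no witness found: n is probably prime


print(fermat_probably_prime(2**61 - 1))  # True: a known Mersenne prime
print(fermat_probably_prime(2**61 + 1))  # False: divisible by 3
```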
Data Integrity
Data integrity refers to the overall soundness, completeness, accuracy, and consistency of data.
Scope:
● Approaches to the accuracy and consistency (validity) of data
Competencies:
● Explain the concepts of message authentication codes (HMAC, CBC-MAC), digital
signatures, authenticated encryption, and hash trees that provide data integrity
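To illustrate one of these mechanisms, the short sketch below uses Python's standard hmac and hashlib modules (the key and message are invented for illustration); the receiver recomputes the tag and compares it in constant time:

```python
import hashlib
import hmac

key = b"shared-secret-key"  # hypothetical shared key, for illustration only
message = b"patient_id=42,reading=98.6"

tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# The receiver recomputes the tag; compare_digest resists timing attacks.
recomputed = hmac.new(key, message, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, recomputed)
print("message integrity verified")
```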
Machine Learning
Machine learning refers to a broad set of algorithms and related concerns for discovering patterns
in data, making new inferences based on data, and generally improving the performance of a
software system without direct programming. These methods are critical for data science. Data
scientists should understand the algorithms they apply, be able to implement them, if necessary,
and make principled decisions about their use.
Scope:
• Broad categories of machine learning approaches (e.g., supervised and unsupervised) that
make assumptions about the data available at learning time and the general types of
inferences that can be made from that data
• Algorithms and tools (i.e., implementations of those algorithms) in each of the broad learning
categories
• Machine learning as a set of principled algorithms (e.g., optimization algorithms), rather than
as a “bag of tricks”
• Computational learning theory and what it tells us about the theoretical limitations of various
approaches
• The notion of a hypothesis space of learning outcomes and its relationship to the expressive
power of learned models
• Problems related to model expressivity as well as availability of data, and techniques for
mitigating their effects: e.g., the problem of overfitting and regularization techniques for
mitigating its effects; the curse of dimensionality and feature
selection/weighting/reformulation techniques for mitigating its effects
• Ways to evaluate performance, both in terms of specifying objectives (e.g., predictive
accuracy, cost-sensitivity, size of learned model) and in techniques for measuring them
• Methodology for evaluating the model produced by a machine learning algorithm/tool for a
single problem; methodology for empirically comparing algorithms against each other more
generally
• Differences in interpretability of learned models
• Model drift over time
• Algorithmic and data bias, integrity of data, and professional responsibility for fielding
learned models
Competencies:
• Compare and contrast broad classes of learning approaches, with a focus on inputs, outputs,
and the ranges of problem types to which they can be applied
• Select and apply a broad range of machine learning tools/implementations to real data
• Derive a (current) learning algorithm from first principles and/or justify a (current) learning
algorithm from a mathematical, statistical, or information-theoretic perspective
• Express formally the representational power of models learned by an algorithm, and relate
that to issues such as expressiveness and overfitting
• Exhibit knowledge of methods to mitigate the effects of overfitting and the curse of
dimensionality in the context of machine learning algorithms
• Provide an appropriate performance metric for evaluating machine learning algorithms/tools
for a given problem
• Apply appropriate empirical evaluation methodology to assess the performance of a machine
learning algorithm/tool for a problem
• Apply appropriate empirical evaluation methodology to compare machine learning
algorithms/tools to each other
• Implement machine learning programs from their algorithmic specifications
• Be aware of problems related to algorithmic and data bias, as well as privacy and integrity of
data
• Consider and evaluate the possible effects – both positive and negative – of decisions arising
from machine learning conclusions
• Compare differences in interpretability of learned models
Data Mining
Data mining involves the application of machine learning and statistical techniques to extrapolate
information from data.
Scope:
● Workflow of data mining and its relationship to data preparation and data management
● Data mining models for a variety of data models and applications
● Design and analysis of data mining algorithms for various data mining models
Competencies:
● Design data mining models for specific data models according to applications (Model
Design)
● Design a data mining system, including the system architecture, data process flow, and data
storage structure (System Design)
● Develop efficient and scalable data mining algorithms for specific data models and data
mining models, as well as data management platforms (Algorithm Development)
● Evaluate the significance and usability of data mining results to ensure that they may be
applied properly in real applications (Result Evaluation)
Big Data
The term ‘Big Data’ has been coined to describe systems that are truly large. These introduce
problems of scale: how to store vast quantities of data, how to be certain the data is of high
quality, how to process it efficiently, and how to derive insights that prove useful. These matters
are addressed below under the headings of problems of scale, complexity theory, sampling and
filtering, and concurrency and parallelism.
Problems of Scale
Scope:
● Approaches to storing vast quantities of data
● Ensuring clean, consistent and representative data
● Protecting and maintaining the data
● Retrieval issues
● Problems of computation and the efficiency of algorithms
● Specific techniques used in addressing the problems of scale
Competencies:
● Explain the role of the storage hierarchy in dealing with Big Data
● Demonstrate how redundancy may be removed from a Big Data set
● Illustrate the role of hashing in dealing with Big Data
● Illustrate the role of filtering in dealing with Big Data
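To illustrate the roles of hashing named above, the sketch below (record names and worker count are invented) uses a stable hash both to drop duplicates without comparing full records and to partition keys across workers:

```python
import hashlib


def partition(key: str, n_workers: int) -> int:
    """Stable hash partitioning: the same key always lands on the same worker."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


records = ["user:alice", "user:bob", "user:alice", "user:carol"]
seen = set()
for r in records:
    h = hashlib.sha1(r.encode()).hexdigest()
    if h in seen:
        continue  # duplicate removed by fingerprint, not by full comparison
    seen.add(h)
    print(f"record {r!r} -> worker {partition(r, 4)}")
```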
Complexity Theory
Scope:
● The notion of computational complexity
● Limits to complexity
● Evaluation of the complexity of algorithms
● Selecting appropriate algorithms
Competencies:
● Explain why mathematical analysis alone is not always sufficient in dealing with efficiency
considerations
● Demonstrate how to evaluate the efficiency of an algorithm to be used in processing Big Data
● Select algorithms appropriate to a particular application involving Big Data, taking account
of the problems of scale
Sampling and Filtering
Scope:
● The role of sampling and filtering in the processing of Big Data
● Benefits of sampling / filtering
● Criteria to be used in guiding typical sample selection
Competencies:
● Perform sample selection for a particular application involving Big Data
● List a variety of approaches to filtering, illustrating their use
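A standard technique for sample selection over a stream too large to store is reservoir sampling; the sketch below (algorithm R, with arbitrary stream and sample sizes) keeps a uniform sample of k items without knowing the stream length in advance:

```python
import random


def reservoir_sample(stream, k: int):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample


print(reservoir_sample(range(1_000_000), 5))
```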
Concurrency and Parallelism
Scope:
● Concurrency and parallelism, and distributed systems
● Limitations of parallelism, including the overheads
● Differing approaches to addressing concurrency
● Complexity of parallel / concurrent algorithms
Competencies:
● Explain the limitations of concurrency / parallelism in dealing with problems of scale
● Identify the overheads associated with parallelism in particular algorithms
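The overhead point can be demonstrated directly; in the sketch below (the task and sizes are arbitrary) a four-process pool often loses to a serial loop on tiny tasks, because process start-up and argument pickling dominate the useful work:

```python
import time
from multiprocessing import Pool


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    data = list(range(200_000))

    t0 = time.perf_counter()
    serial = [square(x) for x in data]
    t1 = time.perf_counter()

    with Pool(4) as pool:  # parallel map across four worker processes
        parallel = pool.map(square, data)
    t2 = time.perf_counter()

    assert serial == parallel
    # For tiny per-item tasks, the parallel overheads often outweigh the gain.
    print(f"serial: {t1 - t0:.3f}s  parallel: {t2 - t1:.3f}s")
```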
Analysis and Presentation
The human-computer interface provides the means whereby users interact with computer
systems. The quality of that interface significantly affects usability in all its forms and can
encompass a vast range of technologies: animation, visualisation, simulation, speech, video,
recognition (of faces, of handwriting, etc.), and graphics. For the data scientist it is important to
be aware of the range of options and possibilities, and to be able to deploy these as appropriate.
Scope:
• Importance of effectively presenting data, models, and inferences to clients in oral, written,
and graphical formats
• Visualization techniques for exploring data and making inferences, as well as for presenting
information to clients
• Effective visualizations for different types of data, including time-varying data, spatial data,
multivariate data, high-dimensional multivariate data, tree- or graph-structured data, and text
• Knowing the audience: the client or audience for a data science project is not, in general,
another data scientist
• Human-computer interface considerations for clients of data science products
Competencies:
• Explain data and inferences made from data in oral, written, and graphical formats
• Use standard APIs and tools to create visual displays of data, including graphs, charts, tables,
and histograms
• Apply a variety of visualization techniques to different types of data; make useful inferences /
extract useful information from a dataset using those techniques
• Propose a suitable visualization design for a particular combination of data characteristics
and application tasks
• Analyze the effectiveness of a given visualization for a particular task
• Describe issues related to scaling data visualization from small to large data sets
• Be aware that the client (for an interface or presentation) is often not a data scientist
• For an identified client, undertake and document an analysis of their needs
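As a small example of a client-facing visual display, the sketch below uses matplotlib (the air-quality numbers are invented; the labelled axes and units are the point) to produce a static chart suitable for a report:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
pm25 = [38, 35, 29, 22, 18, 15]  # illustrative air-quality readings

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, pm25, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("PM2.5 (µg/m³)")  # units keep the chart honest for the client
ax.set_title("Average particulate levels, city centre")
fig.tight_layout()
fig.savefig("pm25_trend.png")  # a static artifact for a client report
```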
Professionalism
In their technical activities, data scientists should behave in a responsible manner that brings
credit to the profession. One aspect of this is being positive and proactive in seeking to bring
benefit and doing so in a way that is responsible and ethical. Much of this is amplified in general
terms in [1]. The sections below highlight a number of relevant issues of specific concern to the
data scientist. A number of sub-areas are identified: continuing professional development,
communication, teamwork, economic considerations, privacy and confidentiality, ethical issues,
legal considerations, intellectual property, and automation.
Continuing Professional Development
The essence of being a professional is competence in certain aspects of data science. It is the
responsibility of the professional to undertake only tasks for which they are competent. There are
then implications for keeping up to date in a manner that is demonstrable to interested parties,
e.g. employers.
Scope:
● The meaning of competency and being able to demonstrate competency
● Acquiring expertise / mastery or extending competency; the role of journals, conferences,
courses, webinars
● Technological change and its impact on competency
● The role of professional societies in CPD and professional activity
Competencies:
● Justify the importance to the professional data scientist of maintaining competence
● Describe the steps that would typically have to be taken to extend competence or acquire
mastery, explaining the advantages of the latter
● Argue the importance of the role of professional societies in supporting career development
Communication
There are various contexts in which the data scientist is required to undertake communication
with very diverse audiences. That communication may be oral, written or electronic. There is the
need to be able to engage in discussion about the role that data science can play, to communicate
multiple aspects of the data science process with colleagues, to convey results that may lead to
change or may provide insights. Being able to articulate the need for change and being sensitive
to the consequences are important attributes. These activities may entail the ability to have a
discussion about limitations in a certain context and to suggest a research topic.
Scope Competencies
Teamwork
The data scientist will often act as a member of a team. This may entail being a team leader, or
supporting the work of a team. It is important to understand the nature of the different roles as
well as the typical dynamics of teams. So in terms of teamwork the data scientist needs to be able
to collaborate not only with data scientists with different tool sets but, in general, with a diverse
group of problem solvers.
Professionalism: Teamwork
Scope:
● Team selection, and the need to complement the abilities and skills of team members
● The dynamics of teams and team discipline
● Elements of effective team operation
Competencies:
● Document and justify the considerations involved in selecting a team to undertake a specific
data science investigation
● Recognise the qualities desirable in the team leader for a data science research investigation
Economic considerations
Data scientists need to be able to justify their own positions as well as the kind of activity in
which they engage.
Scope:
● The cost and value of high quality data sets, and the costs of maintenance
● Justifying data science activity in cost terms
● Estimating the cost of projects
● Promoting data science
● Automation
Competencies:
● Predict the value of a particular data set to an organization, taking into account any
requirement for maintenance
● Argue the case for the data that an organization should routinely gather
● Document the cost (in terms of resources generally) of collecting high quality data for a
particular purpose
● Justify, or otherwise, the creation of a particular data science activity within an organization
and quantify the cost
● Infer the value to an organization of undertaking a particular investigation or research project
● Document and quantify the resources needed to carry out a particular investigation in house
and compare this with outsourcing the activity
● Evaluate and justify the costs associated with the automation of a particular activity
Privacy and Confidentiality [xref: Data Privacy, Security, Integrity]
It is possible to gain access to data in a multitude of ways: by accessing databases, using surveys
or questionnaires, taking account of conditions of access to some resource, and even through
developments such as the Internet of Things, specialized sensors, video capture and surveillance
systems. Although gaining access to all kinds of information is important, this must be done
legally and in such a way that the information is accurate and the rights of individuals, as well as
organizations and other groups, are protected.
Scope:
● Freedom of information
● Data protection regulations, including GDPR – see [5]
● Privacy legislation
● Maintaining the confidentiality of data
● Threats to privacy and confidentiality
● The international dimension
Competencies:
● Describe technical mechanisms for maintaining the confidentiality of data
● Compare the privacy legislation in two specific countries, and indicate the problems arising
from the differences
● Recognize the privacy and confidentiality issues arising from the use of video and face
recognition software
Ethical issues
Ethical issues are of vital importance for all involved in computing and information
activities. Such issues are captured extensively in [1]. Underpinning these is the view that a
professional should undertake only tasks for which they are competent, and even then should
carry out such tasks in a way that reflects good practice in its many forms. Maintaining or
extending competence is essential. A heightened awareness of legal and ethical issues must
underpin the work of the data scientist.
Teaching students to consider the ethical issues associated with their decisions is a very
important starting point, enabling them to recognize themselves as “independent, ethical
agents.”
Scope:
● Confidentiality issues associated with data and its use
● The General Data Protection Regulation (GDPR) – see [5]
● The need for data to be truly representative
● Bias in data and in algorithms; mechanisms for checking
Competencies:
● Demonstrate techniques for establishing lack of bias in a set of data or in algorithms
● Create a technical paper on an aspect of data science for colleagues
● Reflect on a network of professionals in the data science area and outline the advantages to
be gained by joining such a network
Legal Considerations
Computer crime has continued to increase in both volume and severity over recent years,
bringing disruption, even chaos, to many organizations. The threat of computer crime cannot be
ignored, and steps need to be taken to counter the possibility of severe disruption. The law has
adjusted to counter these trends, but this remains an ongoing area of change and adjustment.
Scope:
● Computer crime – examples of most relevance to data science
● Cyber security
● Crime prevention
● Mechanisms for detecting criminal activity, including the need for diverse approaches
● Recovery mechanisms and maintaining 100% operation
● Laws to counter computer crime
Competencies:
● Illustrate and evaluate a range of mechanisms for detecting a stated form of criminal activity
● Justify the desirability of having multiple diverse approaches to countering threats
Intellectual Property (IP)
Intellectual Property rights such as copyright, patents, designs, trademarks and moral rights exist
to protect the creators or owners of creations of the human mind; moral rights include the right to
be named as a creator of IP, and the right to avoid derogatory treatment of creations. For the
data scientist the items to be protected, in possibly different ways, include software, designs
(including GUIs), data sets, moral rights and reputation. Trade secrets may also be relevant.
Scope Competencies
Strategic Change
One possible outcome of a data science investigation is that strategic change is needed in an
organization. The change suggested may be minimal at one extreme or transformational at the
other. The data scientist needs to be alert to the range of possibilities and, perhaps by engaging
other experts, be in a position to offer advice and guidance about how to move forward with such
change, to advise on the consequences, and to outline and quantify the resources that will be
required to deliver on the change.
Scope:
● The need for strategic change; the role of simulation
● Structural change, transformational change
● Strategies for delivering effective change, including top-down and bottom-up approaches
● Resource needs associated with change
● People issues associated with change: managing resistance to change, the role of human
resources and communication
● Monitoring the effectiveness of change
● The role of automation
Competencies:
● Justify with evidence the need for strategic change within an organisation and recognise the
nature of the change required (e.g. personnel change, structural change, transformational
change)
● Provide a set of feasible approaches for dealing with transformational change in a given
situation, quantify the resources needed, and highlight the benefits
● Outline a range of strategies for managing resistance to change
● Identify teamwork, leadership and personnel issues associated with change, including gaining
support from management, maintaining operation while implementing change, and
addressing loss of employment
● Recognise the issues associated with change brought about by automation (including ethical
issues), with the possible need for back-up
On Automation
Automation often creates concerns about loss of employment opportunities and, in general terms,
about machines behaving unreasonably; explanations from machines about their behavior may
be sought. Related issues are the subject of [3] and [6]. Automation can occur in critical
situations where serious loss may be possible, and then typically there is an expectation that
machines will operate to a code of ethics in sympathy with that of humans.
Professionalism: On Automation
Scope:
● Automation, its benefits and its justification
● The particular concerns of automation in critical situations
● Transparency and accountability in algorithms
Competencies:
● Analyze the impact on design of the requirement to provide insights into decisions made
autonomously by machines
● Argue the benefits of automation in particular situations
References
[1] The ACM Code of Ethics and Professional Conduct, published by ACM on 17th July 2018.
[3] ACM US Public Policy Council and ACM Europe Policy Council, “Statement on
Algorithmic Transparency and Accountability,” 2017.
[6] Simson Garfinkel, Jeanna Mathews, Stuart S. Shapiro, Jonathan M. Smith, Towards
Algorithmic Transparency and Accountability, Communications of the ACM, September
2017, vol. 60, no. 9, page 5.
Appendix B: A Summary of Survey Responses