Data Science and Machine Learning in Education
ABSTRACT
The growing role of data science (DS) and machine learning (ML) in high-energy physics
(HEP) is well established and pertinent given the complex detectors, large datasets
and sophisticated analyses at the heart of HEP research. Moreover, exploiting symme-
tries inherent in physics data has inspired physics-informed ML as a vibrant sub-field
of computer science research. HEP researchers benefit greatly from materials widely
available for use in education, training and workforce development. They are
also contributing to these materials and providing software to DS/ML-related fields. In-
creasingly, physics departments are offering courses at the intersection of DS, ML and
physics, often using curricula developed by HEP researchers and involving open software
and data used in HEP. In this white paper, we explore synergies between HEP research
and DS/ML education, discuss opportunities and challenges at this intersection, and
propose community activities that will be mutually beneficial.
1 Introduction
The particle physics research community has a strong background and involvement in edu-
cational activities. Not only do many of its practitioners come from universities and centers
for education, but the community also provides training and educational resources to facil-
itate our science and convey its importance to members of the public and policy makers.
Particle physics holds a prominent role within academic curricula at institutions of
learning. There are compelling reasons for this prominent role, such as the fundamental
nature of our science, fascinating historical development of our field, theoretical research
that applies (and often develops) advanced mathematics, powerful applications such as
cancer treatment, and high-visibility spin-off technologies such as the World Wide Web.
Data science and machine learning have an increasingly prominent role in our science,
as is evident from any recent particle physics conference and this Snowmass process. In
recent years, machine learning techniques for detector and accelerator control [1], data sim-
ulation [2], parton distribution functions [3], reconstruction [4–6], anomaly detection [7, 8]
and data analysis are increasingly being applied to particle physics research. Recent reviews
of these techniques applied to particle physics research can be found in [9–18]. A “liv-
ing review” aiming to provide a comprehensive list of citations for those in the particle
physics community developing and applying machine learning approaches to experimental,
phenomenological, or theoretical analyses can be found in [19].
There is no consensus on the precise definitions of data science and machine learning. For
our purposes, we consider data science to refer to scientific approaches, processes, algorithms
and systems used to extract meaning and insights from data [20] and machine learning to
refer to techniques used by data scientists that allow computers to learn from data. Machine
learning is a subset of the field of artificial intelligence which aims to develop systems that
can make decisions typically requiring human-level expertise, possessing the qualities of
intentionality, intelligence and adaptability [21]. Figure 1 illustrates these relationships.
Particle physicists increasingly collaborate with computer scientists and industry partners
to develop “physics-driven” or “physics-inspired” machine learning architectures and
methods. However, the particle physics community in the U.S. has been generally slow
to adopt data science and machine learning as formal components in educational curricula.
This situation is rapidly improving. The potential synergies between education and particle
physics research in the areas of data science and machine learning motivate this study.
Figure 1: Simplified illustration of data science and machine learning in the context of
data, models and statistical inference.
In this Snowmass white paper, we explore some of the challenges and opportunities for
data science and machine learning in
education and suggest future directions that could benefit the particle physics community.
2 Educational Pathways
There are many ways that particle physics researchers at all levels provide educational
opportunities to students and other trainees. Mentoring in research activities is a key
educational delivery method that provides an opportunity for professional development of
early career individuals and bi-directional learning inherent in scientific research.
In addition to traditional advising of undergraduate and graduate students in thesis research,
there are dedicated programs such as the NSF Research Experience for Undergraduates [22],
masterclasses (e.g., [23]), and capstone project-based courses (e.g., [24]). These programs
are multidisciplinary and often engage directly with industry, which creates strong
opportunities for students to pursue careers within or beyond academia.
Particle physicists also develop curricula at the intersection of data science, machine
learning and physics for use in undergraduate and graduate-level courses in their home
department(s). These curricula might represent whole courses or specific teaching modules
which could be used in multiple courses or other forms of educational delivery. A few
examples are provided in Sec. 3. These materials are often drawn from and feed into
educational and training materials from particle physics research, such as schools (e.g., [25–
27]), training events (e.g., [28, 29]), workshops (e.g., [30–34]) and bootcamps (e.g., [35, 36]).
topics (by the domain mentors). During the second quarter, students propose and exe-
cute a new project that extends the work of the previous quarter. Possible projects include
studying the performance of different message passing and graph neural network structures,
studying mass decorrelation strategies, applying explainable AI techniques (like layerwise
relevance propagation) to the Higgs tagging task, comparing multi-class to binary
classification, and developing a network for Higgs boson jet mass regression.
Physics and Data Science (MIT) [44]: This is a course developed by Phil Harris that aims
to present modern computational methods by providing realistic examples of how these
methods apply to physics research. Topics include: Poisson statistics, error
propagation, fitting, data analysis statistical measures, hypothesis testing, semi-parametric
fitting, deep learning, Monte Carlo simulation techniques, Markov-chain Monte Carlo, and
numerical differential equations. The class format is a mixture of lectures by course faculty
(and guest speakers to highlight the relevance of this work towards department research),
recitations run by undergraduate and graduate students, and completion of three projects
with a final presentation based on an extension of any one of the three projects. The
projects include data analysis in gravitational waves, cosmic microwave background, and
LHC jet physics using open data.
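As a hedged sketch of the flavor of these topics, a Poisson rate can be estimated by maximum likelihood in a few lines of NumPy (the pseudo-data and true rate below are invented for illustration, not taken from the actual course materials):

```python
import numpy as np

# Pseudo-experiment: 1000 Poisson-distributed event counts with true rate 4.2
rng = np.random.default_rng(seed=0)
counts = rng.poisson(lam=4.2, size=1000)

# Poisson negative log-likelihood, dropping the mu-independent log(n!) term
def nll(mu, data):
    return np.sum(mu - data * np.log(mu))

# Scan a grid of candidate rates and take the minimum
mu_grid = np.linspace(1.0, 10.0, 1801)
mu_hat = mu_grid[np.argmin([nll(mu, counts) for mu in mu_grid])]

# The analytic MLE of a Poisson rate is the sample mean, so the two agree
print(mu_hat, counts.mean())
```

The grid scan generalizes directly to likelihoods without a closed-form minimum, which is the usual situation in a physics fit.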
Introduction to Machine Learning (Princeton) [45]: This is a course developed by Savannah
Thais and is primarily intended for non-computer science students who want to understand
the foundations of building and testing an ML pipeline, different model types, important
considerations in data and model design, and the role ML plays in research and society.
Topics covered in lectures and exercises include conceptual foundations of ML, artificial
neural networks, convolutional and recurrent neural networks, unsupervised learning, gen-
erative models, and topics in AI ethics such as data bias, algorithmic auditing, predictive
policing, inequitable utilization of algorithms, and proposed regulation.
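The “foundations of building and testing an ML pipeline” can be sketched minimally with scikit-learn (the synthetic two-blob dataset and model choice below are illustrative assumptions, not the course's actual exercises):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic two-class data: two Gaussian blobs in 2D
rng = np.random.default_rng(seed=1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(500, 2)),
               rng.normal(+1.0, 1.0, size=(500, 2))])
y = np.array([0] * 500 + [1] * 500)

# The basic pipeline: split, fit, evaluate on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

Each piece (split, fit, held-out evaluation) corresponds to a stage students must understand before moving on to deeper models.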
Data Analysis and Machine Learning Applications for Physicists (Illinois) [46]: This is a
course developed by Mark Neubauer which aims to teach the fundamentals of analyzing
and interpreting scientific data and applying modern machine learning tools and techniques
to problems commonly encountered in physics research. The class format is a combination
of lectures, homework problems that elaborate on topics from the lectures and give stu-
dents hands-on experience with data, and a final project that students can choose from in
the areas of particle physics and astrophysics. Topics covered include handling, visualizing
and finding structure in data, adapting linear methods to nonlinear problems, density esti-
mation, Bayesian statistics, Markov-chain Monte Carlo, variational inference, probabilistic
programming, Bayesian model selection, artificial neural networks and deep learning.
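One of the listed topics, Markov-chain Monte Carlo, admits a compact textbook illustration: a random-walk Metropolis sampler targeting a standard normal density (a generic sketch with invented settings, not drawn from the course materials):

```python
import numpy as np

# Random-walk Metropolis sampling of a standard normal target density
rng = np.random.default_rng(seed=2)

def log_target(x):
    return -0.5 * x**2  # log of an unnormalized standard normal

samples = np.empty(20000)
x = 0.0
for i in range(samples.size):
    proposal = x + rng.normal(0.0, 1.0)
    # Accept with probability min(1, p(proposal)/p(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples[i] = x

# After burn-in, the chain reproduces the target's mean (0) and width (1)
print(samples[5000:].mean(), samples[5000:].std())
```

Replacing log_target with an actual log-posterior turns the same loop into a Bayesian parameter-estimation exercise.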
language with automatic memory management, simplicity of its syntax and readability of
its code as compared with most other languages. Python is a platform-independent
language that can be run on virtually any hardware platform and operating system.
This aspect is important in an educational setting where students use a variety of computing
systems. Python is an open source project that is free to use, with an extensive ecosystem
of code libraries for applications in science, engineering, data science and machine learning.
The example courses described make extensive use of libraries used in scientific comput-
ing, mathematics, and statistics such as NumPy [47], Pandas [48, 49], matplotlib [50] and
seaborn [51]. SciPy [52] provides algorithms for optimization, integration, interpolation,
eigenvalue problems, algebraic equations, differential equations, statistics and many other
classes of problems. In terms of machine learning, these courses use the general purpose
scikit-learn library [53], as well as deep learning frameworks such as PyTorch [54] and
TensorFlow [55].
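As a small illustration of how this stack composes in practice (the straight-line dataset, seed, and fit function below are illustrative assumptions, not material from any of the courses), NumPy, pandas and SciPy combine naturally in a least-squares fit:

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

# Noisy straight-line pseudo-data stored in a pandas DataFrame
rng = np.random.default_rng(seed=3)
df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
df["y"] = 2.0 * df["x"] + 1.0 + rng.normal(0.0, 0.5, size=len(df))

# Least-squares fit of a line with SciPy
def line(x, slope, intercept):
    return slope * x + intercept

popt, pcov = curve_fit(line, df["x"], df["y"])
print(f"slope = {popt[0]:.2f}, intercept = {popt[1]:.2f}")
```

The covariance matrix pcov returned alongside the best-fit parameters is what connects this exercise to the error-propagation topics covered in the lectures.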
Python as a language for data science and machine learning has broad community
support. Therefore, a key benefit of using Python in the classroom in terms of professional
and skills development is that it is a language used extensively in real-world applications of
data science and machine learning and widely used in industry. Of course, this is only the
present landscape and it is anyone’s guess as to how it will change on the 5–10 year timescale.
Data The specific data used in the example courses described in this paper vary according
to the exact lessons and projects being taught. However, the courses generally made use
of open scientific data sources when appropriate. This is especially relevant for the project
components of these courses. For example, open data resources at the UCI Machine Learn-
ing Repository [56], Galaxy Zoo challenge data [57], and CERN Open Data Portal [58] were
used for HEP and astrophysics student projects in the course at Illinois.
Tools The use of Jupyter notebooks was a common aspect of the example courses described
in this section. Jupyter notebooks provide an interactive front-end to a rich ecosystem
of Python packages that support machine learning and data science tasks. They provide
a means for students and instructors to create and share documents that integrate code,
LaTeX-style equations, computational output, data visualizations, multimedia, and explana-
tory text formatted in Markdown into a single document. When hosted by a cloud-based
server resource such as JupyterLab, these notebooks have huge benefits for teaching,
including removing the need to install any software locally or to require students to use
any specific machine [59].
Course Materials The reference materials used in the courses were education and training
materials that are widely available in the public domain and enhanced by a significant
amount of supplementary resources linked on the course pages. In the Illinois course,
all materials are managed through a dedicated GitHub organization. The students and
course staff are all members of this organization, with different access levels to material
(repositories) according to their role. Each student creates a private repository through
which they submit their homework and final projects for grading.
Infrastructure As with data used in the example courses, the infrastructure utilized for
course delivery varied according to the specific needs of the courses and institutional ar-
rangements. In general, the courses used open source software in the Python language to
implement scientific codes, and commonly used machine learning frameworks and libraries
within Jupyter notebooks. The Python code that the students developed could be executed
in a number of ways within these courses. For example, a common course environment could
be generated by package management software (e.g. Anaconda [60]) or a Linux container
service (e.g. Docker [61]). Another approach was to use an execution environment such as
Google Colab [62], which allows anybody to write and execute arbitrary Python code through
the browser and is especially well suited to machine learning, data analysis and education.
More technically, Colab is a hosted Jupyter notebook service that requires no setup to use,
while providing free access to computing resources including GPUs [62]. In the
Illinois course, a custom Docker container [63] maintained by the course staff is
launched onto commercial cloud resources, which are used to serve notebooks to the students
via JupyterLab and to provide computational resources to execute the code.
Delivery The primary methods of content delivery and active student engagement varied
by course but generally involved a mixture of lectures by course faculty that included
physics and data science pedagogy demonstrated through in-class live examples in Jupyter
notebooks, recitation/discussion style activities involving hands-on interactive exercises,
and projects. In the MIT course, guest speakers were invited to highlight the relevance of
the pedagogy with ongoing research in the department. In the UCSD course, the students
were actively involved in proposing and executing a new project that extends the work of
the previous quarter.
4 Opportunities
Physics departments are increasingly offering curricula to their undergraduate and grad-
uate students at the intersection of physics, data science and machine learning. Particle
physicists are increasingly interested in developing new courses at this intersection. For
those so inclined, these courses provide opportunities for particle physicists to (1) describe
synergies between modern machine learning research and particle physics research, (2) make
connections with colleagues from other departments, (3) make connections within their own
department in other research domains, (4) recruit students interested in research at the
intersection of machine learning and particle physics, and (5) learn the tools and techniques
from data science and machine learning that can be applied to particle physics research.
There are opportunities to take advantage of programs in education from federal agencies
and engage with key organizations, such as the American Physical Society’s (APS) Topical
Group on Data Science (GDS) [64]. The APS GDS is focused on promoting research
at the growing interface between physics and data science, spanning big data, machine
learning, and artificial intelligence, with relevance to HEP and other scientific domains such
as astronomy and materials science. The Data Science Education Community of Practice
(DSECOP) [65], a program funded by the APS Innovation Fund and led by the APS GDS,
seeks to support physics educators in integrating data science into their courses.
DSECOP achieves this through:
• A Slack community of physics educators and industry professionals to discuss data sci-
ence education in physics courses, with conversations centered on challenges, opportunities,
and the cutting-edge skills necessary for a wide range of jobs.
• Workshops [66] to promote shared understanding and solidify the community.
The DSECOP Fellow program is a good example of how researchers at an early career
stage can be strongly involved in curriculum development. In several of the example courses
described in Sec. 3, students and postdocs from the instructors’ research groups were involved
in curriculum development and course delivery as part of their professional development.
The same is true for several of the training and bootcamp events described in Sec. 2.
Large NSF Institutes such as the Institute for Research and Innovation in Software for
High Energy Physics (IRIS-HEP) [68], AI Institute for Artificial Intelligence and Funda-
mental Interactions (IAIFI) [69], and the Accelerated Artificial Intelligence Algorithms for
Data-Driven Discovery Institute (A3D3) [70] have HEP as a research driver and substantial
efforts in education and training. The development of course curricula for data science
and machine learning is synergistic with particle physics research efforts within these and
other institutes.
4.1 What does HEP research have to offer for ML/DS Education?
HEP research has much to offer education in data science and machine learning. Research
in HEP has long required advanced, cutting-edge computing techniques, and physicists
have historically contributed to the development of these methods. Over the past decades,
there have been great advances in the data processing power of machine learning algorithms
and these methods have quickly been adopted by physicists to address the unique timing,
memory, and latency constraints of HEP experiments.
Our science typically involves analysis of large datasets generated by complex instru-
ments at the frontier of scientific research. It is enabled by the application of machine learning
methods and data science tools, which helps demonstrate the power and importance of these
methods in a scientific research setting as well as to better understand their limitations. In recent times,
cutting-edge ML methods have been tested for their effectiveness and scalability
on problems of interest to high-energy physics. For instance, generative models have found
application in calorimeter simulation [71, 72], and graph neural networks (GNNs) have been ex-
plored for particle-flow reconstruction [73] and jet classification [74]. These applications
serve as compelling evidence of the wide-ranging applicability of ML models to large, compli-
cated datasets. Incorporating these exercises in ML pedagogy can enable students to
learn about the analytical and practical aspects of implementing complex models,
including hyperparameter optimization, data and model parallelization, uncertainty quantifi-
cation, and model interpretation.
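For instance, the hyperparameter-optimization step mentioned above can be demonstrated with a cross-validated grid search in scikit-learn (the synthetic stand-in dataset and the parameter grid below are illustrative assumptions, not a real tagging task):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a jet-tagging-style classification dataset
X, y = make_classification(n_samples=600, n_features=10,
                           n_informative=5, random_state=4)

# Cross-validated scan over two hyperparameters of a random forest
search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid={"n_estimators": [25, 100], "max_depth": [3, None]},
    cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern scales, with larger grids and parallel backends, to the much heavier searches used in HEP-scale model development.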
In short, in particle physics we have some of the most compelling scientific applications
of data science and machine learning that involve very large and complex datasets.
Particle physics is also impacting machine learning research and therefore machine learn-
ing education. The constraints of HEP experiments and known symmetries of physical sys-
tems create a rich environment for the development of novel and physics-informed machine
learning (see for example [75–78]). There are even entire conferences and workshops dedi-
cated to this intersection including the Microsoft Physics ∩ ML lecture series [32] and the
ML and the Physical Sciences workshop at NeurIPS [33, 34].
Exploration of machine learning models within the domain of high-energy physics goes
beyond the usual regression and classification problems. The pursuit of discovery in physics
requires explicit understanding of the causal relationship between the inputs and outputs
of an analysis model, and classical ML techniques that help elucidate such relationships
(polynomial regression models, decision trees, and random forests, for instance) have found
numerous applications in physics problems, including parameterized cross-section estima-
tion for novel physics models [79] and event classification in the H → γγ channel of the
Higgs boson search at the LHC [80]. However, deep neural networks are becoming increasingly pop-
ular and showing improved performance over simpler ML techniques (see for instance [81]
for a comparison of classical and deep neural techniques for identifying the longitudinal polar-
ization fraction in same-sign W W production). These complex, highly nonlinear models,
often comprising O(100k) parameters, are extremely difficult to explain and are often regarded
as black-box surrogates. Recent literature has focused on explainable AI (xAI), and
a number of methods [82–87] have been explored to identify the importance of features and
intermediate hidden layers in a wide range of deep neural network models.
Application of these methods in HEP research [88, 89] is quickly becoming popular, as model
interpretability remains crucial for determining the relevance of physics insights to
ML models. With a long history of using interpretable models for physics research, HEP
allows validation of xAI techniques for mainstream ML research and pedagogy.
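As a classroom-scale illustration of feature-importance methods (a generic sketch using scikit-learn's permutation importance on synthetic data, not one of the HEP applications cited above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Dataset where only the first two features carry signal (shuffle=False
# keeps the informative features in columns 0 and 1)
X, y = make_classification(n_samples=800, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=5)

model = GradientBoostingClassifier(random_state=5).fit(X, y)

# Permutation importance: how much does shuffling each feature hurt accuracy?
result = permutation_importance(model, X, y, n_repeats=10, random_state=5)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```

The informative columns dominate the ranking, which is exactly the sanity check a physicist would want before trusting a tagger's learned features.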
fields of study such as HEP, multimessenger astronomy and computational neuroscience.
Such endeavors help make HEP research more visible to multiple communities, developing
a broader interest among students and researchers in learning about and solving problems
in HEP.
Finally, as all instructors know, from teaching assistants to professors, teaching a
course really helps one understand the material being taught at a deeper level. This is
true for the traditional physics curriculum as well as for courses at the intersection of physics, data
science and machine learning. Teaching in this broader space makes us better researchers
in particle physics, both as practitioners of these tools and methods and by keeping us
abreast of current developments (to some limited degree).
5 Challenges
There are numerous challenges confronting particle physicists incorporating data science and
machine learning into the physics curriculum. We touch on a few of them in this section
and include some ideas for mitigating the risks associated with these challenges.
5.3 Distinction and Adoption
We want to make sure that courses developed in the physics department distinguish them-
selves from other courses and are valued by the students they are designed to teach. Two
suggestions are as follows:
1. Not trying to do too much. Our strengths in HEP lie in the analysis and inter-
pretation of large scientific data sets and physics-inspired AI. Leave the foundational
AI pedagogy to the CS courses.
2. Balancing physics and ML pedagogy. Remember that it’s a physics course taught
in the Physics Department. It is best to use as many physics examples and datasets
as possible to support your instruction, e.g., classifying jets and galaxies rather than
cats and dogs.
The most important consideration in terms of tools and infrastructure is to have these
elements work well without detracting from the learning environment. As a simplistic ex-
ample, if one wants to use a lamp to illuminate a room, one just wants the lamp to be
functional and the electrical infrastructure to work, without being distracted by the details
of how electrical current arrives at the outlet (of course, that is interesting in other contexts).
The same is true for a course in physics and machine learning.
Software and tools to launch student code within notebooks on CPUs, GPUs, and other
resources to study aspects of physics should be open, portable, robust and easy to use in
order to make physics education using DS/ML most effective.
Containers are a great technology for providing custom, course-specific software and data
environments for use by students on course infrastructure. These custom containers can be
hosted externally and launched on cloud-based services to execute student notebooks. Students
simply write code in notebooks within a common software environment, and the backend
resources are provisioned for cell execution. Further customization of the run-time environment
is of course possible with the appropriate instructions in the notebook (e.g. via pip install).
All of these functions are available with existing technologies. However, it is important
for universities to provide support to educators, who are most often not experts in these
technologies, in maintaining a working environment of tools and infrastructure. Increased
sharing of experiences with the tools and infrastructure used for education in data science
and machine learning among researchers in HEP and other fields is strongly encouraged.
share the experiences of this type of teaching and general discovery of who has developed
and taught such courses at their institutions. Answering questions like:
• What courses have been developed and delivered by our community?
• What open data is available for possible use in ML/DS education?
• What training/education/bootcamp/hackathon materials already exist?
How can we better collect and expose the above information for use in our community?
The information is available but diffuse and therefore some coordination would make the
sharing of knowledge and experiences much more efficient. This type of coordinated effort
could make the sharing and improvement of projects utilizing HEP data more efficient.
7 Conclusions
We believe it will become ever more crucial that both our young and experienced researchers
have a working understanding of data science and machine learning tools. We would like to
see a continuation of community efforts towards raising the level of ML proficiency among
current researchers by providing in-depth and innovative schools and other training events.
It would also be advantageous to see more movement towards the addition of DS and
ML studies within the physics curriculum at our educational institutions. All of this will
take cooperation from the entire HEP community. We have described in this white paper
some of the experiences from HEP researchers in ML education, outlined opportunities and
challenges, and recommended future directions to make this area more efficient and effective
for the HEP community.
References
[1] J. St. John et al., Real-time artificial intelligence for accelerator control: A study at
the Fermilab Booster, Phys. Rev. Accel. Beams 24 (2021) 104601 [2011.07371].
[2] A. Butter and T. Plehn, Generative Networks for LHC events, 2008.08558.
[3] S. Forte and S. Carrazza, Parton distribution functions, 2008.12305.
[4] A. Butter et al., The Machine Learning landscape of top taggers, SciPost Phys. 7
(2019) 014 [1902.09914].
[5] J. Duarte and J.-R. Vlimant, Graph neural networks for particle tracking and
reconstruction, in Artificial Intelligence for High Energy Physics, P. Calafiura,
D. Rousseau and K. Terao, eds., p. 387, World Scientific (2022), DOI [2012.01249].
[6] Z.A. Elkarghli, Improvement of the NOvA Near Detector Event Reconstruction and
Primary Vertexing through the Application of Machine Learning Methods, Master’s
thesis, Wichita State U., 2020, [2112.01494].
[7] B. Nachman, Anomaly Detection for Physics Analysis and Less than Supervised
Learning, 2010.14554.
[8] Y. Alanazi, N. Sato, P. Ambrozewicz, A.N.H. Blin, W. Melnitchouk, M. Battaglieri
et al., A survey of machine learning-based physics event generation, 2106.00643.
[9] K. Albertsson et al., Machine Learning in High Energy Physics Community White
Paper, J. Phys. Conf. Ser. 1085 (2018) 022008 [1807.02876].
[10] D. Guest, K. Cranmer and D. Whiteson, Deep Learning and its Application to LHC
Physics, Ann. Rev. Nucl. Part. Sci. 68 (2018) 161 [1806.11484].
[11] A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel,
A. Aurisano, K. Terao and T. Wongjirad, Machine learning at the energy and
intensity frontiers of particle physics, Nature 560 (2018) 41.
[12] D. Bourilkov, Machine and Deep Learning Applications in Particle Physics, Int. J.
Mod. Phys. A 34 (2020) 1930019 [1912.08245].
[14] A.J. Larkoski, I. Moult and B. Nachman, Jet Substructure at the Large Hadron
Collider: A Review of Recent Advances in Theory and Machine Learning, Phys.
Rept. 841 (2020) 1 [1709.04464].
[16] M.D. Schwartz, Modern Machine Learning and Particle Physics, Harv. Data Sci.
Rev. 3 (2021) [2103.12226].
[18] A.M. Deiana et al., Applications and techniques for fast machine learning in science,
Front. Big Data 5 (2022) 787421 [2110.13041].
[19] M. Feickert and B. Nachman, A Living Review of Machine Learning for Particle
Physics, 2102.02770.
[20] V. Dhar, Data science and prediction, Communications of the ACM 56 (2013) 64.
[24] “UCSD Data Science Capstone.” https://dsc-capstone.github.io.
[25] CMS Collaboration, “2020 CMS Data Analysis School.” (accessed on August 28,
2020) https://lpc.fnal.gov/programs/schools-workshops/cmsdas.shtml.
[26] “2020 Hands-on Advanced Tutorial Sessions at the LPC.” (accessed on August 28,
2020) https://lpc.fnal.gov/programs/schools-workshops/hats.shtml.
[27] “CERN Summer School.” https://home.cern/summer-student-programme.
[28] HEP Software Foundation Software Training Center website.
https://hepsoftwarefoundation.org/training/curriculum.html.
[29] “Computational and data science training for high energy physics.”
https://codas-hep.org.
[30] D.S. Katz et al., Software Sustainability & High Energy Physics, in Sustainable
Software in HEP, 10, 2020, DOI [2010.05102].
[31] “2020 ML4Jets Workshop.” (accessed on August 28,
2020) https://iris-hep.org/projects/ml4jets.html.
[32] Microsoft, “Physics Meets ML Lecture Series.” http://physicsmeetsml.org.
[33] “2020 Machine Learning and the Physical Sciences Workshop.”
https://ml4physicalsciences.github.io/2020.
[34] “2021 Machine Learning and the Physical Sciences Workshop.”
https://ml4physicalsciences.github.io/2021.
[35] The 2019 US-ATLAS Computing Bootcamp website.
https://sammeehan.com/2019-08-19-usatlas-computing-bootcamp.
[36] The 2020 US-ATLAS Computing Bootcamp website.
https://indico.cern.ch/event/933434.
[37] Pankaj Mehta, “Machine Learning for Physicists.”
http://physics.bu.edu/~pankajm/PY895-ML.html.
[38] Michael Coughlin, “Big Data in Astrophysics.”
https://github.com/mcoughlin/ast8581_2022_Spring.
[39] M. Erdmann, J. Glombitza, G. Kasieczka and U. Klemradt, Deep Learning for
Physics Research, World Scientific (2021), 10.1142/12294,
[https://www.worldscientific.com/doi/pdf/10.1142/12294].
[40] P. Calafiura, D. Rousseau and K. Terao, Artificial Intelligence for High Energy
Physics, World Scientific (2022), 10.1142/12200.
[41] P. Mehta, M. Bukov, C.-H. Wang, A.G. Day, C. Richardson, C.K. Fisher et al., A
high-bias, low-variance introduction to machine learning for physicists, Phys. Rep.
810 (2019) 1 [1803.08823].
[42] Javier Duarte, “Particle Physics and Machine Learning.”
https://jduarte.physics.ucsd.edu/capstone-particle-physics-domain.
10.5281/zenodo.4768815.
[43] E.A. Moreno, T.Q. Nguyen, J.-R. Vlimant, O. Cerri, H.B. Newman, A. Periwal
et al., Interaction networks for the identification of boosted H → bb decays, Phys.
Rev. D 102 (2020) 012010 [1909.12285].
[47] C.R. Harris, K.J. Millman, S.J. van der Walt, R. Gommers, P. Virtanen,
D. Cournapeau et al., Array programming with NumPy, Nature 585 (2020) 357.
[49] Wes McKinney, Data Structures for Statistical Computing in Python, in Proceedings
of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman,
eds., pp. 56 – 61, 2010, DOI.
[51] M.L. Waskom, seaborn: statistical data visualization, Journal of Open Source
Software 6 (2021) 3021.
[56] “UCI machine learning repository.” https://archive.ics.uci.edu/ml/datasets.
[59] E. Van Dusen, Jupyter for teaching data science, in SIGCSE ’21: Proceedings of the
52nd ACM Technical Symposium on Computer Science Education, p. 1359, 2021.
[68] “Institute for Research and Innovation in Software for High-Energy Physics.”
https://iris-hep.org.
[73] J. Pata, J. Duarte, J.-R. Vlimant, M. Pierini and M. Spiropulu, MLPF: Efficient
machine-learned particle-flow reconstruction using graph neural networks, Eur. Phys.
J. C 81 (2021) 381 [2101.08578].
[74] E.A. Moreno, T.Q. Nguyen, J.-R. Vlimant, O. Cerri, H.B. Newman, A. Periwal
et al., Interaction networks for the identification of boosted H → bb decays, Phys.
Rev. D 102 (2020) 012010 [1909.12285].
[79] A. Roy, N. Nikiforou, N. Castro and T. Andeen, Novel interpretation strategy for
searches of singly produced vectorlike quarks at the LHC, Phys. Rev. D 101
(2020) 115027.
[80] The CMS Collaboration, Observation of a new boson with mass near 125 GeV in pp
collisions at √s = 7 and 8 TeV, J. High Energy Phys. 2013 (2013) 1.
[81] C.W. Murphy, Class imbalance techniques for high energy physics, SciPost Phys. 7
(2019) 76.
[83] S.M. Lundberg and S.-I. Lee, A unified approach to interpreting model predictions,
Advances in Neural Information Processing Systems 30 (2017).
[87] M.S. Schlichtkrull, N. De Cao and I. Titov, Interpreting Graph Neural Networks for
NLP With Differentiable Edge Masking, in International Conference on Learning
Representations, 2020 [2010.00577].
[88] D. Turvill, L. Barnby, B. Yuan and A. Zahir, A survey of interpretability of machine
learning in accelerator-based high energy physics, in 2020 IEEE/ACM International
Conference on Big Data Computing, Applications and Technologies (BDCAT), p. 77,
IEEE, 2020.
[89] F. Mokhtar, R. Kansal, D. Diaz, J. Duarte, J. Pata, M. Pierini et al., Explaining
machine-learned particle-flow reconstruction, in 4th Machine Learning and the
Physical Sciences Workshop at the 35th Conference on Neural Information
Processing Systems, 2021 [2111.12840].
[90] L. Lyons, Bayes and frequentism: a particle physicist’s perspective, Contemporary
Physics 54 (2013) 1.
[91] “2021 CERN-Fermilab HCP Summer School.”
https://indico.cern.ch/event/1023573/.
[92] “Theoretical Advanced Study Institute Summer School 2018 ‘Theory in an Era of
Data’.” https://sites.google.com/a/colorado.edu/tasi-2018-wiki/.
[93] R.J. Barlow, Practical statistics for particle physics, 2019 [1905.12362].
[94] G. Cowan, “Statistics for Particle Physicists.”
https://cds.cern.ch/record/2773595, 2021.
[95] O. Behnke, K. Kröninger, G. Schott and T. Schörner-Sadenius, Data analysis in high
energy physics: a practical guide to statistical methods, John Wiley & Sons (2013).
[96] L. Lista, Statistical methods for data analysis in particle physics, vol. 941, Springer
(2017).
[97] “FAIR4HEP: Findable, Accessible, Interoperable, and Reusable Frameworks for
Physics-Inspired Artificial Intelligence in High Energy Physics.”
https://fair4hep.github.io/.
[98] M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak
et al., The FAIR guiding principles for scientific data management and stewardship,
Sci. Data 3 (2016) 1.
[99] A.-L. Lamprecht, L. Garcia, M. Kuzak, C. Martinez, R. Arcila, E. Martin Del Pico
et al., Towards FAIR principles for research software, Data Sci. J. 3 (2020) 37.
[100] D.S. Katz, F. Psomopoulos and L. Castro, Working towards understanding the role
of FAIR for machine learning, DaMaLOS@ISWC (2021) 1.
[101] Y. Chen, E. Huerta, J. Duarte, P. Harris, D.S. Katz, M.S. Neubauer et al., A FAIR
and AI-ready Higgs boson decay dataset, Sci. Data 9 (2022) 1 [2108.02214].
[102] S. Samuel, F. Löffler and B. König-Ries, Machine learning pipelines: provenance,
reproducibility and FAIR data principles, in Provenance and Annotation of Data and
Processes, p. 226, Springer (2020).
[103] V. Acquaviva, Teaching machine learning for the physical sciences: A summary of
lessons learned and challenges, in Teaching ML workshop at the European
Conference of Machine Learning 2021, 2021 [2108.08313].