Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners

ABSTRACT
Offensive security-tests are a common way to pro-actively discover potential vulnerabilities. They are performed by specialists, often called penetration-testers or white-hat hackers. The chronic lack of available white-hat hackers prevents sufficient security test coverage of software. Research into automation tries to alleviate this problem by improving the efficiency of security testing. To achieve this, researchers and tool builders need a solid understanding of how hackers work, their assumptions, and pain points.

In this paper, we present a first data-driven exploratory qualitative study of twelve security professionals, their work and problems occurring therein. We perform a thematic analysis to gain insights into the execution of security assignments, hackers’ thought processes and encountered challenges.

This analysis allows us to conclude with recommendations for researchers and tool builders to increase the efficiency of their automation and identify novel areas for research.

KEYWORDS
software testing, offensive security testing, ethical hacking

ACM Reference Format:
Andreas Happe and Jürgen Cito. 2023. Understanding Hackers’ Work: An Empirical Study of Offensive Security Practitioners. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’23), December 3–9, 2023, San Francisco, CA, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3611643.3613900

ESEC/FSE ’23, December 3–9, 2023, San Francisco, CA, USA
© 2023 Copyright held by the owner/author(s).
This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’23), December 3–9, 2023, San Francisco, CA, USA, https://doi.org/10.1145/3611643.3613900.

1 INTRODUCTION
For convenience and efficiency reasons, more and more devices are being connected and thus exposed to public networks. While beneficial, this has a dark undercurrent: the respective system’s attack surface is increased and could be exploited by malicious actors.

In a perfect world, all created software would be free from faults. As recent [19, 20], and not so recent [29], news implies, we are sadly not there yet. While secure software development, enabled by defensive security testing [44, 52, 57, 58], is the long-term goal, short-term interventions are needed. In addition, there is an ever-increasing abundance of legacy software whose security needs to [...] assessments, also known as penetration tests (pen-tests)¹, to identify vulnerabilities and remediate them before they are discovered and exploited by malicious actors. This approach is limited by the availability of skilled offensive security professionals. While this situation should be remediated through increased enrollment in IT security educational programs, improving the efficiency of the penetration testers through tooling is an equally important measure. To accomplish this, research and tooling should be well-aligned with security professionals’ activities and needs.

¹ Pen-tests are not the only approach. Another pro-active approach, involved during software conception and development, has become known as “left-shifting” security.

However, to the best of our knowledge, there has been no empirical research into what type of security assessments are performed, what actions are regularly performed within those, or how professionals select attacks to be run against their targets. Without this, developments might be swift but potentially misguided, and thus eventually irrelevant.

Research Questions & Structure of this Work. We used three research questions to drive the development of this work; the applied research method is described in the Methodology section.

Our first research question was “What do common security tests look like?” We present the gathered information in section Performing Security Tests, detailing different types of assignments, their particularities, common actions performed during assignments, and the role of automation.

The second research question “How do Hackers perform their work?” focused on the inner world of our participants. Education is an important part of socialization; therefore, results about this aspect are included in section Becoming A Hacker. In Section How do Hackers think? we present recurring themes detected during our analysis. We focus on thought processes during assignment execution, target and attack selection, dealing with uncertainty, and internal quality assurance.

The Discussions and Implications Section is the response to the final “What tedious or time-consuming areas could be improved?” question. We grouped the identified research and development opportunities according to our target audience of researchers and tool builders.

2 RELATED WORK
While there has been ample research on secure software development and defensive security testing [44, 52, 57, 58], the focus of our study is offensive security testing. To the best of our knowledge, this is the first work that focuses on how hackers work, i.e., the
context within which a security professional moves and the processes that influence their decisions during security assignments.

Huaman et al. [39] performed a large-scale interview study of German small-to-medium enterprises (SMEs). While SMEs make up a third of Germany’s GDP, they often lack resources for establishing an effective cyber-security posture. The study analyzes their preconceptions with regard to cybercrime, their adoption of security measures and their experiences with attacks. In contrast to our study, it focuses upon the potential “victims”, not upon security operators. One interesting finding was that 45.1% of interviewed companies had a cybersecurity incident warranting manual response in the preceding 12 months, further highlighting the need for trained personnel.

Smith, Theisen and Barik [49] describe Red Teams working at Microsoft. They cover a wide range of topics including how corporate culture and red teaming interact. They also lightly touch on how people became security professionals and the interactions in their daily work. Their interviewees were recruited from within Microsoft, a single large-scale company, and thus might not reflect an industry which, cf. the previously mentioned paper, consists to a large part of SMEs. In contrast, this publication focuses upon the execution of security assignments, highlights hackers’ thought processes and details challenges in academic and automation research. Furthermore, this paper is not limited to the discipline of red-teaming.

Van den Hout [54] investigated the impact of different penetration test methodologies on the quality of the tests performed, but concluded that only one reviewed methodology had widespread adoption, and that its recommendations for a structured approach were not taken into account. This could indicate a gap between “real” penetration testing and codified methodologies.

Multiple papers describe aspects of penetration-testing without focusing on the operator’s mindset or their decision processes. Munaiah et al. [43] analyze event datasets and manually map attack patterns to MITRE ATT&CK Enterprise. This is used to show a-posteriori attack patterns but does not analyze how hackers select the attacks to execute. MITRE ATT&CK itself is a taxonomy of TTPs (Tactics, Techniques and Procedures) and not a full attack methodology. Bhuiyan et al. [23] use GitHub security bug reports to identify the origins of bug reports. Examples of these origins are software source code, software log files, binary files, etc. This details what data are used during reporting, but does not explain how a security professional identifies potential vulnerabilities for research in the first place, e.g., why a security professional analyzes a mentioned log file for relevant security information.

Other papers focus upon narrow sub-disciplines of hacking which cannot be projected upon the hacking industry at large. Ceccato et al. [28] describe how “hackers” perform attacks against protected software, i.e., how software protection mechanisms in provided binary files are analyzed through reverse engineering. Based upon the responses of our interview series, reverse-engineering is not representative of the activities performed by offensive operators at large (reverse engineering was mentioned only once, by an interview partner well renowned for publicly disclosed high-impact vulnerabilities, who said that he is leaving reversing due to time and requirement constraints).

The PhD thesis “How Hackers Think” [51] is a high-level treatise on hacker history, culture and their thought processes. It identifies multiple characteristics of hackers, e.g., being highly self-motivated and curious, being able to tolerate ambiguity, and their use of mental models and patterning. Its focus lies on a high conceptual level and it does not analyze how hackers actually identify and choose vulnerabilities to test. Neither does the study identify how different areas of penetration-testing, e.g., OT or red-teaming, might impact a hacker’s mindset.

Table 1: Participants

Participant      Primary            Secondary
Participant 1    web                infrastructure, iso27001
Participant 2    web                infrastructure, mobile
Participant 3    red-team           AD, OT, web
Participant 4    web                social engineering
Participant 5    red-teaming, IoT   OT, web, social engineering
Participant 6    web                AD, social engineering
Participant 7    infrastructure     web, tool development
Participant 8    web                infrastructure
Participant 9    infrastructure     AD
Participant 10   red-teaming, AD
Participant 11   OT, IoT            web
Participant 12   web

3 METHODOLOGY
In general, our research follows a pragmatist approach [42, 46] combining methods from the empiricist and summarist interpretist traditions [33]. We used semi-structured interviews to gather insights into hackers’ work and thought processes.

Ethical Considerations. Our institution does not have a formal IRB process but offers voluntary submission to a Pilot Research Ethics Committee. As human interviews were conducted, the committee was consulted, and topics were discussed, including ethically relevant methodological clarifications, more specifically questions related to the involvement of voluntary participants in the research, as well as mitigating the risk of contextual identification. Participants gave their informed consent before the interviews took place; all data collected were anonymized by researchers prior to analysis. All data storage and processing complied with strict national privacy regulations and the EU’s General Data Protection Regulation (GDPR).

Recruitment. We define the target population as offensive-security practitioners that work directly with customer systems. Previous research has found that security professionals are reluctant to communicate with outsiders [41], especially when it comes to their methodology and techniques. To counteract this, researchers reached out to public figures: the initial seed was populated by contacting security companies, finalists of public security challenges, and security conference participants. We used snowball sampling to improve the interview pool: at the end of each interview, we asked the current interviewee to connect us with other offensive security professionals. In addition, we cold-called both a hacking education YouTuber and a public hacking collective that is well known for
publishing vulnerability disclosures. Both were mentioned by the participants during the interviews; neither reacted to the contact attempt, further reinforcing the idea of a close-knit community [41].

We sampled new interview participants until theoretical saturation was reached, that is, no new information was obtained during the interviews. When considering theoretical saturation we differentiated between common themes and themes specific to the interviewee’s specialty area. We continued interviewing until two consecutive interviews contributed no new specialty-area information and three consecutive interviews contributed no new common themes. Theoretical saturation was reached after the 12th interview, which fits recommendations [31, 34].

Participants. Participants had to work primarily in an offensive security field; we excluded participants that primarily worked within social engineering or physical security. If participants were working in a hybrid field, such as reverse-engineering or source-code analysis, their primary focus had to be offensive.

We reached out to offensive security professionals with at least four years of experience in the IT security field.

To our dismay, we were not able to recruit any offensive security professionals that identified as non-male. While we come from a culture that naively prides itself on blind meritocracy [24], we found this contradiction disturbing. As we did not deem it relevant, we did not ask about our participants’ religious or cultural affiliations, but in hindsight, we can assume diversity in that area.

To protect the anonymity of the participants, we cannot detail their employment status, ethnicity, work experience before security work, and time of employment within the security field, etc. When excluding education and CTF-participation, participants had an average work experience of 9 years (μ = 9.0, σ = 6.5, median = 8).

Interview Protocol. Interviews were conducted as semi-structured interviews utilizing video conferencing software. All but two interviewees enabled both video and audio transmission. The average duration of the interview was 55 minutes. Before the main interview started, the participants were informed about data processing and their rights, and asked for their informed consent.

The interviews were opened with questions about the interviewee’s job description and how they acquired the needed skill set. Those were followed up by talking about the types of security assignments the participants are involved with. For 1–3 of these areas, detailed questions about particularities, procedures, automation, and problems were asked; since the questions were open-ended, the interviews branched out to subtopics organically from there. The interviews were closed with questions about grievances and additional thoughts related to the field of IT security.

We recorded and manually transcribed all interviews. During the transcription, sensitive data was scrubbed from the interview; the transcribed interview was then submitted for confirmation to the interviewee. Scrubbed interviews were loaded into delve [5] for thematic analysis.

Analysis. Reflexive Thematic Analysis [26] was chosen to perform a data-driven exploratory analysis of interview transcriptions. In summary, when performing thematic analysis, the researchers initially familiarize themselves with the data, and extracts of the data are tagged with codes. These codes are then used to create clusters that identify or construct underlying themes. Then, those themes are reviewed, defined, and named. The results of the findings are presented in Sections 4–6.

Threats to Validity. Any interview-based study faces the threat of selection bias (internal threat). To counteract this, we performed snowball sampling, recruited random security professionals during security conferences, and explicitly invited security professionals from different disciplines.

For ethical reasons, interview participation was limited to white-hat hackers, i.e., ethical hackers (internal threat). According to prior analysis, the activities of black-hat hackers, e.g., ransomware groups, can be seen as a subset of the activities performed by ethical red-teams [3, 4], which are covered in this work.

Another potential bias would be experimenter bias (internal threat). To reduce the risk, all the data collected was analyzed separately by the different authors, and their respective labeling results were compared for differences; ambiguities were discussed and resolved.

Hacking contains multiple disciplines. Our results might only capture common themes of a subset of those (external validity). We try to counteract this by inviting interviewees from various hacking fields, as is reflected in Table 1.

The geographical distribution covered roughly Central Europe. Other geographic regions might be more advanced when it comes to the utilization of the different types of security assignment. If so, the relative importance of red-teaming would be diminished within this study while the techniques, processes, and problems described would remain valid.

4 BECOMING A HACKER
The interview responses reveal several interesting themes regarding the path to becoming a hacker.

Academic Education. All but one participant attended at least a single university-level class. Nine completed bachelor’s degree studies in IT; all of those went on to add a master’s-level degree. The percentage of interviewees enrolled in IT security specific programs increased from 55% (n = 5) for bachelor’s studies to 78% (n = 7) for master’s studies, indicating that participants felt drawn toward IT security. This fits the perceived lack of IT-Security and Secure Development lectures during non IT-security centric programs, which was partially addressed by attending CTFs or enrolling for non-mandatory security classes. Classes were often taken in an extra-occupational capacity. All of this fits a common theme of “fascination with IT security” combined with high intrinsic motivation.

Experience before IT-Security. Having 2–3 years of non-security IT exposure before entering the IT security field was found to be advantageous. Another related recommendation was to have a broad IT security base combined with one or two specialization areas. Within our group of interviewees, the common base was web security or internal network assessments; examples of specializations were red teaming or cloud-specific knowledge.

Staying relevant. All interviewees perceived a need for ongoing education. The ubiquitous information source was Twitter, followed by other online services such as YouTube channels, blog posts, Reddit, GitHub, or paid-for online courses. In the physical world, colleagues and conferences were mentioned. The quality of online material was considered high, although an interviewee had
qualms publishing information due to potential misuse. A single participant regularly used the Darknet as a news source.

To CTF or not. CTF attendance was a common theme. Participants saw a bidirectional information transfer: skills learned in CTFs were applicable at work and vice versa. Tasks in CTFs were considered very targeted in that they narrowly target a vulnerability, and solving the challenge or reading a write-up were considered efficient ways of gathering knowledge about the respective vulnerability. Specialized security practitioners, e.g., from the OT or ICS area, found CTFs to be introductory and shallow.

5 HOW DO HACKERS WORK?
While we encountered the common muttering of “every project is different”, these sections identify types of penetration tests, each with distinct requirements, strategies, and particular actions. When looking at a pen-tester’s work, this is the external view, i.e., how a pen-tester’s work is perceived from the outside.

5.1 Types of Security Tests and their Differences
Although different assignments have a similar project organization, their execution differs due to the respective client and target environment. Table 2 shows the main types of security assignments encountered during interviews.

Table 2: Types of Security Assessments

Type                       Covert        Team-Size   Effort in Days
Vulnerability Assessment   not typical   1           2-4
Penetration Test           optional      1-2         5-10
Internal Network Test      optional      1-2         7-10
IoT Test                   optional      1-2         7-10
OT Test                    never         1-2         7-10
Red-Teaming                always        3-4         30+

Vulnerability Assessments focus upon achieving a high coverage of the targeted assets, which are typically external IP-ranges (including web servers) or internal networks (including clients and internal infrastructure). Enumerating targets, e.g., through web crawling or network scans, leads to the creation of important inventory databases. Those are subsequently used to test against known vulnerability databases, known configuration errors or generic vulnerability classes such as SQL injections. As assignments typically include large numbers of potential targets, a high level of automation is necessary.

Web-, Infrastructure- or Mobile Penetration Tests share similarities with vulnerability assessments. The demarcation point between those two varied between interviewees. The situation is further complicated as vulnerability scans are often used as an initial step during pen-testing. Generally speaking, while vulnerability assessments focus on breadth, pen-testing focuses on depth, i.e., thoroughly breaking a single target. Pen-tests are within the realm of application security: in addition to well-known vulnerabilities or configuration errors, new vulnerabilities are hunted within the software under test. Penetration tests are often performed against custom-written software where no prior vulnerabilities are published in vulnerability databases. As the scope is tight, customers commonly provide dedicated test environments against which destructive tests can be performed. Another benefit of the limited scope is that the execution of a penetration test can be highly structured; some (n = 2) interviewees went as far as calling them “catalog-based”. Pen-tests are primarily performed manually.

Internal Network Penetration Tests verify the security and resilience of internal networks. Their basic assumption is “assumed breach”, i.e., the adversary is already within the local network and now attempts to gain sensitive data or achieve higher privileges, emulating the ransomware scenarios which have recently scourged companies. Microsoft Active Directory (AD) is ubiquitous in corporate networks; thus, if present, it is the main target. In these cases, the security assignment’s intent is to obtain domain administrator privileges. The focus lies on exploiting known vulnerabilities, product features, mis-configurations, and insufficient access-control or hardening measures. Another big aspect is Lateral Movement, i.e., using compromised systems to pivot to new targets. Assignments are performed against production environments.

IoT Tests are often performed as product tests and target concrete products such as Smart-Home or smart medical devices. The scope of typical IoT pen-tests is broadened by the inclusion of hardware-specific tests, as well as testing if regulatory safety requirements are upheld during attacks. IoT pen-tests often include tests of connected web or mobile applications. Together with the hardware angle, this makes for a broader scope compared to a Web-, Infrastructure-, or Mobile Penetration Test.

OT Tests target Operational Technology (OT) such as SCADA, ICS or utility networks. They can be differentiated into product tests and in-situ network tests of already configured systems. As solutions consist of off-the-shelf software that is highly customized for usage within the corresponding client network, the latter are often preferred by the customer. Tested subjects often use proprietary protocols; therefore, reverse engineering is a common practice in OT tests.

OT facilities, e.g., power plants, are expensive and often hard to come by; thus, a dedicated testing environment is rarely available. Testing commonly occurs during scheduled down-times; this severely impacts the available test window. Another related particularity: availability often trumps the breadth or depth of performed security tests. As test subjects are “connected to the real world”, negative side effects are potentially catastrophic. Security tests are therefore highly coordinated with customers to prevent any negative fallout. This often prohibits any covert action.

Regulatory requirements [21] lead to a convergence between IoT and OT devices. In addition, Microsoft Active Directory is starting to creep into OT networks, thus creating an overlap with Internal Network Tests.

Compared to other approaches, in Red-Teaming the attackers have a concrete mission, e.g., gain access to a defined subset of computers or a source code repository. While in Internal Network Penetration Tests gaining Domain Admin is often the final goal, this is only a means for achieving the mission during Red-Teaming. Attackers holistically target a company and employ additional techniques such as Open Source Intelligence (OSINT), Persistence, Command&Control (C2) and Social Engineering; Post-Exploitation
is more prominent compared to other disciplines. Red teaming is not concerned with broad coverage, but with achieving the team’s well-defined objective. Red-Teaming does not only attack the target’s technical security posture but also the response of the blue team, i.e., defenders. Thus, covert operations, hidden persistence, command&control systems and evasion of defensive techniques enter the picture.

Assignments are often performed in larger teams and over extensive time frames, making information transfer between participants more important. Adding additional team members to speed up an ongoing operation is problematic as the new team members do not share the existing members’ target system knowledge.

5.2 Black- vs. Gray-Box Security Testing
When it comes to test execution, an important distinction is the amount of information and support provided by the customer. During black-box tests, practitioners go in “blind”; no information except the scope is given. During white-box tests, full system access or even the source-code of the tested application is given. Gray-box tests lie in-between: often access credentials or system architecture descriptions are provided before testing commences.

Pure white-box tests, as in “source-code reviews”, are rarely performed due to their prohibitive costs. The type of assignment is also of importance: red-teaming is almost always performed as a black-box test as the target’s personnel are not beneficially involved. OT tests are often performed in tight lock-step with customers (to reduce the potential fallout) and thus are gray-boxed. Interviewees overwhelmingly recommended moving from black-box towards white-box testing. The reasons given were time and thus cost efficiency, as well as potential for improved test coverage.

In other areas, customers are helping pen-testers to improve efficiency too. “Assumed breach” scenarios in Internal Network Penetration Testing conceptually assume that a client computer will be breached eventually and thus use a breached computer as a starting point for investigations. During web pen-tests or during external scans, rate limits or firewalls are commonly disabled to allow swift pen-test execution. During web application pen-tests, internal details, such as used technologies, are commonly provided to reduce the search space.

5.3 Typical Testing Workflows
Participants were asked to detail the execution of the different types of assignments. This section describes the peculiarities of the different areas.

Activities performed during Web Penetration Tests can be separated into exploratory intuitive testing and exhaustive testing against checklists or standards. All interviewees utilized both; no specific ordering between those two was detected, although if the checklist-verification was automated, it often was run in parallel to exploratory testing. If a high level of automation is achieved, the manual (exploratory) testing can be integrated into the automation: one interviewee detailed a multi-stage automated test-setup containing multiple enumeration steps, e.g., website crawling, where the result of each step was manually verified, rectified and used to instrument subsequent automated steps. Manual testing, e.g., manual crawling, was integrated as additional input into the tested steps.

According to interviewees, most time and effort are spent upon authorization tests. An application typically has multiple user groups with different access rights. During testing, penetration testers request one or more users per existing group and try to perform unauthorized data access with one user using data of another user. To verify responses, testers need documentation about the implemented access groups. If none was given, interviewees approximate a model of the access rules through probing/testing and experience.

With the exception of testing for authentication or authorization, automated testing was deemed well-established and automated tooling was commonly employed. Common injection attack vectors were well covered by tooling, for example sqlmap [18] for testing for SQL injections. Multiple Web-Application-Testers “complained” that typical injection-based attacks which were common 10 years ago are now seldom seen and are rather used for illustrative purposes during education. Their suspected “culprit” is the rise of web application frameworks with sane defaults that automatically prevent many attack classes. Multiple interviewees considered switching their area of interest due to this development.

Multiple interviewees described API-based tests as tedious. Typically an API test is performed by calling a sequence of operations. Each operation is detailed through an API specification provided by the customer, e.g., through OpenAPI/Swagger or WSDL files. In theory, directly testing the back-end API reduces the pen-testing overhead as the tester can focus upon the core functionality; in practice, API tests become time-consuming due to a lack of documentation with sufficient quality. API documentation only describes single operations, often lacking detailed descriptions of valid input formats and their semantics. In addition, to achieve good test coverage, test cases need to perform a sequence of causally dependent API calls, potentially reusing and refining data between operations. While performing a traditional web application pen-test, this causality and examples of input data can be derived from the captured web traffic. When performing API tests, these have to be derived from the API specifications or, more realistically, by pestering the customer’s liaison contact.

Internal Network Tests often occur in phases which are ordered from “quiet” to “loud” when it comes to visibility. A typical assignment targeting a Microsoft Active Directory might include the following phases: initially, only network access is granted. The attacker either sniffs the network for exposed access credentials or utilizes MitM- and spoofing attacks to gain user credentials or tokens. In addition, anonymously accessible network shares are investigated for “juicy” information such as user or admin credentials. Exploits are used against vulnerable network services if the risks of detection and to stability are deemed acceptable. In the second phase, an attacker has either already gathered user credentials or has been provided with those by the customer. These credentials are typically for non-privileged domain users, and attackers utilize them to further enumerate shares, gain access to additional domain accounts or computers, or to gain local administrative privileges. Lateral Movement often occurs during this phase. In the next phase, the attacker has either gained or is provided local administrative
ESEC/FSE ’23, December 3–9, 2023, San Francisco, CA, USA Andreas Happe and Jürgen Cito
privileges and tries to perform further Lateral Movement until a domain administrative account is compromised. With that, the whole network is owned.

Please note that phases do not follow a traditional waterfall model. According to interviewees (𝑛 = 2), the domain admin credentials can often be gathered during the initial phase. This is then noted, and additional attacks are performed until the agreed-upon timebox is reached.

Many automated attacks, e.g., EternalBlue [25] or certify [7], were described as “too loud” or “unstable” for use during the initial phases. Another automation topic was the identification of “juicy” files within network shares: this activity is performed primarily manually as the identified data are context specific. In addition, creating a full copy of a network share is time- and network-sensitive as well as easily detectable, and countermeasure systems using honey-tokens are beginning to be deployed at customers’ sites.

IoT-Tests contain a wide range of potential targets. This includes both the tested IoT hardware, which can range from connected toys to life-sustaining IoT devices, as well as the wider ecosystem where IoT devices communicate with cloud infrastructure, web applications or mobile applications. During the initial enumeration phase, a model of the different communication channels as well as their responsibilities is often created. During hardware analysis, optical inspection is used to identify used processing, memory and storage components as well as their interaction and potential uses.

All interviewees in the IoT area mentioned applying industrial standards as well as the usage of checklists that included the OWASP IoT [36] and OWASP Firmware Testing guides [35].

Red-Teaming is special due to its evasion- and deception-based methods as well as through its objective-based approach. A red team initially has knowledge of its objective, e.g., gain access to a special server in department X, as well as a broad allowed scope, e.g., the targeted company. Teams initially model how to breach the company, e.g., by identifying potential social engineering victims. After the breach, low-key enumeration is used to covertly model “how a company works” and then abuse that knowledge to derive attacks that mirror expected traffic and behavior patterns. Throughout a red-teaming campaign, a map of known or breached elements is built and compared to the imagined map of the company that includes the final objective: if both converge, the objective should be achieved.

Automation employed for network lateral movement or breaching web applications originates from the other pen-testing disciplines, but has to be re-evaluated against its chance of being detected. As red-team assignments are performed against real and live systems, the scope of destructive operations might be limited.

OT-Tests have their own challenges. Due to the prevalence of proprietary protocols, time-consuming reverse engineering of those protocols must initially occur. Mentioned experiences of our interviewees indicate that Security-by-Obscurity is still common within this area; this would match the perceived resistance of some Industrial Control System (ICS) suppliers when faced with responsible disclosure requests. Due to the time burden of reverse engineering, it frequently has to be aborted due to the timeboxed nature of testing.

Due to the potentially catastrophic side-effects of testing, a risk-based approach is often applied: together with the customer, a threat model workshop can be performed and potential scenarios that warrant testing identified. Those scenarios, and only those, are subsequently manually executed against the OT system. As the available amount of time is fixed, threat modeling and performing the derived tests compete for the same temporal resources.

5.4 Automation
All interviewees used pre-made tooling, while few (𝑛 = 3) wrote additional tooling on their own. Overall, the tooling situation for specific testing areas was seen in a positive light. In contrast, “all-in-one” tools were seen in a negative light. Multiple interviewees remarked that a “fully automated tool cannot replace a pen-tester” or, as one interviewee cynically replied, “yeah, I want a tool where I can click a button and magically I get a finished pen-test report”. Practitioners relied on multiple small tools for different areas, e.g., gobuster [8] for content discovery or sqlmap [18] for testing SQL injections. PortSwigger’s BURP Proxy Suite [12] was used by every web application pen-tester interviewed. See Table 3 for a list of commonly named automated tools.

Table 3: Commonly Named Tools. # denotes the interviewee count.
Tool                          Area             Availability      #
PortSwigger BURP Suite [12]   Web-Testing      free, commercial  7
BloodHound [2]                AD Enumeration   OSS               5
SQLMap.py [18]                Web/SQLi         OSS               3
nmap [14]                     Network          OSS               7
nessus [13]                   Network          commercial        8
gobuster [8], dirbuster [6]   Network          OSS               2
certify [7]                   AD Exploitation  OSS               4
metasploit [11]               Exploitation     OSS               3
nuclei [15]                   Exploitation     OSS               3

Problems with tooling. Interviewees remarked that the setup overhead of automation tools can be problematic. Especially for short-term projects, such as vulnerability assessments or tightly-timed web application pen-tests, the initial setup overhead and processing time can be prohibitive for deploying tooling. Another problem was coverage: even within the same problem area, the coverage of different tools diverges widely, and the situation is made worse as commonly no tool provides full coverage of a testing area. To counteract this, practitioners commonly use multiple tools redundantly, which adds processing time overhead and requires manual merging of the different tools’ results.

Some areas were described as not suitable for automation. As OT systems are finicky and the potential fallout catastrophic, automated tests are often not feasible. Additionally, when performing social engineering during red-team assignments, fully automated tools are avoided both for fear of detection and because of ethical qualms, as they would be used on human targets.

Extendability and Community was identified as an important discriminator by practitioners. Both are related to fast-paced
Understanding Hacker’s Work ESEC/FSE ’23, December 3–9, 2023, San Francisco, CA, USA
developments within the exploit community: if a tool can be proactively extended or scripted by the community, it and its implemented methods can evolve faster compared to reactive development within walled gardens. An example of an OSS tool utilizing community-provided detection rules is nuclei [15]; an example of a commercial tool with good OSS extendability is the PortSwigger BURP Proxy Suite [12] with its integrated BApp Store.

Manual fine-tuning to reduce search space. Multiple interviewees mentioned that they adjust the tooling according to their ongoing findings. Examples of this feedback loop would be limiting tested vulnerability classes to feasible ones, e.g., not testing a static website for SQL injections, or limiting tested database queries to concrete database dialects.

6 HOW DO HACKERS THINK?
While Section 5 describes the externally visible different types and activities performed during security testing, this section focuses on the inner workings and thoughts of security professionals during security testing, detailing their decision processes and potential sources of their intrinsic motivation.

6.1 Exploiting Configuration vs. Applications
A recurring theme was the distinction between searching for known vulnerabilities and hunting for new vulnerabilities. Examples of the former would be executing a vulnerability scan against off-the-shelf software, or investigating a Microsoft Active Directory for misconfigurations. An example of the latter would be searching for unknown SQL injection vulnerabilities within a custom-written web application or discovering a new vulnerability class. Synonyms given for “searching for known vulnerabilities vs. hunting for new vulnerabilities” were “vulnerability assessments vs. application security” or “hacking configuration vs. hacking programs”.

These two categories are fluid. For example, findings from “hunting for bugs”, i.e., a new 0-day exploit against a piece of software, can end up within “searching for known vulnerabilities”, i.e., when a rule for detecting the 0-day is added to a web vulnerability scanner.

While not stated explicitly during the interviews, we assume that our interviewees’ mental model is primed through their understanding of this divide, which highly impacts tool and technique selection. As an interviewee mentioned, “you don’t hunt for 0-days during an Active Directory assignment”. This implies that pen-testers will not consider spending days fuzzing a domain controller for new vulnerabilities during an internal network scan.

6.2 Identifying Vulnerable Areas or Operations
Participants often described exploratory testing during which they were guided by intuition. Through follow-up questions, further information about this intuition was gathered.

All interviewees were analyzing requests and responses; the former for conspicuous parameters and the latter for occurrences of error messages or other suspicious behavior, that is, behavior that does not fulfill the testers’ expectations.

During the interviews, multiple areas were identified where security testers possessed a mental model of the expected behavior of the software-under-test; during testing, security testers were trying to find operations that could trigger unexpected behavior which, in turn, might turn into a security vulnerability. Those mental models were built from experience, e.g., prior assignments or experience within the specific business area, as well as adapted during the security test itself, e.g., “learning how the application works”. A summarization of multiple observed mental models can be seen in Table 5.

Pen-testers attributed their intuition to experience which could be built from performed penetration tests, participation in CTF events, prior engagements with the same client or industry area,
or by implementing similar software solutions during their former life as software developers. Participants remarked that during testing, they are triggered by vulnerabilities or exploits they had recently read about and, in response, would start additional research. One penetration tester explicitly mentioned creating a topic map during everyday research which they then refer back to during assignments.

Related to experience, practitioners had preconceptions about the technologies used or features implemented. Some functionality, e.g., file uploads or XML processing, was thought to be hard to implement in a secure manner; to quote a participant, “there are some things that just cannot be implemented correctly”. Similar resentments were discovered about used technologies. Some programming languages were deemed to increase the probability of an application containing defects; an interviewee mentioned thinking “let’s see how developers have been fooled again” when going into assignments. As cynical as it may be, PHP was often mentioned as such a technology.

It is important to note that participants may be subject to selection and survivor bias. They might find vulnerabilities in areas they focus on, ignoring plentiful vulnerabilities in other areas they have historically ignored. After a vulnerability has been found in an area, the increased attention upon that area often yields multiple subsequent vulnerabilities [9].

Two distinct positions were encountered regarding the learnability of this intuition. On the one hand, “nobody is born a super hacker”; on the other hand, one interviewee mentioned that the best penetration testers in their peer group exhibited hacking-style behavior already during kindergarten. Debating nature-vs-nurture or art-vs-craft would go beyond the scope of this publication. Regardless of this, common consensus was found that hacking skills are improved through practice.

6.3 Dealing with Uncertainty
Pen-testers routinely have to deal with uncertainty as they lack transparency of the tested system: pen-testers must make assumptions about requirements, the tested system’s architecture, as well as about accepted input values and the corresponding expected output parameters. They evaluate those against their expectations, and if a system deviates, examine the deviation for exploitability. When in doubt, testers can escalate and query their clients, but this is deemed to be time-inefficient and thus minimized.

Concrete examples of uncertainty would be a pen-tester issuing an HTTP request where they expect an “access denied” response but instead receive a successful response containing data which cannot be clearly classified as belonging to the current user or not. Another example would be testing for time-based blind SQL injection vulnerabilities where the measured latency is not sufficiently deterministic for verifying the vulnerability. Similarly, second-order attacks cannot easily be pinned on the invoking background process.

Penetration testers modify existing valid requests to include malicious payloads. When these requests produce errors, the reason can be uncertain: was it a potential vulnerability? A successful input filtering algorithm? Or an application error that cannot be exploited? This classification impacts the selection of subsequent requests and attacks.

Another instance of uncertainty occurs during tool optimization: tool output is continuously used to further optimize subsequent tool invocations. Interviewees checked whether reported system fingerprints were plausible and discarded them otherwise. In addition, some high-impact decisions, such as limiting the expectations to a single DBMS type, were verified with the client before incorporating them into tooling selection or configuration.

6.4 Don’t waste my time
One theme discovered was that interviewees feel the need to be time-efficient. This might be related to tight time-budgets or very constrained test-bed availability being anathema to good test coverage. Shortcuts were taken to reduce menial tasks. For example, during internal network tests, a breach is already assumed. The interviewees defended this decision through “this will eventually happen through social engineering anyways”. A similar argument was given for being provided accounts with local administrative privileges: “a real attacker can just wait for the next 0-day”, or for disabling Anti-Virus solutions as evading them “takes time not skill”.

Tests with foregone conclusions were considered tedious; one example given was testing an Anti-Virus solution embedded within a web application with different payloads. The repetitiveness of this task might contribute to this too. This aversion to responsible disclosure procedures might be correlated to bad experiences during prior disclosures: the vendor’s responses were mostly “wasting” the interviewee’s time.
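As an illustration of the latency problem named in Section 6.3 (time-based blind SQL injection where the measured latency is not sufficiently deterministic), the following minimal Python sketch classifies a probe from two sets of measured response times. It is not taken from the interviews: the function name `classify_timing`, the `tolerance` parameter and the chosen thresholds are assumptions made for illustration, and the sketch deliberately returns "inconclusive" whenever baseline noise is too large relative to the injected delay.

```python
from statistics import median


def classify_timing(baseline, injected, delay_s, tolerance=0.5):
    """Classify a time-based blind SQLi probe from latency samples (seconds).

    baseline: latencies of the unmodified request
    injected: latencies of the request carrying a delay payload
    delay_s:  delay the payload is supposed to cause, in seconds
    tolerance: hypothetical fraction of delay_s used for both thresholds
    """
    # Too few samples: do not guess.
    if len(baseline) < 3 or len(injected) < 3:
        return "inconclusive"

    base_med = median(baseline)
    # Median absolute deviation as a robust estimate of baseline noise.
    noise = median(abs(x - base_med) for x in baseline)
    shift = median(injected) - base_med

    # If the baseline jitters about as much as the injected delay,
    # the measurement cannot verify anything.
    if noise >= delay_s * tolerance:
        return "inconclusive"
    if shift >= delay_s * (1 - tolerance):
        return "likely vulnerable"
    if shift <= delay_s * tolerance:
        return "likely not vulnerable"
    return "inconclusive"


if __name__ == "__main__":
    quiet = [0.10, 0.12, 0.11, 0.13, 0.10]
    delayed = [5.10, 5.20, 5.15, 5.30, 5.12]
    print(classify_timing(quiet, delayed, delay_s=5))
```

The explicit "inconclusive" outcome mirrors the interviewees' situation: a noisy measurement is reported as uncertain and left for manual follow-up instead of being forced into a verdict.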
CVEs. Security Practitioners are more focused upon hunting configuration errors, exploiting well-known vulnerabilities, or identifying new instances of known attack classes. They utilize information and tools from security researchers for that.

Tools such as fuzzers are thus more applicable to security researchers than to security practitioners. The large amount of research into fuzzing indicates that academic research is targeting security researchers rather than practitioners and is thus only indirectly improving the security landscape when information from security researchers trickles down to practitioners.

7.2 Opportunities for Research
We now want to answer the important final question, “What tedious or time-consuming areas could be improved?” throughout the rest of this section and frame them as opportunities for future research that directly benefits security practitioners.

7.2.1 Automating Authorization Testing. For security tests with a relatively restricted scope such as web application tests, we suggest research into covering additional vulnerability classes. Authorization Testing is currently performed manually and was named one of the most time-consuming parts of testing and thus would be a fruitful target for automation research. Current gaps are manifold: detection of potential operations, accepted parameters, and potentially malicious parameters; generation of payloads as well as the assessment of an attack’s success. A subtle problem is the classification of returned web pages and downloads into authorized and unauthorized content as this is highly context specific.

7.2.2 Gray-box Testing. The preference for gray-box testing in the work practice of software security professionals was surprising and can have a significant impact on software testing design: if security testing solutions can access configuration or the target’s source code (or if the target is willing to instrument the target software through sensors as is done in IAST), automated software testing approaches using source-code or configuration become increasingly feasible for security testing. Further research into potential synergy effects, as well as research into automated source code and configuration file analysis from a security perspective, is currently underexplored and ripe for investigation. Research in this area yields dual-use tools, aiding both offensive security professionals searching for vulnerabilities as well as defensive software developers trying to prevent vulnerabilities from entering their code in the first place.

7.2.3 API Workflow Discovery for Security Test Generation. Interviewees lamented that the manual creation of API security test-cases is a tedious and time-consuming process. While the automation of API test generation would be advantageous, the following gaps currently prevent this: discovery of API endpoints and operations, generation of benign requests as baseline, combining single requests into test flows using social and semantic information, deriving malicious test cases, and finally evaluating test outcomes. The automatic generation of security test suites based upon API definitions and traffic patterns would reduce testers’ odium for utilizing this important class of testing. While there have been several works that propose approaches for API discovery [53, 60], the kind of discovery we envision would focus on maximizing coverage for security tests.

7.2.4 Information Discovery for Security Testing. Internal Network Tests and Red-Teaming are highly dependent on discovering and utilizing client-specific information. Information gathering from compromised systems or network shares is currently performed manually; automated and stealthy information gathering could thus improve its efficiency. The goal is the efficient identification of “juicy” information while reducing the number of read requests to minimize network impact or the chance of triggering intrusion detection systems. Research in this area would also benefit defenders as it would make forensic work, e.g., analyzing data breaches, more efficient.

7.2.5 Scaling Personalized Phishing with ML. Phishing is an important part of the red-teaming workflow and is commonly done manually, due to the customization that proper phishing requires. We see an opportunity to investigate the increase of scalability of social engineering through machine learning techniques. To create highly effective phishing mails, mails are currently customized manually to fit the respective recipient. Machine learning techniques could automate this and thus provide Spear Phishing at Scale, as they have already been shown to personalize natural language communication in other domains [30, 40, 59].
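To make the authorization-testing gaps of Section 7.2.1 more tangible, the following minimal Python sketch compares recorded responses for the same operation under different roles. It is a sketch under stated assumptions, not a method reported by the interviewees: the `Observation` record, the function `flag_authorization_candidates`, the `expected_access` mapping and the example operations are all hypothetical. Responses that cannot be matched against an authorized role's content are only marked for manual review, reflecting that this classification is highly context specific.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One recorded response (all field values are hypothetical examples)."""
    operation: str  # e.g. "GET /admin/users"
    role: str       # identity used for the request
    status: int     # HTTP status code
    body: str       # response body (or a digest of it)


def flag_authorization_candidates(observations, expected_access):
    """Return (operation, role, reason) tuples worth a tester's attention.

    expected_access maps an operation to the set of roles that are
    supposed to be allowed to perform it (an assumption supplied by the tester).
    """
    # Group observations per operation and role.
    by_operation = {}
    for obs in observations:
        by_operation.setdefault(obs.operation, {})[obs.role] = obs

    findings = []
    for operation, per_role in by_operation.items():
        allowed = expected_access.get(operation, set())
        # Bodies seen by roles that are expected to have access.
        reference_bodies = {
            per_role[role].body for role in allowed if role in per_role
        }
        for role, obs in per_role.items():
            if role in allowed or obs.status != 200:
                continue
            if obs.body in reference_bodies:
                reason = "same content as authorized role"
            else:
                # Successful response, but content cannot be classified
                # automatically; leave the decision to the tester.
                reason = "needs manual review"
            findings.append((operation, role, reason))
    return findings


if __name__ == "__main__":
    recorded = [
        Observation("GET /admin/users", "admin", 200, "user list"),
        Observation("GET /admin/users", "guest", 200, "user list"),
        Observation("GET /profile", "guest", 403, "denied"),
    ]
    expected = {"GET /admin/users": {"admin"}, "GET /profile": {"admin"}}
    for finding in flag_authorization_candidates(recorded, expected):
        print(finding)
```

The sketch covers only the assessment step; detecting operations and parameters and generating payloads, which the section lists as further gaps, are outside its scope.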
[29] Durumeric, Z., Li, F., Kasten, J., Amann, J., Beekman, J., Payer, M., Weaver, N., Adrian, D., Paxson, V., Bailey, M., et al. The matter of heartbleed. In Proceedings of the 2014 conference on internet measurement conference (2014), pp. 475–488.
[30] Ferretti, S., Mirri, S., Prandi, C., and Salomoni, P. Automatic web content personalization through reinforcement learning. Journal of Systems and Software 121 (2016), 157–169.
[31] Francis, J. J., Johnston, M., Robertson, C., Glidewell, L., Entwistle, V., Eccles, M. P., and Grimshaw, J. M. What is an adequate sample size? operationalising data saturation for theory-based interview studies. Psychology and health 25, 10 (2010), 1229–1245.
[32] Gascon, H., Wressnegger, C., Yamaguchi, F., Arp, D., and Rieck, K. Pulsar: Stateful black-box fuzzing of proprietary network protocols. In Security and Privacy in Communication Networks: 11th EAI International Conference, SecureComm 2015, Dallas, TX, USA, October 26-29, 2015, Proceedings 11 (2015), Springer, pp. 330–347.
[33] Guba, E. G., Lincoln, Y. S., et al. Competing paradigms in qualitative research. Handbook of qualitative research 2, 163-194 (1994), 105.
[34] Guest, G., Bunce, A., and Johnson, L. How many interviews are enough? an experiment with data saturation and variability. Field methods 18, 1 (2006), 59–82.
[35] Guzman, A. Owasp firmware security testing methodology. https://scriptingxss.gitbook.io/firmware-security-testing-methodology/. Accessed: 2022-09-30.
[36] Guzman, A., and Bassem, C. Owasp iot security verification standard. https://github.com/OWASP/IoT-Security-Verification-Standard-ISVS/releases/download/1.0RC/OWASP_ISVS-1.0RC-en_WIP_.pdf, Dec 2020.
[37] Harang, R., and Ducau, F. N. Measuring the speed of the red queen’s race. BlackHat: Las Vegas, NV, USA (2018).
[38] Holguera, C., Müller, B., Schleier, S., and Willemsen, J. Owasp mobile application security verification standard. https://github.com/OWASP/owasp-masvs/releases/latest/download/OWASP_MASVS-v1.4.2-en.pdf, Jan 2022.
[39] Huaman, N., von Skarczinski, B., Wermke, D., Stransky, C., Acar, Y., Dreissigacker, A., and Fahl, S. A large-scale interview study on information security in and attacks against small and medium-sized enterprises. In 30th USENIX Security Symposium (2021).
[40] Katakis, I., Tsoumakas, G., Banos, E., Bassiliades, N., and Vlahavas, I. An adaptive personalized news dissemination system. Journal of intelligent information systems 32 (2009), 191–212.
[41] Kotulic, A. G., and Clark, J. G. Why there aren’t more information security research studies. Information & Management 41, 5 (2004), 597–607.
[42] Mackenzie, N., and Knipe, S. Research dilemmas: Paradigms, methods and methodology. Issues in educational research 16, 2 (2006), 193–205.
[43] Munaiah, N., Rahman, A., Pelletier, J., Williams, L., and Meneely, A. Characterizing attacker behavior in a cybersecurity penetration testing competition. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM) (2019), IEEE, pp. 1–6.
[44] Potter, B., and McGraw, G. Software security testing. IEEE Security & Privacy 2, 5 (2004), 81–85.
[45] Saad, E., and Mitchell, R. Owasp web security testing guide. https://github.com/OWASP/wstg/releases/download/v4.2/wstg-v4.2.pdf, Dec 2020.
[46] Saunders, M., and Tosey, P. The layers of research design. Tech. rep., University of Surrey, 2013.
[47] Schleier, S., Mueller, B., Holguera, C., and Willemsen, J. Owasp mobile application security testing guide. https://github.com/OWASP/owasp-mastg/releases/latest/download/OWASP_MASTG-v1.5.0.pdf, Sep 2022.
[48] Singer, L., Figueira Filho, F., and Storey, M.-A. Software engineering at the speed of light: how developers stay current using twitter. In Proceedings of the 36th International Conference on Software Engineering (2014), pp. 211–221.
[49] Smith, J., Theisen, C., and Barik, T. A case study of software security red teams at microsoft. In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) (2020), IEEE, pp. 1–10.
[50] Strom, B. E., Applebaum, A., Miller, D. P., Nickels, K. C., Pennington, A. G., and Thomas, C. B. Mitre att&ck: Design and philosophy. In Technical report. The MITRE Corporation, 2018.
[51] Summers, T. C. How hackers think: A mixed method study of mental models and cognitive patterns of high-tech wizards. Case Western Reserve University, 2015.
[52] Takanen, A., Demott, J. D., Miller, C., and Kettunen, A. Fuzzing for software security testing and quality assurance. Artech House, 2018.
[53] Torres, R., Tapia, B., et al. Improving web api discovery by leveraging social information. In 2011 IEEE International Conference on Web Services (2011), IEEE, pp. 744–745.
[54] van den Hout, N. J. Standardised Penetration Testing? Examining the Usefulness of Current Penetration Testing Methodologies. PhD thesis, 09 2019.
[55] van der Stork, A., Glas, B., Smithline, N., and Gigler, T. Owasp top 10:2021. https://owasp.org/Top10/0x00-notice/, Sep 2021.
[56] van der Stork, A., Grossman, J., Cuthbert, D., Lang, E., and Manico, J. Owasp application security verification standard. https://raw.githubusercontent.com/OWASP/ASVS/v4.0.3/4.0/OWASP%20Application%20Security% Oct 2021.
[57] Wysopal, C., Nelson, L., Dustin, E., and Dai Zovi, D. The art of software security testing: identifying software security flaws. Pearson Education, 2006.
[58] Wysopal, C., Nelson, L., Dustin, E., and Dai Zovi, D. The art of software security testing: identifying software security flaws. Pearson Education, 2006.
[59] Xu, M., Qian, F., Mei, Q., Huang, K., and Liu, X. Deeptype: On-device deep learning for input personalization service with minimal privacy concern. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 4 (2018), 1–26.
[60] Yessenov, K., Kuraj, I., and Solar-Lezama, A. Demomatch: Api discovery from demonstrations. ACM SIGPLAN Notices 52, 6 (2017), 64–78.