1 Introduction

Software systems are developed to be deployed and used. Operating software in a production environment, however, entails several challenges: Among others, it is crucial to ensure that the software system behaves exactly as it does in a development environment. Virtualization and, above all, containerization technologies are increasingly being used to meet such a requirementFootnote 1. Among them, DockerFootnote 2 is one of the most popular platforms used in the DevOps workflow: It is the main containerization framework in the open-source community (Cito et al. 2017), and is widely used by professional developersFootnote 3. Docker was also the most loved and most wanted platform in the 2021 StackOverflow surveyFootnote 3. Docker allows releasing applications together with their dependencies through containers (i.e., virtual environments) sharing the host operating system kernel. Each Docker image is defined through a Dockerfile, which contains the instructions to build the image containing the application. All public Docker images are hosted on an online repository called DockerHubFootnote 4. Since its introduction in 2013, Docker counts 3.3M Docker Desktop installations and 318B image pulls from DockerHubFootnote 5.

Defining Dockerfiles, however, is far from trivial: Each application has its own dependencies and requires specific configurations for the execution environment. Previous work (Wu et al. 2020) introduced the concept of Dockerfile smells, i.e., violations of best practices, similar to code smells (Becker et al. 1999), together with a catalog of such problemsFootnote 6. The presence of such smells might increase the risk of build failures and lead to oversized images and security issues (Cito et al. 2017; Zhang et al. 2018; Henkel et al. 2020; Zerouali et al. 2019). Previous work studied the prevalence of Dockerfile smells (Cito et al. 2017; Lin et al. 2022; Eng and Hindle 2021).

Despite the popularity and adoption of Docker, there is still a lack of tools to support developers in improving the quality and reliability of containerized applications, e.g., tools for the automatic refactoring of code smells in Dockerfiles (Ksontini et al. 2021). Relevant studies in this area investigated the prevalence of Dockerfile smells in open-source projects (Cito et al. 2017; Wu et al. 2020; Lin et al. 2022; Eng and Hindle 2021), the diffusion of technical debt (Azuma et al. 2022), and the refactoring operations typically performed by developers (Ksontini et al. 2021).

While it is clear which Dockerfile smells are more frequent than others, it is still unclear which smells are more important to developers. A previous study by Eng and Hindle (2021) reported how the number of smells evolves over time. Still, there is no clear evidence showing that (i) developers actually fix Dockerfile smells (e.g., they might incidentally disappear), and that (ii) developers would be willing to fix Dockerfile smells in the first place.

In this paper, we propose a study to fill this gap. First, we analyze the survivability of Dockerfile smells to understand how developers fix them and which smells they consider relevant to remove. This, however, only tells part of the story: Developers might not correct some smells because they are harder to fix. Therefore, we also evaluated to what extent developers are willing to accept fixes to smells when they are proposed to them (e.g., by an automated tool). The context of the study is represented by a total of ~220k commits and 4,255 repositories, extracted from a state-of-the-art dataset containing the change history of about 9.4M unique Dockerfiles.

For each instance of such a dataset (i.e., a Dockerfile snapshot), we extracted the list of Dockerfile smells using the hadolint tool (Hadolint 2022). The tool performs rule checks on a parsed Abstract Syntax Tree (AST) representation of the input Dockerfile, based on the Docker (Best practices for writing Dockerfiles 2022) and shell script (ShellCheck 2022) best practices. Next, we manually validate a total of 1,000 commits that make one or more smells disappear to verify (i) that they are real fixes (e.g., that the smell was not removed incidentally), (ii) whether the fix is informed (e.g., whether developers explicitly mention the fix in the commit message), and (iii) that the removed smells are not false positives of hadolint.
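hadolint can emit machine-readable output (via its `--format json` option), which makes it straightforward to collect the detected smells per snapshot. A minimal sketch, assuming the `code` and `line` fields of the JSON report (the sample report below is illustrative, not taken from the study):

```python
import json

# Example hadolint output for one Dockerfile snapshot, in the JSON format
# produced by `hadolint --format json Dockerfile` ("code" is the rule id,
# "line" the 1-based line number; field names assumed from the tool's docs).
raw = '''[
  {"code": "DL3008", "line": 4, "level": "warning",
   "message": "Pin versions in apt get install"},
  {"code": "DL3020", "line": 7, "level": "error",
   "message": "Use COPY instead of ADD for files and folders"}
]'''

def smells_by_rule(report_json: str) -> dict:
    """Group the detected smells as {rule_id: [line, ...]}."""
    findings = json.loads(report_json)
    grouped: dict = {}
    for finding in findings:
        grouped.setdefault(finding["code"], []).append(finding["line"])
    return grouped

print(smells_by_rule(raw))  # {'DL3008': [4], 'DL3020': [7]}
```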

Then, we evaluated to what extent developers are willing to accept changes aimed at fixing smells. To this aim, we defined Dockleaner, a rule-based refactoring tool that automatically fixes the 12 most frequent Dockerfile smells. We used Dockleaner to fix a set of smelly Dockerfiles extracted from the most active repositories. Next, we submitted to developers a total of 157 pull requests containing the fixes, one for each repository. We monitored the status of the pull requests for more than 7 months (i.e., 218 days). In the end, we evaluated how many of them were accepted for each smell type, along with the developers’ reactions. The results show that smells are mostly fixed very shortly after their introduction (36% of the cases). There are also cases in which they are fixed after a very long period (2% after 2 years). This could be a consequence of the fact that, generally, few changes are performed on Dockerfiles: The probability of noticing an error is higher in the short term (e.g., until the Dockerfile works correctly) and then naturally increases with time, but very slowly. Also, developers change Dockerfiles mainly to optimize the build time and reduce the final image size, while only few changes are limited to the improvement of code quality. Even if Dockerfile smells are widely diffused, developers are gradually becoming aware of the best practices for writing Dockerfiles. For example, they avoid the usage of MAINTAINER, which is deprecated, or they prefer COPY over ADD for copying files and folders, as suggested by the Docker guidelinesFootnote 7. In addition, developers are open to approving changes aimed at fixing the most common violations, but with some exceptions. An example is the missing version pinning for apt-get packages (DL3008), which received negative reactions from developers. However, version pinning in general is considered fundamental for other aspects, such as the pinning of the base image (DL3006 and DL3007) or of software dependencies (e.g., npm and pip).

To summarize, the contributions of our study are the following:

  1. We performed a detailed analysis of the survivability of Dockerfile smells and manually validated a sample of smell-fixing commits for Dockerfile smells;

  2. We introduced Dockleaner, a rule-based tool to fix the most common Dockerfile smells;

  3. We ran an evaluation, via pull requests, of the willingness of developers to accept changes aimed at fixing Dockerfile smells.

The remainder of the paper is organized as follows: In Section 2 we provide a general overview of Dockerfile smells and related work. Section 3 describes the design of our study, Section 4 details the experimental procedure, and in Section 5 we present the results of our experiment. In Section 6 we qualitatively discuss the results. Finally, Section 7 discusses the threats to validity, and in Section 8 we summarize final remarks and future directions.

2 Background and Related Work

Technical debt (Kruchten et al. 2012) has a negative impact on software maintainability. A symptom of technical debt is represented by code smells (Becker et al. 1999). Code smells are poor implementation choices that do not follow design and coding best practices, such as design patterns, and can negatively impact the maintainability of the overall software system. Code smells are mainly defined for object-oriented systems; some examples are duplicated code or god class (i.e., a class having too many responsibilities). In the following, we first introduce the smells that affect Dockerfiles, and then we report recent studies on their diffusion and on the practices used to improve Dockerfile quality.

Dockerfile smells Docker reports an official list of best practices for writing Dockerfiles (Best practices for writing Dockerfiles 2022). Such best practices also include indications for writing the shell script code included in the instructions of Dockerfiles, for example, the usage of the WORKDIR instruction instead of the bash command cd to change directory. This is because each Docker instruction defines a new layer at build time. The violation of such practices leads to the introduction of Dockerfile smells: With Dockerfile smells, we indicate instructions of a Dockerfile that violate the writing best practices and thus can negatively affect its quality (Wu et al. 2020). The presence of Dockerfile smells can also have a direct impact on the behavior of the software in a production environment. For example, previous work showed that missing adherence to best practices can lead to security issues (Zerouali et al. 2019), negatively impact the image size (Henkel et al. 2020), increase the build time, and affect the reproducibility of the final image (i.e., build failures) (Cito et al. 2017; Zhang et al. 2018; Henkel et al. 2020). For example, the version pinning smell, which consists in a missing version number for software dependencies, can lead to build failures because dependency updates can change the execution environment. There are several tools that support developers in writing Dockerfiles. An example is the binnacle tool, proposed by Henkel et al. (2020), which checks best-practice rules defined on the basis of a dataset of Dockerfiles written by experts. The reference tool used in the literature for the detection of Dockerfile smells is hadolint (Hadolint 2022). Such a tool checks a set of best-practice violations on a parsed AST version of the target Dockerfile using a rule-based approach. Hadolint detects two main categories of issues: Docker-related and shell-script-related.
The former affect Dockerfile-specific instructions (e.g., the usage of an absolute path in the WORKDIR instructionFootnote 8). They are identified by a name with the prefix DL followed by a number. The shell-script-related violations, instead, specifically regard the shell code in the Dockerfile (e.g., in the RUN instructions). Such violations are a subset of the ones detected by the ShellCheck tool (ShellCheck 2022), and they are identified by the prefix SC followed by a number. It is worth noting that these rules can be updated and changed over time. For example, since the MAINTAINER instruction has been deprecated, rule DL4000, which previously checked that the instruction was present (once a best practice), has been updated to report its usage as deprecated.

Diffusion of Dockerfile smells A general overview of the diffusion of Dockerfile smells was proposed by Wu et al. (2020). They performed an empirical study on a large dataset of 6,334 projects to evaluate which Dockerfile smells occur more frequently, along with their coverage and distribution, and with a particular focus on the relation with the characteristics of the project repository. They found that nearly 84% of GitHub projects containing Dockerfiles are affected by Dockerfile smells, and that the Docker-related smells are more frequent than the shell-script ones. Also in this direction, Cito et al. (2017) performed an empirical study to characterize the Docker ecosystem in terms of quality issues and evolution of Dockerfiles. They found that the most frequent smell regards the lack of version pinning for dependencies, which can lead to build failures. Lin et al. (2022) conducted an empirical analysis of Docker images from DockerHub and the git repositories containing their source code. They investigated different characteristics such as base images, popular languages, image tagging practices, and evolutionary trends. The most interesting results are those related to the prevalence of Dockerfile smells over time, where the version pinning smell is still the most frequent. On the other hand, the smells identified as DL3020 (i.e., COPY usage), DL3009 (i.e., clean apt cache), and DL3006 (i.e., image version pinning) are no longer as prevalent as before. Furthermore, violations DL4006 (i.e., usage of pipefail) and DL3003 (i.e., usage of cd) became more prevalent. Eng and Hindle (2021) conducted an empirical study on the largest dataset of Dockerfiles, spanning from 2013 to 2020 and containing over 9.4 million unique instances. They performed a historical analysis of the evolution of Dockerfiles, reproducing the results of previous studies on their dataset. Also in this case, the authors found that the smells related to version pinning (i.e., DL3006, DL3008, DL3013 and DL3016) are the most prevalent.
In terms of Dockerfile smell evolution, they show that the count of code smells is slightly decreasing over time, thus hinting at the fact that developers might be interested in fixing them. Still, the reason behind their disappearance is unclear, e.g., whether developers actually fix them or whether they get removed incidentally.

3 Study Design

The goal of our study is to understand whether developers are interested in fixing Dockerfile smells. The perspective is that of researchers interested in improving Dockerfile quality. The context consists of 53,456 Dockerfile snapshots, extracted from 4,255 repositories.

In detail, the study aims to address the following research questions:

  • RQ\(_{\varvec{1}} {\textbf {:}}\) How do developers fix Dockerfile smells? We want to conduct a comprehensive analysis of the survivability of Dockerfile smells. Thus, we investigate what smells are fixed by developers and how.

  • RQ\(_{\varvec{2}} {\textbf {:}}\) Which Dockerfile smells are developers willing to address? We want to understand if developers would find beneficial changes aimed at fixing Dockerfile smells (e.g., generated by an automated refactoring tool).

3.1 Study Context

The context of our study is represented by a subset of the dataset introduced by Eng and Hindle (2021). The dataset consists of about 9.4 million Dockerfiles, covering a period spanning from 2013 to 2020. To the best of our knowledge, it is the largest and most recent dataset among those available in the literature (Cito et al. 2017; Henkel et al. 2020; Ksontini et al. 2021). Moreover, it contains the change history (i.e., the commits) of each Dockerfile, which allows us to evaluate the survivability of code smells (RQ\(_{{\textbf {1}}}\)). The authors constructed the dataset by mining software repositories from the S version of the World of Code (WoC) dataset (Ma et al. 2019).

3.2 Data Collection

To avoid toy projects, we selected only the repositories having at least 10 stars, for a total of 4,255 repositories, excluding forks. We also discarded the repositories for which the star number is not available in the original dataset (i.e., the value is reported as NULL). We cloned all the available repositories from the selected sample to obtain the most updated commit data at the time our analysis started (i.e., March 2023). Next, using a heuristic approach, we (i) identified all the Dockerfiles at the latest commit, and (ii) traversed the commit history to get all the commits and snapshots of each identified Dockerfile. In detail, for the first step, we processed all the source files contained in the repository and checked whether each file (i) contains the word "dockerfile" in the filename, and (ii) contains valid and non-empty commands, i.e., can be correctly parsed using the official Dockerfile parserFootnote 9. For each valid Dockerfile, we mined the change history using git log. We excluded the Dockerfiles having only one snapshot (i.e., no changes, referenced by a single commit). After this, we extracted a total of ~220k commits corresponding to 53,456 unique Dockerfiles. In the end, we ran the latest version of hadolintFootnote 10 on each Dockerfile to extract the Dockerfile smells, if present.
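The filename filter of the heuristic and the per-file history mining with git log can be sketched as follows; `is_candidate_dockerfile` and `history_commits` are hypothetical helper names, and the content validation with the official parser is omitted:

```python
import subprocess

def is_candidate_dockerfile(path: str) -> bool:
    """First filter of the heuristic: the filename must contain the word
    "dockerfile" (case-insensitive). Content validity is then checked
    separately with a Dockerfile parser."""
    filename = path.rsplit("/", 1)[-1]
    return "dockerfile" in filename.lower()

def history_commits(repo_dir: str, path: str) -> list:
    """Commits touching one Dockerfile, oldest first (via git log)."""
    out = subprocess.run(
        ["git", "log", "--follow", "--reverse", "--format=%H", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True)
    return out.stdout.split()

# Dockerfiles with a single snapshot (one commit) are discarded:
#   keep = [p for p in files if len(history_commits(repo, p)) > 1]

print(is_candidate_dockerfile("docker/Dockerfile.prod"))  # True
print(is_candidate_dockerfile("src/main.py"))             # False
```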

Fig. 1: Overall workflow of the experimentation procedure

4 Experimental Procedure

In this section, we describe the experimentation procedure that we will use to answer our RQs. Figure 1 describes the overall workflow of the study.

Fig. 2: Example of a candidate smell-fixing commit that does not actually fix the smell

4.1 RQ\(_{{\textbf {1}}}\): How do developers fix Dockerfile smells?

To answer RQ\(_{{\textbf {1}}}\), we perform an empirical analysis of Dockerfile smell survivability. For each Dockerfile d, associated with the respective repository from GitHub, we consider its snapshots over time, \(d_1, \dots , d_n\), associated with the respective commit IDs in which they were introduced (i.e., \(c(d_1), \dots , c(d_n)\)). We also consider the Dockerfile smells detected with hadolint, indicated as \(\eta (d_1), \dots , \eta (d_n)\). For each snapshot \(d_i\) (with \(i > 1\)) of each Dockerfile d, we compute the disappeared smells as \(\delta (d_i) = \eta (d_{i-1}) - \eta (d_i)\). All the snapshots for which \(\delta (d_i)\) is not an empty set are candidate changes that aim at fixing the smells. We define the set of all such snapshots as \( PF = \{d_i:|\delta (d_i)| > 0\}\). In the end, we obtain a set of smelly (\(d_{i-1}\)) and smell-removing (\(d_{i}\)) commit pairs. We implemented the described procedure as a basic heuristic approach, which (i) went through all the commits, (ii) executed hadolint to detect the smells, and (iii) returned the pairs of smelly and smell-removing commits. The total time required was about nine hours.
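The candidate-fix detection boils down to a set difference between consecutive snapshots. A minimal sketch, with smells represented simply as hadolint rule identifiers (the actual implementation runs on real commit data and also tracks line numbers):

```python
def disappeared(prev_smells: set, curr_smells: set) -> set:
    """Smells present in snapshot d_{i-1} but gone in d_i."""
    return prev_smells - curr_smells

# Smell sets per snapshot of one Dockerfile, as detected by hadolint.
history = [
    {"DL3008", "DL3020"},   # d_1
    {"DL3008"},             # d_2: DL3020 disappeared
    {"DL3008"},             # d_3: no change
]

# Candidate smell-fixing snapshots: those with a non-empty delta.
candidates = [i for i in range(1, len(history))
              if disappeared(history[i - 1], history[i])]
print(candidates)  # [1] -> only d_2 is a candidate fix
```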

Next, we manually evaluate the commit pairs to verify (i) that the changes that led to the snapshots in \( PF \) are actual fixes for the Dockerfile smell, (ii) whether developers were aware of the smell when they made the change, and (iii) that the removed smells are not false positives of hadolint. In detail, we manually inspect a sample of 1,000 of such candidate changes, which is statistically representative, leading to a margin of error of 3.1% (95% confidence interval) assuming an infinitely large population. We look at the code diff to understand how the change was made (i.e., if it fixed the smell or if the smell disappeared incidentally). Also, for actual fixes, we consider the commit message, the possible issues referenced in it, and the pull requests to which they possibly belong to understand the purpose of the change (i.e., if the fix was informed or not). We identify as a smell-fixing change a commit in which developers (i) modified one or more Dockerfile lines that contained one or more smells in the previous snapshot (i.e., commit), and (ii) kept the functionality expressed in those lines. For example, if the commit removes the instruction line where the smell is present, we do not label it as an actual smell-fixing commit: The smelly line is just removed, not fixed (i.e., the functionality changed). Let us consider the example in Fig. 2: The package wget lacks version pinning (left). An actual fix would consist of the addition of a version to the package. Instead, in the commit, the package simply gets removed (e.g., because it is not necessary). Therefore, we do not consider such a change as a fixing change. Besides, we mark a fix as informed if the commit message, the possibly related pull request, or the issue possibly fixed with the commit explicitly reports that the modification aimed at fixing a bad practice.

Two of the authors independently evaluated each instance. The evaluators discussed conflicts for both the evaluated aspects, aiming at reaching a consensus. The agreement between the two annotators is measured using Cohen’s Kappa coefficient (Cohen 1960), obtaining a value of \(k=0.79\), considered “very good” according to the interpretation recommendations (Regier et al. 2013). The total effort required for the manual validation was about five working days for the two authors who performed the annotation and discussed the conflicts.

Moreover, starting from the smell-fixing change, we go back through the change history to identify the last-smell-introducing commit, i.e., the commit from which the artifact can be considered smelly (Tufano et al. 2017), by executing git blame on the Dockerfile line number labeled as smelly by hadolint. In the end, we summarize the total number of fix commits and the percentage of actual fix commits. Moreover, for each rule violation, we report the trend of smell occurrences and fixes over time, along with a summary table describing the most fixed smells. We also discuss interesting cases of smell-fixing commits.
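The last-smell-introducing commit lookup can be sketched as follows; the helper names are ours, and the porcelain parsing is deliberately simplified (only the SHA at the start of the output is read):

```python
import subprocess

def blame_command(commit: str, path: str, line: int) -> list:
    """git blame invocation restricted (-L) to the smelly line reported by
    hadolint, starting from the snapshot that precedes the fix."""
    return ["git", "blame", "--porcelain",
            "-L", f"{line},{line}", commit, "--", path]

def introducing_sha(porcelain_output: str) -> str:
    """In porcelain format, the first token of the output is the SHA of the
    commit that last modified the requested line."""
    return porcelain_output.split(maxsplit=1)[0]

def last_smell_introducing_commit(repo_dir, commit, path, line):
    out = subprocess.run(blame_command(commit, path, line), cwd=repo_dir,
                         capture_output=True, text=True, check=True)
    return introducing_sha(out.stdout)

print(introducing_sha("1f2e3d4 1 1 1\nauthor Jane"))  # 1f2e3d4
```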

Table 1 The most frequent Dockerfile smells identified in literature (Eng and Hindle 2021), along with the most fixed rules we identified in our study (reported with \(*\))

4.2 RQ\(_{{\textbf {2}}}\): Which Dockerfile smells are developers willing to address?

To answer RQ\(_{{\textbf {2}}}\), we first defined a list of rules, based both on the literature and the results of RQ\(_{{\textbf {1}}}\), and then implemented a rule-based refactoring tool, Dockleaner, to automatically fix them. We defined the fixing rules as described in the hadolint documentationFootnote 11. Next, we use Dockleaner to fix smells in existing Dockerfiles from open-source projects and submit the changes to the developers through pull requests to understand if they agree with the fixes and are keen to accept them. We describe these steps in the following sections.

4.2.1 Fixing rules for Dockerfile Smells

As a preliminary step, we identified the set of Dockerfile smells that we wanted to fix, considering the list of the most occurring Dockerfile smells, ordered by prevalence, according to the most recent paper on this topic (Eng and Hindle 2021). However, we excluded and added some rule violations. Specifically, among the missing version pinning violations, we excluded DL3013 (Pin versions in pip) and DL3018 (Pin versions in apk add) because they are less occurring variants (i.e., ~4% and ~5%, respectively) of the more prevalent smell DL3008 (15%), even if concerning different package managers. Additionally, we included in Dockleaner the most occurring smells resulting from the analysis performed in RQ\(_{{\textbf {1}}}\) and not reported in the literature. We report in Table 1 the full list of smells targeted in our study, along with the rule we use to automatically produce a fix. Most of the smells are trivial to fix: For example, to fix violation DL3020, it is sufficient to replace the ADD instruction with COPY for files and folders. In the case of the version-pinning-related smells (i.e., DL3006 and DL3008), instead, a more sophisticated fixing procedure is required. We refer to version-pinning-related smells as the smells related to the missing versioning of dependencies and packages. Such smells can have an impact on the reproducibility of the build, since different versions might be used if the build occurs at different times, leading to different execution environments for the application. For example, when the version tag is missing from the FROM instruction of a Dockerfile (i.e., DL3006), the most recent image, having the latest tag, is automatically selected. To fix such smells, we use a two-step approach: (i) we identify the correct version to pin for each artifact (e.g., each package), and (ii) we insert the selected versions into the corresponding instruction lines of the Dockerfile.
We describe below in more detail the procedure we defined for each smell.

Image version tag (DL3006) This rule violation identifies a Dockerfile where the base image used in the FROM instruction is not pinned with an explicit tag. In this case, we use a fixing strategy inspired by the approach of Kitajima and Sekiguchi (2020). Specifically, to determine the correct image tag, we use the image name together with the image digest. Docker images are labeled with one or more tags, mainly assigned by developers, identifying a specific version of the image when pulled from DockerHub. The digest, instead, is a hash value that uniquely identifies a Docker image having a specific composition of dependencies and configurations, automatically created at build time. The digest of existing images can be obtained via the DockerHub APIsFootnote 12; thus, the only way to uniquely identify an image is through its digest. To fix the smell, (i) we obtain the digest of the input Docker image through a build, (ii) we find the corresponding image and its tags using the DockerHub APIs, and (iii) we pick the most recently assigned tag that is different from the “latest” tag. An example of a smell fixed through this rule is reported in Fig. 3.
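Step (iii), the tag selection, can be sketched as follows; the tag list is a hypothetical excerpt of what the DockerHub tag-listing API could return for the digest, and the field names are illustrative:

```python
from datetime import datetime

def pick_tag(tags: list) -> str:
    """Given the tags attached to the image matching the digest, pick the
    most recently pushed tag that is not "latest"."""
    named = [t for t in tags if t["name"] != "latest"]
    named.sort(key=lambda t: datetime.fromisoformat(t["pushed"]),
               reverse=True)
    return named[0]["name"]

# Hypothetical tag list for a python base image digest.
tags = [
    {"name": "latest",    "pushed": "2023-03-01T10:00:00"},
    {"name": "3.11-slim", "pushed": "2023-03-01T10:00:00"},
    {"name": "3.10-slim", "pushed": "2022-10-12T08:30:00"},
]
print(pick_tag(tags))  # 3.11-slim
# The fix then rewrites `FROM python` into `FROM python:3.11-slim`.
```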

Fig. 3: Example of rule DL3006

Pin versions in package manager (DL3008) The version pinning smell also affects package managers for software dependencies and packages (e.g., apt, apk, pip). In this case, differently from the base image, the package version must be searched in the source repository of the installed packages. The smell regards the apt package manager, i.e., it might affect only Debian-based Docker images. For the fix, we consider only Ubuntu-based images since (i) we needed to select a specific distribution to handle versions (more on this later), and (ii) Ubuntu is the most widespread derivative of Debian in Docker images (Eng and Hindle 2021). The strategy we use to solve DL3008 works as follows: First, a parser finds the instruction lines containing the apt command and collects all the packages that need to be pinned. Next, for each package, the current latest version number is selected considering the OS distribution (e.g., Ubuntu, Xubuntu, etc.) and the distro series (e.g., 20.04 Focal Fossa or 14.04 Trusty Tahr). The series of the OS is particularly important because different series may offer different versions of the same package. For instance, the curl package has version 7.68.0-1ubuntu2.5 in the Focal Fossa series of Ubuntu, while in the Trusty Tahr series it is 7.35.0-1ubuntu2.20. So, if we try to use the former in a Dockerfile based on the Trusty Tahr series, the build most probably fails. The final step consists in testing the chosen package version. Generally, a package version adopts semantic versioning, characterized by a sequence of numbers in the format <MAJOR>.<MINOR>.<PATCH>. However, specific versions of the packages might disappear in time from the Ubuntu central repository, thus leading to errors while installing them.
Given that the PATCH release does not drastically change the functionalities of the package and that old patches frequently disappear, we replace it with the symbol ‘*’, indicating “any version,” so that the latest PATCH version is automatically selected. After that, a simulation of the apt-get install command with the pinned version is executed to verify that the selected package version is available. If it is, the package can be pinned with that version; otherwise, also the MINOR part of the version is replaced with the ‘*’ symbol. If the package still cannot be retrieved, we do not pin the package, i.e., we do not fix the smell: Pinning a different MAJOR version, indeed, could introduce compatibility issues, and the developer should be fully aware of such a change. An example of a fix generated through this strategy is reported in Fig. 4. It is worth noting that we apply our fixing heuristic only to packages with missing version pinning. This means that we do not update packages pinned with another version (e.g., older than the reference date used to fix the smell). Moreover, in some cases, developers might not want the pinned package version, but rather a different one (e.g., the latest), although the version we pin is most likely the closest one to the version they originally tested their Dockerfile with. We discuss those cases during the evaluation phase of the automated fixes via pull requests.
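The progressive relaxation described above can be sketched as follows; `installable` stands for the availability check performed by simulating the installation (e.g., apt-get install -s pkg=version), and the helper names are ours:

```python
def relax(version: str, level: int) -> str:
    """Wildcard the PATCH part (level 1) or also the MINOR part (level 2)
    of a <MAJOR>.<MINOR>.<PATCH> upstream version. The Debian revision
    (e.g., -1ubuntu2.5) is dropped once a wildcard is introduced."""
    upstream = version.partition("-")[0]
    parts = upstream.split(".")
    keep = {1: 2, 2: 1}[level]
    return ".".join(parts[:keep]) + ".*"

def pin(package, version, installable):
    """Try PATCH-relaxed, then MINOR-relaxed pinning; return the pin
    string, or None (no fix) if neither candidate is installable.
    The MAJOR part is never wildcarded, to avoid compatibility issues."""
    for level in (1, 2):
        candidate = relax(version, level)
        if installable(package, candidate):
            return f"{package}={candidate}"
    return None

print(relax("7.68.0-1ubuntu2.5", 1))  # 7.68.*
print(relax("7.68.0-1ubuntu2.5", 2))  # 7.*
print(pin("curl", "7.68.0-1ubuntu2.5", lambda p, v: v == "7.68.*"))
# curl=7.68.*
```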

Fig. 4: Example of rule DL3008

Fig. 5: Example of the pull request message. The placeholders (wrapped in curly braces) will be replaced with the corresponding values

4.2.2 Evaluation of Automated Fixes

To evaluate if the fixes generated by Dockleaner are helpful, we propose them to developers by submitting the patches on GitHub via pull requests. The first step is to select the most active repositories to increase the chance of receiving responses to our pull requests. To achieve this, we select a subset of repositories from our study context, ensuring that each repository (i) contains at least one Dockerfile affected by one or more smells that we can fix automatically (reported in Table 1), and (ii) has at least one pull request merged, along with commit activity, in the last three months. In this way, we select a total of 186 repositories containing 829 unique Dockerfiles affected by 5,403 smells. The next step is to associate each repository with a specific smell corresponding to a single Dockerfile to fix; this is to avoid flooding developers with pull requests.

We used a greedy algorithm to select the smell to fix in the Dockerfiles of the candidate repositories, ensuring that each smell is considered a balanced number of times. We start from the least occurring smells among all the available repositories, and we iteratively (i) select one target smell to fix, (ii) randomly select one candidate Dockerfile containing that smell, (iii) assign the repository to that smell, marking it as unavailable for the successive iterations, and (iv) increment a counter, for each smell, of the assigned Dockerfile candidates. The algorithm stops when there are no more repositories available. The counter of assigned smells is used, along with the overall smell occurrence, in the first step of the heuristic. This ensures that, at each iteration, we consider the smell that (i) has the lowest occurrence and (ii) is currently assigned for fixing to the lowest number of repositories. In this phase, we manually discard smells that cannot be fixed by Dockleaner. For example, for DL3008, we only support Ubuntu-based Dockerfiles, but the smell might also affect Debian-based ones. In total, we excluded 14 smells.
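A minimal sketch of the greedy assignment, under the simplifying assumption that each repository exposes a set of fixable smells (the random selection of one Dockerfile within a repository is omitted):

```python
from collections import Counter

def assign(repo_smells: dict, overall_occurrence: Counter) -> dict:
    """Greedy, balanced assignment of one smell per repository.

    repo_smells maps each candidate repository to the set of fixable
    smells in its Dockerfiles; at every step we pick the smell with the
    fewest assignments so far (ties broken by overall rarity), then a
    repository that contains it."""
    assigned_count = Counter()
    assignment = {}
    available = dict(repo_smells)
    while available:
        # Smells still present in at least one unassigned repository.
        open_smells = {s for smells in available.values() for s in smells}
        if not open_smells:
            break
        target = min(open_smells,
                     key=lambda s: (assigned_count[s],
                                    overall_occurrence[s]))
        repo = next(r for r, smells in available.items() if target in smells)
        assignment[repo] = target       # repo becomes unavailable
        assigned_count[target] += 1
        del available[repo]
    return assignment

repos = {"r1": {"DL3008", "DL3020"}, "r2": {"DL3008"}, "r3": {"DL3020"}}
occ = Counter({"DL3008": 100, "DL3020": 10})
print(assign(repos, occ))
# {'r1': 'DL3020', 'r2': 'DL3008', 'r3': 'DL3020'}
```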

At the end of this procedure, we followed the commonly used git workflow best practices for opening the pull requests. Specifically, we first created a fork of the target repository. Then, we created a branch whose name follows the format fix/dockerfile-smell-DLXXXX. Finally, we signed off the patches, as required by some repositories (as well as being a good practice), and we submitted the pull request. To do this, we defined and used a structured template for all the pull requests, as reported in Fig. 5. We manually modified the template in the cases where the repository requires custom-defined guidelines. The time required by Dockleaner to generate the fixing recommendations is only a few seconds for the simpler fixing procedures (e.g., replacing ADD with COPY). For the more complex ones, such as version pinning, it can take up to a few minutes.

For the evaluation, we adopted a methodology similar to the one used by Vassallo et al. (2020). In detail, we monitored the status of each pull request for more than 7 months (i.e., 218 days, starting from the creation date of the last pull request) to allow developers to evaluate it and give a response. We interacted with developers if they asked questions or requested additional information, but we did not modify the source code of the proposed fix unless the modifications were strictly related to the smell (e.g., when the fixing procedure of the smell was reported as not valid). We report such cases in the discussion section. At the end of the monitoring period, we tagged each pull request with one of the following states:

  • Ignored: The pull request does not receive a response;

  • Rejected/Closed: The pull request has been closed or is explicitly rejected;

  • Pending: The pull request has been discussed but is still open;

  • Accepted: The pull request is accepted to be merged but is not merged yet;

  • Merged: The proposed fix is in the main branch.

For each type of fixed smell, we report the number and percentage of accepted and rejected fix recommendations, along with the rationale in case of rejection and the response time. Also, we conducted a qualitative analysis of the developers’ interactions. In particular, we analyzed those where the pull request was rejected or pending to understand why the fix was not accepted. For example, the fix might have been rejected because the developers were not interested in applying that modification to their Dockerfile. Moreover, we analyzed the additional information that developers provided on rejected pull requests, from which we extracted takeaways useful for both practitioners and researchers. Using a card-sorting-inspired approach (Spencer 2009) performed by two of the authors on the obtained responses, we identified a set of categories that we used to classify the developers’ reactions to rejected pull requests.

Fig. 6 Occurrence over time for the top 10 Dockerfile smells

4.2.3 Data Availability

The code and data used in our study, along with the implementation of Dockleaner, can be found in the replication package (Rosa et al. 2024).

5 Analysis of the Results

In this section, we report the analysis of the results achieved in our study in order to answer our research questions.

5.1 RQ\(_{{\textbf {1}}}\): How do developers fix Dockerfile smells?

We report in Fig. 6 the trend of the 10 most occurring Dockerfile smells among the Dockerfile snapshots we analyzed. To plot this figure, we collected, for each year, all the unique Dockerfiles (based on their path and repository), then we extracted and counted all the smells in the latest version of each of them.

The most occurring smell is DL3006 – missing version pinning for the base image–, followed by DL3008 – missing version pinning for apt-get–, which is also the fastest growing one, and DL4000 – use of the deprecated MAINTAINER instruction. Since smell DL4000 became a bad practice in 2017Footnote 13, after the deprecation of the MAINTAINER instruction, we excluded its occurrences before that date from the plot.

Table 2 Summary of fixed Dockerfile smells, reporting the number of fixes (manually validated), median time to fix (in days), and the magnitude of changes performed in the repository until the smell has been fixed (median number of commits). Only smells with at least 5 manually validated fixes are reported
Fig. 7 Fixing trend over time for the 10 most fixed Dockerfile smells

In our manual validation, we found that 33.6% of the commits in which smells disappear actually fix smells. We report in Table 2 a summary of the characteristics of such commits for the smells for which we found at least 5 fixes (from a total of 572 fixed smells). In detail, we report the total number of fixing commits and the median fixing time, measured both in days and in the number of commits elapsed between the last commit introducing a smell and the smell-fixing commit. Additionally, we report in Fig. 8 the adjusted boxplots describing the number of days elapsed before each smell got fixed. We report in Fig. 7 the fixing trend over time for the 10 most fixed Dockerfile smells. Also in this case, we consider only the changes that we manually validated as smell-fixing commits. However, this time, we consider each smell fix separately: If a commit fixes 5 smells, we count it as 5 different fixes, one for each smell. The most fixed smell is DL3059 – multiple consecutive RUN instructions. It is worth noting that we found this fix \(\sim \)3 times more frequently than any other fix. This is because, when there are many consecutive RUN instructions, developers tend to fix all the occurrences of this issue in a single commit. Other common fixes are version pinning for base images (DL3006 and DL3007), along with DL4000 – use of the deprecated MAINTAINER instruction– and DL3020 – use of ADD instead of COPY.
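As an illustration (a hypothetical snippet of ours, not one taken from the studied repositories), a typical fix for DL3059 merges consecutive RUN instructions into a single one, reducing the number of image layers:

```dockerfile
# Smelly: two consecutive RUN instructions (DL3059), each creating a layer
RUN apt-get update
RUN apt-get install -y curl

# Fixed: a single RUN instruction, producing a single layer
RUN apt-get update && \
    apt-get install -y curl
```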

Fig. 8 Overall fixing time delta (days) among all Dockerfile smells

Fig. 9 Cumulative fixes over time interval (days) among all Dockerfile smells

We report in Fig. 9 the results of our survivability analysis of the smells by plotting the number of fixed smells over different time intervals (the time is on a logarithmic scale). Most of the fixes have been performed within 1 day (203 instances). This means that, when developers introduce Dockerfile smells, they tend to fix them immediately, during the first adoptions. On the other hand, if a smell survives the first day, it is less likely to get fixed later. In fact, according to Table 2, the smells that survive the least are DL3048 (invalid label key) and DL3042 (missing --no-cache-dir for pip install), which have been fixed in less than one day in most of the cases (100% and 60%, respectively). It is interesting to notice that two similar smells, i.e., DL3006 and DL3007, have largely different survivability: When the latest tag is explicitly used (DL3007) instead of being implied (DL3006), the smell survives \(\sim \)5 times longer (both in terms of days and commits, as reported in Table 2), even though the effect of the two variants is exactly the same.

We evaluated how many smell-fixing commits can be considered informed. We consider a fix informed when the developer explicitly mentions in the commit message that the aim of the fix is to remove a bad practice. We found that only 18 out of 336 manually validated fixes are informed. The smell most often explicitly addressed by developers is DL4000 – use of the deprecated MAINTAINER instruction– (fixed in 4 cases). An example can be found in commit 811582f, from the repository webbertakken/K8sSymfonyReactFootnote 14. Among the remaining ones, DL3025 – JSON notation for CMD and ENTRYPOINT– (4 cases) and DL3020 – use of ADD instead of COPY– (3 cases) are the smells developers are most aware of.

As for the non-informed cases, developers mainly report that the fix is aimed at (generically) improving the performance of the Dockerfile. Examples are the fixes for rule DL3059 explicitly performed to reduce the Docker image sizeFootnote 15 and the number of layersFootnote 16. In some cases, we found that developers use linters to detect bad practices. Among those, only one commit explicitly mentioned hadolintFootnote 17, while in other cases they mentioned the tool DevOps-Bash-toolsFootnote 18.

In the end, we can conclude that developers have limited knowledge of Dockerfile best practices in terms of the quality of the Dockerfile code: They are more interested in optimizing other non-functional aspects, such as the build time and the size of the Docker image.

Table 3 Opened pull requests and their resulting status sorted by number of accepted and merged PRs. The column Merged* reports the cumulative number of accepted patches (sum of accepted and merged)

5.2 RQ\(_{{\textbf {2}}}\): Which Dockerfile smells are developers willing to address?

In Table 3 we report the results of the evaluation performed via GitHub pull requests. In total, we submitted 143 pull requests. The majority of them (58%) have been accepted or merged by developers. On the other hand, 23% of them have been ignored, while 19% received an explicit rejection from the developers.

Fig. 10 Average resolution time (days) for merged pull requests (a) and rejected pull requests (b)

Fig. 11 Adjusted boxplot of the number of days required for a pull request to obtain a response (left) and to be merged/rejected (right)

The smells receiving the highest acceptance rate are DL4000 – use of the deprecated MAINTAINER instruction– (92%) and DL3020 – use of ADD instead of COPY– (71%), followed by rule DL3006 – missing version pinning for the base image– (69%). This is in line with what we reported for RQ\(_{{\textbf {1}}}\), where they resulted to be among the most fixed smells in the manually validated smell-fixing commits. This means that developers care about those smells: They frequently fix them, and they are also willing to accept fixes for them. The smell DL3008 – missing version pinning for apt-get– has been the most rejected fix (47% acceptance), with only 3 accepted pull requests, while DL4006 – use of pipefail for piped operations– has been the most ignored one (50%). The low acceptance rate (33%) for smell DL3009 (deletion of apt-get sources lists) is surprising, since developers are keen to reduce the image size, as we noticed in RQ\(_{{\textbf {1}}}\). We can conclude that, despite this, they prefer not to remove the apt-get source lists to achieve that goal.

In Fig. 11 we report the adjusted boxplots of the time required for pull requests to get the first response and to be resolved. Additionally, Fig. 10 reports the median resolution time, measured in days, of the submitted pull requests by smell type. For both figures, we only consider merged and rejected PRs, as they are the ones for which we have a definitive response from the developers. The smell DL3025 – JSON notation for CMD and ENTRYPOINT– is the one whose fixes have been accepted in the shortest time, followed by DL3006 – missing version pinning for the base image– and DL4000 – use of the deprecated MAINTAINER instruction. Although the fixes for DL3020 – use of ADD instead of COPY– are the second most accepted ones, they required a median of 5 days to get accepted and merged.

On the other hand, the fixes for DL4006 – use of pipefail for piped operations– have been rejected almost immediately by developers. This also happens for DL3008 – missing version pinning for apt-get.

Finally, we report in Table 4 the reasons why developers rejected our pull requests. We assigned one or more categories to each rejected change by analyzing the responses to the 27 rejected pull requests. Most frequently, the fix was considered invalid (22% of cases), meaning that the proposed change was not regarded as a valid improvement for the Dockerfile. In 11% of cases, the developers did not accept the change because they use the Dockerfile in testing or development environments.

The rejections of the fixes for DL3008 are interesting: In 19% of the cases, the changes have been rejected because they are not perceived as a concrete fix. Furthermore, the fixes for that smell have been rejected because they could negatively impact the security of the image (8% of cases) or cause a build failure in the future (4% of cases).

Table 4 Categories of reasons why developers rejected our pull requests

6 Discussion

Although the majority of the submitted pull requests got accepted, there are some specific smells that developers are not willing to address. Looking at Table 4, in 5 cases the fix was rejected because the container was used in a testing or development environment. An example is the fix proposed for DL3009Footnote 19: Even if the change can reduce the image size, it negatively impacts the image build time, and for that reason it has been rejected. The concern about build time probably comes from frequent builds performed for that specific Dockerfile. A different example is the pull request submitted to envoyproxy/ratelimitFootnote 20, which was rejected because the developers do not care about version pinning (DL3007): They use that Dockerfile for testing and need to test the latest version of the software. This is not the case for DL3006, where the tag is missing altogether; there, developers are more likely to accept the version pinning for the base image (see RQ\(_{{\textbf {1}}}\) and RQ\(_{{\textbf {2}}}\)).


DL3008 constitutes a peculiar case. Fixing this smell requires developers to pin the version of the apt-get packages to make the build more reproducible. Developers, however, believe that doing so might be misleadingFootnote 21 or make the build more fragileFootnote 22. Indeed, this happened for an accepted pull request, where, after a month, the version pinning for the package ca-certificates caused a build failure because the pinned version was not available anymoreFootnote 23. Moreover, the smell DL3008 led to interesting discussions. For example, one suggestion was to provide an automated script that periodically pins the package versions when there is an updateFootnote 24. For 3 of the proposed fixes, the developers additionally highlighted that they do not trust the change because it has been generated by an automated tool. This happened even though we specified that we manually checked the correctness of the change.
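A minimal sketch of such a fix (the version number below is purely illustrative) shows why pinning improves reproducibility but can make the build fragile:

```dockerfile
# Smelly (DL3008): unpinned apt-get packages, builds are not reproducible
RUN apt-get update && apt-get install -y ca-certificates

# Fixed: version pinned (illustrative version number); note that the build
# breaks if the pinned version is later removed from the package archive
RUN apt-get update && apt-get install -y ca-certificates=20230311ubuntu0.22.04.1
```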


In 6 cases, instead, developers did not perceive the change as correct or sufficient to fix the smell. This happens, for example, in commits 5531f2eFootnote 25 (DL3020) and 320ba87Footnote 26 (DL4006). An interesting discussion arose for the rejected fix of DL3003Footnote 27. The fix for that smell replaces \(\mathtt {``cd <path>''}\) with a WORKDIR instruction. However, in that particular case, fixing the smell required putting a WORKDIR instruction before the smelly code block and another one after it, to switch back to the previous working directory. This is because the target smelly code temporarily changes the working directory to operate on specific files. In other words, there are cases in which developers believe it is legitimate to change the working directory through \(\texttt{cd}\) (mostly, when this change is temporary). We report an example in Fig. 12, where the fix has been rejected because the change of the working directory is temporary.
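A minimal hypothetical example of this pattern (ours, not the rejected pull request itself) illustrates why a mechanical replacement can be rejected:

```dockerfile
# Original: the working-directory change is intentionally temporary
RUN cd /tmp/build && make install

# Mechanical DL3003 fix: two extra WORKDIR instructions (and layers)
# are needed just to switch into the directory and back again
WORKDIR /tmp/build
RUN make install
WORKDIR /
```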

Fig. 12 Example of a wrong fix for DL3003. In that case, the change of working directory is temporary, and the fix has been rejected

We conclude that, in similar cases, the detected smell is a false positive: The fix would increase the number of layers and introduce redundant instructions, negatively impacting the code quality of the Dockerfile.

Comparing the results from RQ\(_{{\textbf {1}}}\) and RQ\(_{{\textbf {2}}}\), we can conclude that there are no big differences between the fixes that developers applied themselves and the changes that we proposed via pull requests. The most frequently performed fixes, which also correspond to the most accepted pull requests, are those related to the deprecated MAINTAINER instruction (DL4000), version pinning for the base image (DL3006), and multiple consecutive RUN instructions (DL3059). There is a difference, however, in terms of the most fixed one: While in the wild developers tend to fix DL3059 more often, among our pull requests the most accepted fix is the one for DL4000. As also shown in RQ\(_{{\textbf {1}}}\), developers pay more attention to performance improvements than to code quality, whose current writing best practices they are not fully aware ofFootnote 28. In fact, DL4000 is purely related to writing best practices and does not affect performance. When faced with a ready-to-use fix, however, developers tend to prefer the ones that are least likely to disrupt the Dockerfile.

In general, developers pay more attention to the impact of the change on the build process and the image size than to its impact on the quality of the Dockerfile code. An example among the accepted pull requests is the fix proposed for the smell DL3015 (missing --no-install-recommends flag for apt-get)Footnote 29, where the developers explicitly asked us to fix another Dockerfile affected by the same smell, because the fix decreases the size of the built image.
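A minimal illustrative example of the DL3015 fix (the package name is arbitrary):

```dockerfile
# Smelly (DL3015): recommended packages are also installed, inflating the image
RUN apt-get update && apt-get install -y curl

# Fixed: skip recommended packages to reduce the built image size
RUN apt-get update && apt-get install -y --no-install-recommends curl
```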


Additionally, it is interesting to analyze more in depth the differences in the fixes performed for DL3048 (invalid label key) and DL4000 (MAINTAINER is deprecated, replace with LABEL). There are two possible ways to format Dockerfile labels. The first one follows the standard format defined by opencontainersFootnote 30, which is also suggested for DL4000 in the official Docker documentationFootnote 31. The second is more general and does not enforce a pre-defined format; it is reported in the hadolint documentationFootnote 32 and also appears in the official Docker documentation as an example of LABEL instructionsFootnote 33. The fixes that we analyzed in RQ\(_{{\textbf {1}}}\) that follow the first format are limited to a single repositoryFootnote 34; in the other cases, developers adopted the second formatFootnote 35. The fixes proposed via pull requests, instead, follow the second format, and for DL4000 we obtained the highest acceptance rate. This is probably because the second format is more general, avoiding unnecessary constraints and changes to the instructionsFootnote 36.
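The two label formats can be sketched as follows (the key and the maintainer value are illustrative):

```dockerfile
# Format 1: standard keys defined by the opencontainers annotations spec
LABEL org.opencontainers.image.authors="Jane Doe <jane@example.com>"

# Format 2: free-form key, no pre-defined format enforced
LABEL maintainer="Jane Doe <jane@example.com>"
```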

Moreover, while in this context the fix is still sufficient to correct the smell, in other contexts our fixing procedure might not be correct. The most evident case is the smell DL3059 (multiple consecutive RUN instructions). Open-source developers tend to fix it mainly by compacting the installation of software packagesFootnote 37. In our pull requests, instead, we merge all the consecutive RUN instructions until a comment or a different instruction is found. This suggests that a more complex and informed fixing procedure should be adopted to better improve the size and performance of Dockerfiles, taking into account the aspects that developers are interested in improving (image size and build time). To this aim, considering a scenario in which a debian base image is used, an advanced approach to fix smell DL3059 could be a heuristic that (i) selects all the instructions aimed at installing dependencies, (ii) extracts the list of such dependencies, taking into account whether they require external source lists, and (iii) combines all those installations into a single instruction at the top of the Dockerfile. In this way, the re-build time is reduced thanks to the layer caching system and, at the same time, the image size is reduced, since there are fewer layers and less space wastage (e.g., package cache). For smell DL3003, instead, an advanced fixing approach should target the bash code to correct the usage of the \(\texttt{cd}\) pattern, rather than blindly replacing it with WORKDIR. In the example reported in Fig. 12, the smell could be fixed by using absolute paths instead of relative paths in the commands. While this can be done in that case, there are other scenarios in which it could be detrimental; for example, if a custom script writes its output files in the current directory, it is still necessary to use cd before running it. Thus, such a fixing procedure should be applied only to specific bash instruction patterns (like the previously mentioned one).
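A hypothetical before/after sketch of the advanced DL3059 heuristic described above, on a debian-based image (package names are illustrative):

```dockerfile
# Before: dependency installations scattered across the Dockerfile,
# each in its own layer and interleaved with other instructions
RUN apt-get update && apt-get install -y git
COPY . /app
RUN apt-get install -y build-essential

# After: all dependency installations combined in a single RUN at the top,
# so the layer is cached across re-builds and the package cache is
# cleaned up once, reducing both re-build time and image size
RUN apt-get update && \
    apt-get install -y --no-install-recommends git build-essential && \
    rm -rf /var/lib/apt/lists/*
COPY . /app
```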


7 Threats to validity

Construct validity The threats to construct validity concern the non-measurable variables of our study. More specifically, our study heavily relies on the rule violations detected by hadolint. Other tools are able to detect bad practices in Dockerfiles, such as dockleFootnote 38. We chose hadolint because it is commonly used both in the literature (Cito et al. 2017; Wu et al. 2020; Lin et al. 2022; Eng and Hindle 2021) and in enterprise tools for code qualityFootnote 39. However, hadolint could report false positives or miss some smellsFootnote 40. The manual evaluation we performed on the smell-fixing commits validated both the identified smells and those that have been removed. During that evaluation, we noticed that hadolint mainly fails to detect rule DL3059 (consecutive RUN instructions). To reduce the impact of this threat on our study, we manually annotated the lines in which the smell was present.

Internal validity The threats to internal validity concern the design choices we made that could affect the results of the study. In detail, we used as study context a sample of repositories extracted from the dataset provided by Eng and Hindle (2021), considering only those having a stargazers count greater than or equal to 10; this filter is commonly used in the literature to avoid toy projects (Dabic et al. 2021). There can be a bias in the smells selected for our fix recommendations: We selected the most occurring smells as described in the analysis of Eng and Hindle (2021), assuming that an automated approach would have the biggest impact on the smells that occur more frequently. Also, at least for some of them, the reason why they do not get fixed might be that fixing them is not trivial (i.e., an automated tool would be helpful). The fixing procedure for some of the selected smells could be wrong, and some smells might not get fixed. We based the fixing rules on the Docker best practices and on the hadolint documentation. Still, to minimize this risk, we double-checked the modifications before submitting the pull requests and manually excluded those that make the build of the Dockerfile fail. Thus, we ensured the correctness of the fixes generated by Dockleaner and submitted via the pull requests for the cases evaluated in our study. However, it is still possible that the tool produces wrong fixes for other Dockerfiles. For example, the version pinning fixes could fail when the package is not reachable (i.e., DL3008) or the Docker image digest is not available on DockerHub (i.e., for smells DL3006 and DL3007). It is worth noting, indeed, that our aim is not to evaluate the tool, but rather to understand whether developers are willing to accept fixes.
Moreover, there is a possible subjectiveness introduced by the manual validation of the smell-fixing commits, which has been mitigated by involving two of the authors and discussing the conflicts. It is also worth mentioning that the two evaluators have more than 3 years of experience with Dockerfile development and Docker technology in general, giving them a good understanding of the smells and the applied fixes. Finally, we performed the selection of the last-smell-introducing commits by using the git blame command on the smelly lines identified by hadolint. Since hadolint can fail to detect some smells, in some cases the lines impacted by the fix differ from the ones identified by hadolint. This means that we got some false positives while identifying the last-smell-introducing commits. Since our results show that Dockerfiles are not frequently changed, we believe that the impact of this threat is limited.

External validity External validity threats concern the generalizability of our results. In our study, we considered a sample of repositories from GitHub containing only open-source Dockerfiles. This means that our findings might not generalize to other contexts (e.g., industrial projects), where developers could handle smells in a different way.

8 Conclusion

In the last few years, containerization technologies have had a significant impact on the deployment workflow. Best practice violations, namely Dockerfile smells, are widespread in Dockerfiles (Cito et al. 2017; Wu et al. 2020; Lin et al. 2022; Eng and Hindle 2021). In our empirical study, we evaluated the survivability of Dockerfile smells by analyzing the most fixed smells in open-source projects. We found that Dockerfile smells are widely diffused, but developers are becoming more aware of them, especially of those whose fix results in a performance improvement. In addition, we evaluated to what extent developers are willing to accept fixes for the most common smells, automatically generated by a rule-based tool. We found that developers are willing to accept the fixes for the most commonly occurring smells, but they are less likely to accept fixes for smells related to the version pinning of OS packages. To the best of our knowledge, this is the first in-depth analysis focused on the fixing of Dockerfile smells. We also provide several lessons learned that could guide future research in this field and help practitioners in handling Dockerfile smells.