AI Art in Architecture
https://doi.org/10.1007/s43503-023-00018-y
Joern Ploennigs¹* and Markus Berger¹
Abstract
Recent diffusion-based AI art platforms can create impressive images from simple text descriptions. This makes them
powerful tools for concept design in any discipline that requires creativity in visual design tasks. This is also true
for early stages of architectural design with multiple stages of ideation, sketching and modelling. In this paper, we
investigate how applicable diffusion-based models already are to these tasks. We research the applicability of the plat-
forms Midjourney, DALL·E 2 and Stable Diffusion to a series of common use cases in architectural design to determine
which are already solvable or might soon be. Our novel contributions are: (i) a comparison of the capabilities of public
AI art platforms; (ii) a specification of the requirements for AI art platforms in supporting common use cases in civil
engineering and architecture; (iii) an analysis of 85 million Midjourney queries with Natural Language Processing
(NLP) methods to extract common usage patterns. From this we derived (iv) a workflow for creating images for inte-
rior designs and (v) a workflow for creating views for exterior design that combines the strengths of the individual
platforms.
Keywords: Image generation, Diffusion models, Natural language processing, Architecture
*Correspondence: Joern Ploennigs, [email protected]
¹ AI for Sustainable Construction, University of Rostock, Rostock, Germany
1.2 State of the art in generative methods
The potential benefits of AI art platforms for creative work are hard to overstate. These AI art platforms all use generative machine learning models, specifically text-to-image generative models. Despite their specialization on generating images, many are based on the natural language model GPT-3, which is trained to generate text that completes a textual input query (Brown et al., 2020). More precisely, it predicts the next possible combinations of words for an input text. The specific Image GPT-3 models used by AI art platforms are instead trained to predict the next cluster of pixels, called a patch, in an image for a given input text.

The most recent generation of generative models combines natural language and so-called diffusion models. The idea for diffusion models was first proposed by Sohl-Dickstein et al. (2015): (structured) image information is slowly destroyed through a forward diffusion process that introduces noise into the image data, and then generated anew through a reversed diffusion process. This reverse process generates completely new image data, as the original information was fully destroyed by the noise. This approach was constantly improved over the years, with a strong focus on optimizing the underlying neural network architectures (Ho et al., 2020), resulting in several variants, like OpenAI's GLIDE model (Nichol et al., 2022). It consists of an encoder that creates a text encoding based on the user prompt, a model implementing the diffusion based on this text encoding, as well as an upsampler that upscales and denoises the result. Current diffusion models often implement the process of text encoding and the association of those text encodings with image parts with the CLIP (Contrastive Language Image Pre-training) architecture presented by Radford et al. (2021) and used by DALL·E 1.
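For reference, the two processes can be written as a minimal sketch in the standard notation of Ho et al. (2020), where x_t is the noisy image at step t, β_t the noise schedule, and θ the learned network parameters:

```latex
% Forward diffusion: each step adds Gaussian noise to the image x_{t-1}
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right)

% Reverse diffusion: a trained network predicts how to denoise x_t,
% generating new image data step by step from pure noise
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```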
One of the currently most advanced incarnations of the technology is DALL·E 2 (from here on simply referred to as DALL·E), based on the unCLIP method developed by Ramesh et al. (2022). It encodes both text and images into a diffusion-based joint representation space (the prior). The image generation is done by a similarly trained decoder, which translates the prior's encoding back into an image. Another main platform in the field is the open-source model Stable Diffusion, which is also based on the CLIP text encoder. The third contender, Midjourney, has not published its models, but it is assumed to use a similar structure.

This kind of diffusion model architecture can solve various image generation tasks. If a completely new image is generated from a user-written text prompt, a text-to-image model (txt2img) is used. If an existing image is modified based on a text prompt, an image-to-image model (img2img) is used, which changes the style or arrangement of the image based on the text prompt. If a certain part of the original image was deleted, the model can replace it with entirely new content based on the prompt; this approach is called inpainting by Lugmayr et al. (2022). A similar approach is outpainting, also called uncropping by Saharia et al. (2022), which adds additional content outside the image. If a user requests changes to the original image without manually deleting or masking out parts, this is called image editing. Image editing is not yet available in commercial AI art platforms, but there is recent work on single-image editing through text prompts, for example by Kawar et al. (2022). If another diffusion step is applied to add more detail to the image at a higher resolution, the image is upscaled, a method also called super-resolution by Saharia et al. (2022). Platforms often offer several or all of these methods, with configurable weights between the individual images and words.
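These operations map directly onto the pipeline classes of the open-source diffusers library for Stable Diffusion. The following is a minimal sketch; the model identifiers, the prompt, and the all-black placeholder mask are illustrative assumptions, not settings from this paper:

```python
import torch
from PIL import Image
from diffusers import (
    StableDiffusionPipeline,          # txt2img
    StableDiffusionImg2ImgPipeline,   # img2img
    StableDiffusionInpaintPipeline,   # inpainting
)

device = "cuda" if torch.cuda.is_available() else "cpu"
prompt = "modern single-family home, exterior view, photorealistic"

# txt2img: generate a completely new image from the text prompt
txt2img = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1").to(device)
image = txt2img(prompt).images[0]

# img2img: modify the existing image based on the prompt;
# `strength` configures the weight between input image and prompt
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1").to(device)
variant = img2img(prompt=prompt, image=image, strength=0.6).images[0]

# inpainting: regenerate only the white regions of the mask
# (an all-black mask is a placeholder that keeps the whole image)
mask = Image.new("L", variant.size, 0)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting").to(device)
result = inpaint(prompt=prompt, image=variant, mask_image=mask).images[0]
```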
Assessing the quality of all these models and architectures systematically is difficult. Attempts at quantitative evaluation are being made, such as Borji (2022). However, in such cases it is difficult to evaluate subjective metrics like style, i.e., whether an oil painting style looks better than a photorealistic one.

As for research into use cases from architecture, Seneviratne et al. (2022) describe a systematic grammar for using DALL·E to generate images in the context of urban planning. They open-sourced 11,000 images generated with Stable Diffusion and 1,000 created with DALL·E using that grammar. They found that, although many realistic images could be generated, the model has weaknesses in creating real-world scenes with a high level of detail. But these models advance quickly, and DALL·E 2 outperforms DALL·E 1 significantly.

Recent progress was also made in generating video (Ho et al., 2022), 3D models via point clouds (Luo & Hu, 2021; Zhou et al., 2021; Zeng et al., 2022), and even 3D animation data (Tevet et al., 2022). While this is beyond the scope of this paper, the generation of 3D models in particular will be revolutionary for architecture. The 3D workflows would be similar to the image-based workflows we present.

2 Model Architectures and Interfaces
Although generative art has been a research area for years, it only entered public perception with the advent of publicly available diffusion model platforms like DALL·E, Midjourney or Stable Diffusion, which combine txt2img, img2img, inpainting and upscaling into easy-to-use workflows.
Fig. 1 Model architecture and image generation process in different models. Grey elements show the AI workflow, coloured elements the user interaction
These models are not only competing on the technical level, but also in terms of user experience. Midjourney² directly interacts with its community by sharing queries across public (or private) channels in the Discord messaging app. It does not provide a dedicated user interface, but simply returns the generated images as a chat response to the query. Direct interactions are possible with attached links that usually result in new queries. In contrast, DALL·E 2³ is only accessible to individuals through a dedicated web-based user interface with authoring and editing tools. It provides a simple query interface without additional query parameters. Stable Diffusion, on the other hand, is released as open source, which fosters a plethora of community-created tools that are usually used more than the official web interface⁴.

The internal model architecture as well as the interface paradigm influence how these models can be utilized. In Fig. 1 we illustrate some of the similarities and differences between DALL·E 2, Stable Diffusion (v2.1), and Midjourney (v4). The core workflows of the foundational models, shown in grey, are similar across technologies, as stated before. Differences lie in the workflows that they provide, which often result from their different interface approaches.

It is apparent in Fig. 1 that while the core workflow in grey is similar, the ways to refine the results vastly differ. Only Midjourney offers successive upscaling of resolution and does so with multiple different sizes. Thus, Midjourney's workflow focuses on generating and comparing different image variants and then upsampling the best results in multiple ways, with limited possibilities to remix the original text prompt throughout.

In contrast, both DALL·E and Stable Diffusion allow direct editing of uploaded images or previous results. They do not provide traditional image editing tools like drawing, filling, layering, or stamping. Instead, all image editing must be done by img2img-based operations like erasing sections (inpainting) or extending the canvas (outpainting). All networks have ways to create images of different sizes and aspect ratios, either by specifying the size in the query or by altering it later through outpainting.

Notably, image generation only takes a few tens of seconds in all models, making it fast enough to use in creative sessions alone or with clients. All three models also allow importing external images in some capacity. Therefore, they can easily be combined into composite workflows, as shown in Sect. 5.2.

One aspect not included in Fig. 1 is the training data. There is little information on the training data used for DALL·E and Midjourney. However, Stable Diffusion was trained on the LAION-5B dataset (Rombach et al., 2022), which is based on image and text data scraped from the web. Similar internet datasets are very likely the source for Midjourney and DALL·E. It is evident that, whether biased by the training data or the training process, these models have developed very different image styles. DALL·E and Stable Diffusion are good at generating both drawn images and photorealistic outputs. Midjourney tends towards a more artistic style, especially in earlier model versions. But with carefully chosen keywords, most styles can now be targeted by all models.

² https://midjourney.com
³ https://openai.com/dall-e-2
⁴ https://beta.dreamstudio.ai
Table 1 Top part: comparison of the platforms (DALL·E 2, Midjourney v4, Stable Diffusion v2.1) and their supported features of txt2img, img2img, in-/outpainting, editing, upscaling, and semantics, rated as full, limited, bad, or no support. Lower part: mapping of the architectural use cases of Sect. 3 to these features, rated by the importance of each feature (high, some, low, or no importance) versus how well it works (well, somewhat, a little, not at all)
3 Architectural use cases
Given the discussed differences between the platforms, they vary in the architectural use cases that they support. To analyse this, we collected a series of use cases where architects and planners normally create or edit images. A direct comparison between the platforms and the use cases is often complicated, because the style and quality of the results heavily depend on the input prompts. We identified that the more differentiating factor is whether a platform supports a specific technical feature that is required to realize the use case.

Therefore, we evaluated for each use case which features it requires and how well it is qualitatively supported by each platform. In addition to the image operations explained in the previous chapter, we also consider support for architectural semantics, i.e., structured knowledge that goes beyond common image training datasets. Table 1 shows the results for the use cases that we discuss below:

• Ideation: Developing ideas by randomly generating images for inspiration. This is what txt2img models are made for, and it works splendidly. Additional image prompts can add style and object references.
• Sketches: Drawing architectural sketches with a specific target style and items. This works well for common examples in the training data, but less so for specific requests.
• Collages: Combining and filling existing images with life by adding people and objects. This can be done through inpainting for individual items, but not through generic requests like "Add many people".
• Image combination: Taking multiple existing image elements (for example multiple buildings), arranging them on a canvas and then creating a coherent composite image.
• Build variants: Taking an existing sketch or picture and generating versions in which certain elements are altered (like adding a garage). This works well through inpainting.
• Style variants: Taking an existing image and transforming its style (e.g., a sketch to photorealistic art deco) without changing content. This works well with certain models.
• Construction plans: Creating detailed layout plans to establish spatial relations. This rarely works, as the models do not understand the semantics of line styles, areas, etc.
• Exterior design: Finding a style and feeling for a building and the surrounding area/landscape. This works well for common scenarios.
• Interior design: Finding a style or feeling for an interior space. This also works well in many scenarios.
• Creating textures: Creating tiled patterns to serve as surface materials for 2D or 3D models. This is currently a unique feature of Midjourney.

Even though Stable Diffusion seems to support fewer features than DALL·E 2, its main advantage cannot be overstated: it is possible to run it locally and train it on one's own data to introduce, e.g., new architectural concepts. Together with the high quality of Stable Diffusion's outputs, this makes it the most potent of the three models to be specialized on architectural drawings.
4 Analysis of architectural queries
We also explored how people use these AI art platforms in practice. We analysed about 85 million user queries that we collected from Midjourney over one year, starting Jan. 30th, 2022. It is the only platform for which many user queries are publicly visible. Midjourney uses the Discord messaging app as its main interface, which allowed us to monitor the public channels for queries that we consider to be of an architectural nature. We selected queries containing either the word "architect", "interior" or "exterior" design, or one of 38 architectural keywords like "building", "facade", or "construction" (listed in Fig. 2(b) and in footnote 5). We identified these keywords by selecting only those terms from architectural glossaries⁶,⁷ that co-occurred with "architect", "interior" or "exterior" in at least 10% of all cases in the queries. We also added to the list of keywords the names of 941 famous architects from Wikipedia⁸, as we noted that several queries refer to their style by naming them. By applying these filters, we identified 5.7 million queries (6.7%) with potential architectural intent, including 2.2 million queries (2.6%) explicitly containing "architect", "interior" or "exterior" design.

⁵ Keywords not listed in Fig. 2(b) due to low count: balcon, basilica, battlement, buttress, gable, hvac, latticework, livingroom, minaret, panelling, pavilion, plinth, rotunda, spire.
⁶ https://www.heritage.nf.ca/articles/society/architectural-terms.php
⁷ https://en.wikipedia.org/wiki/Glossary_of_architecture
⁸ https://en.wikipedia.org/wiki/List_of_architects

In the next step we filter out stopwords (e.g. "a", "and", "the") and build a Word2Vec model (Mikolov et al., 2013) from these queries to obtain a model of the occurrence and co-occurrence of terms. For understanding the results, it is important to know that most Midjourney users do not formulate full sentences; rather, a prompt is a collection of terms that refer to the content, style, or render quality of the targeted image. A minimal sketch of this pipeline is shown below.
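The sketch uses the gensim library; the input file name, the truncated keyword set, and all thresholds are illustrative placeholders, since the query dataset itself is not public:

```python
import re
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import STOPWORDS

# Placeholder input: one Midjourney prompt per line (the study used ~85M)
queries = open("midjourney_queries.txt", encoding="utf-8").read().splitlines()
keywords = {"architect", "interior", "exterior", "building", "facade"}

def tokenize(query):
    # lowercase, split on non-letter characters, drop stopwords like "a", "the"
    return [t for t in re.split(r"[^a-z]+", query.lower())
            if t and t not in STOPWORDS]

# Keep only queries with potential architectural intent
tokenized = [tokenize(q) for q in queries]
filtered = [toks for toks in tokenized if keywords & set(toks)]

# Model of term occurrence and co-occurrence (Mikolov et al., 2013)
model = Word2Vec(sentences=filtered, vector_size=100, window=5, min_count=10)

# Most probable co-located terms per keyword, e.g. for the keyword links
# in Fig. 2(d) or an auto-complete function for architectural queries
print(model.wv.most_similar("interior", topn=5))
```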
Figure 2 visualizes the main results of our analysis. Figure 2(a) shows the most frequent terms used in the filtered queries. The colour blue represents the frequency across all 85 million queries, red the frequency within the filtered 5.7 million queries, and green the frequency within the 2.2 million queries explicitly containing "architect", "interior" or "exterior" design. Note that red, green, and blue are mutually inclusive and overlap. The top 10 words have a similar frequency across all three classes. Many of those refer to Midjourney style commands like "detailed", "realistic", or "cinematic". However, some terms like "black", "creative" or "full" have a high overall frequency, but a low frequency in the architectural context. Other terms like "architecture", "interior" or "house" occur exclusively within our filtered results, as they are part of our keyword list.

Figure 2(b) shows our extended keyword list and the keywords' respective frequencies. As we filter on these keywords, their total frequency is identical to the filtered one, and we do not display it. It is of note that "architecture" and "interior" are the most and third most frequent keywords. It is also notable that "interior" is six times more popular than "exterior" as a keyword, but this may simply be because users refer to exterior design implicitly through "architecture".

Figure 2(c) shows the frequency of famous architects, which we extracted from 430,333 queries that referred to at least one of them. Zaha Hadid is by far the most frequently queried architect, given her well-recognizable parametric style that is popular in the community of people experimenting with AI tools. Michelangelo and William Morris are second and third. The low red bar shows that they are usually used not in an architectural context but for their art contributions. Adrian Smith is also often used in other contexts, probably referring to the musician. The architects Frank Lloyd Wright, Tadao Ando, Frank Gehry, Antoni Gaudí, Lebbeus Woods, and Peter Zumthor complete the top 10 and are often used in an explicitly architectural context, given the strong red bar.

Figure 2(d) shows the links between keywords and their most likely connected terms. We analysed this by predicting with the Word2Vec model, for each keyword on the left, the most probably co-located word on the right, weighted by probability. Interesting combinations here are the links interior-design, floor-plan-drawing, architecture-visualization, cathedral-gothic, or swimming-pool. From this it is possible to build an auto-complete function for architectural queries.

Figure 2(e) shows the mean length of queries depending on whether they were upscaled, remastered, or left in draft mode. A draft mode image is of low image size, so users will normally upscale or remaster it if they like one of the variants. It is notable that for the medium upscale options as well as for the remastered version, the mean query length increases to above 35 terms per query, in comparison to 30 terms for draft mode queries. We also manually classified the 150 most frequent terms into the categories style, content, and quality. It is notable that for the upscaled and refined queries, the percentage of style terms increases significantly from 6.6% to 8.3%.

Figure 2(f) shows the mean number of iterations needed to develop a query.
We classify a query as an iteration if the same user reruns the same or an extended query within 30 min. 54% of all unique queries (single, 852k) are run only once. The other half of the queries are improved in multiple iterations. Queries that remain in draft mode require about 3.7 steps. The 7.8% of queries that are good enough to be upscaled require about 6.75 steps in total; they are upscaled after 4.1 draft steps into different variants (light, medium/beta, max). The 5.3% of queries that are remastered take about 5.1 iterations; they have only 1.4 draft mode queries, but 1.2 remastering steps and 2.3 final upscale steps.
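A minimal sketch of this iteration grouping is given below; the record layout of (user, timestamp, prompt) tuples sorted by user and time is an assumption for illustration, while the 30-minute window and the "same or extended" rule follow the classification above:

```python
from datetime import timedelta

def iteration_chains(records, window=timedelta(minutes=30)):
    """Group queries into iteration chains: a query continues a chain if the
    same user reruns the same or an extended prompt within the time window.
    `records` is assumed to be (user, timestamp, prompt), sorted by user, time."""
    chains, last = [], {}
    for user, ts, prompt in records:
        prev = last.get(user)
        if prev and ts - prev[0] <= window and prompt.startswith(prev[1]):
            last[user] = (ts, prompt, prev[2] + 1)  # extend the current chain
        else:
            if prev:
                chains.append(prev[2])  # close the previous chain
            last[user] = (ts, prompt, 1)  # start a new chain
    chains.extend(length for _, _, length in last.values())
    return chains  # chain lengths; length 1 = query run only once
```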
This analysis illustrates that users do not come up with perfect queries from scratch. We can derive multiple insights:

• 6.7% (5.7 million) of all queries are related to architecture;
• "architecture" is the most popular keyword, and "interior" is six times more popular than "exterior" as a keyword;
• only 13.1% of all unique queries are upscaled or remastered;
• these queries are usually refined in more than 4 iterations on average;
• these queries usually contain more keywords specifying style and quality.

5 Case studies
In this section, we present some refined workflows for architects that utilize the strengths of all three AI art platforms. These workflows are based on the learnings of our analysis and on a large number of experiments to identify the workflows leading to the best results. It should be noted that Midjourney v3 is in use here, which raises the importance of the remastering step; remastering can usually be skipped as of model version 4. As we identified in the analysis, users rarely run only a single query and gain a perfect result; instead, they iterate many times. This requires a good understanding of effective workflows to avoid dead ends and come to adequate results quickly.
5.1 Interior design—comparing the models
First, we will look at how the models perform purely on their own. The example is an interior design scenario. We start with a simple query without any platform-specific commands. On all three platforms, we try to generate a high-quality rendering of a room using the prompt "cozy living room, wood paneling, television, large sofa, natural light, lived in, realistic, full view". We developed this prompt by observing common prompting patterns in the Midjourney query data and testing different iterations across the models until we arrived at this final version. Each comma is considered a topic separator by the AI art platforms. Thus, we are asking for an image of a (i) cozy living room that (ii) has wood paneling, (iii) contains a TV, (iv) a large sofa, etc. With this we ensure that the resulting image should contain similar elements across platforms.

Midjourney starts out with several results that are stylized or of strange perspective, visible in Fig. 3(a, b). The first upscale of the chosen interior design in (c) greatly improves material quality and overall detail, but the central sofa remains an incoherent form in the centre of the image. Anytime the normal image output does not attain a sufficient level of cohesion and realism, we can invoke the "remaster" step, shown upscaled in (d). However, even this last result contains smaller perspective errors, which are difficult to fix without intervention through manual image editing.

The first DALL·E 2 results in Fig. 3(e) show that it struggles to correctly respond to the "realistic" term in the query. Once one of the two realistic variants is picked, the next variant generation step in (f) creates more useful results. To remove perspective or coherence errors, we mask out certain areas of the image in (g) to generate inpaint variants. The final variant is shown in (h). Of note is that even the inpainted regions react correctly to the previously established lighting of the scene. This result is of good quality, but cannot be upscaled any further within the web interface.

Fig. 3 Minimal workflow for Midjourney (a–d), DALL·E 2 (e–h), and Stable Diffusion (i–l) for the given query: (a) Midjourney original query, (b) variant selection, (c) upscale, (d) remaster and maximum upscale; (e) DALL·E original query, (f) variant selection, (g) erase tool to remove artifacts, (h) final inpaint with added "table"; (i) Stable Diffusion original query, (j) resize selection and erase artifacts, (k) variants of in-/outpainting, (l) final selection
Stable Diffusion starts out in Fig. 3(i) with much stronger results than its two contenders, generating images that incorporate the "realistic" and "lived in" aspects of the query very well. These rooms look inhabited and not like artificial renderings. However, all images seem like close-ups of a proper interior view, which is why we do not just use inpainting in (j) to fix errors, but also add additional canvas space for outpainting. This results in some quite incoherent variants for the outpainted areas. After some additional iterations, the variant in (l) was selected as the best.

Overall, Stable Diffusion performs best in this scenario. All of its first variants were realistic and showed no real deficiencies in the later steps. Only during the final outpainting stage was it necessary to manually smooth the transitions. Midjourney generated a final result of similar quality, but offered a weaker beginning selection, with not all images containing all prompted elements (e.g. a missing TV) and with frequent perspective errors and other visual faults.

5.2 Exterior design—combining the models
The best results can be achieved by combining the strengths of all three models and knowing their specific command keywords. For example, one of the first hints that DALL·E shows new users is that the keyword "digital art" can significantly improve many prompts, which is deeply related to the data it was trained on.

In the case of Midjourney, refinement starts as a process of including and removing certain phrases within the prompt to get as close to a desired style as we can. These phrases can be very convoluted. It often helps to include the kinds of modelling software that would create the desired kind of image (like "octane render" or "cinema3D"), or even quality signifiers like "top 10 on artstation", in the prompt. A much more directed way to influence queries is the use of word weights, image references and parameters, which add additional parameterization to the prompt system.

In contrast, DALL·E and Stable Diffusion provide more control over changes with their in- and outpainting tools. Once we understand these tools and the strengths of the platforms, we can combine them into a more flexible workflow. In the following ideation workflow, we will start with a Midjourney prompt to create a desired scene. While Stable Diffusion tends to generate better looking first results, Midjourney is a strong contender once the remaster step is done, and its workflow excels at free-form ideation, which makes it the most appropriate starting point for an exterior design. From the Midjourney remaster stage, we will then refine any errors or undesired results through in-/outpainting in DALL·E and Stable Diffusion to attain the targeted result. An overview of the workflow is provided in Fig. 4, and we will explain it along the results in Fig. 5. The workflow highlights how the different image editing steps can be combined to get the best results independent of the AI tool used; the specific models most appropriate for each step may change with newer versions. The logic behind the workflow targets the best image quality by: (1) generating the image; (2) upscaling it; (3) extending the canvas; (4) finally editing details with inpainting.
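A compressed sketch of these four steps as a single script follows, using Stable Diffusion pipelines from the diffusers library as stand-ins for all three platforms (in the actual workflow, steps (1)–(2) run in Midjourney and steps (3)–(4) in DALL·E or Stable Diffusion); the model identifiers, sizes, and border mask are illustrative assumptions:

```python
from diffusers import (
    StableDiffusionPipeline,
    StableDiffusionUpscalePipeline,
    StableDiffusionInpaintPipeline,
)
from PIL import Image

prompt = ("single-family home with garden, full exterior view, "
          "modern architecture, photo, sunlight")

# (1) Generate the initial image (the role Midjourney plays in Fig. 4)
base = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
image = base(prompt).images[0].resize((256, 256))

# (2) Upscale it with a super-resolution diffusion step (x4)
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler")
image = upscaler(prompt=prompt, image=image).images[0]

# (3) Extend the canvas for outpainting: paste the image onto a wider
# canvas and mark only the new border regions for regeneration
canvas = Image.new("RGB", (image.width + 512, image.height), "black")
canvas.paste(image, (256, 0))
mask = Image.new("L", canvas.size, 255)               # white = regenerate
mask.paste(Image.new("L", image.size, 0), (256, 0))   # black = keep

# (4) Edit details with inpainting over the masked border regions
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting")
result = inpaint(prompt=prompt, image=canvas, mask_image=mask).images[0]
```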
Figure 5(a–c) shows the beginning stages of an idea as generated in Midjourney. The query that led to this particular result was "single-family home with garden, full exterior view, modern architecture, photo, sunlight". Multiple keyword arrangements and weights on different terms were tried before this result was selected, remastered and then upscaled. It is also possible to start with one or more reference images, though that technique was omitted here.

The result was then uploaded to DALL·E, where it was outpainted in Fig. 5(d) to create a wider viewing angle and subsequently inpainted in Fig. 5(e, f) to replace unwanted details like the cables hanging in the air or the differently colored windows.

After unsuccessful attempts to add a paved path from the sidewalk to the entrance through inpainting in DALL·E, we transferred image (f) to Stable Diffusion.

Fig. 4 The proposed combined workflow over Midjourney, DALL·E and (optionally) Stable Diffusion
Fig. 5 Refinement and variant generation in Midjourney (a–c), DALL·E 2 (d–f), and Stable Diffusion for a walkway (g) and a second story (h, i): (a) Midjourney original query, (b) variants, (c) remastered and upscaled; (g) Stable Diffusion inpainted, (h) 2nd story variant 1, (i) 2nd story variant 2
(a) “architectural floor plan of a (b) “red brick building facade (c) “a drawing of a map of five
modern single-family home, with a dutch clock gable and a buildings arranged around a
ground floor” bow window” park, realistic, detailed,
top-down, site plan, digital
rendering”
Fig. 6 Example failure cases
themselves. The specific architectural element "bow window" was turned into a window with a bow over it. The "clock gable" was represented by a different gable element with a clock below it. Case (c) shows a query for a landscape design with five buildings, which invokes the common problem that these AI tools are simply bad at counting and at complex spatial arrangements beyond foreground and background.

6 Conclusion
In this paper we looked at AI art generation tools and their applicability to architecture and civil engineering. We compared three of the currently available AI art platforms and identified use cases that can be tackled now or are soon to be unlocked. To understand how users are already using these tools, we analyzed millions of queries, providing some insights into how users iterate. Finally, we presented two workflows, for interior and exterior design, with the latter combining the strengths of the different platforms.

The various use cases shown in this paper illustrate the strong potential of AI tools in architecture. The AI platforms still struggle with more complex prompts, usually due to missing semantic understanding of the image content. A floor plan, for example, is not just a collection of lines: these lines carry semantic and contextual information, such as that they form walls enclosing a room with a door to get in. As we already have Building Information Models (BIM) that provide this semantic information, it is just a matter of time until new diffusion models arrive that are trained specifically on these data sets, and it is likely that they will be able to fulfil all requirements from Table 1.

Nonetheless, the high number of 5.7 million queries with architectural context that we identified shows that the tools are already being adopted. In the coming months and years these platforms will further improve. We can already see workflows across tools that converge toward Fig. 4. In the end, single platforms will deliver the full workflows for use cases like ideation, collages, and build and style variants, which will drastically improve productivity and creativity. It is thus likely that these tools will first be adopted for brainstorming sessions with clients and for competitions. We observed that the designs created by current AI tools tend toward organic forms, openwork facades and complex arrangements that break out of the common minimalistic modern design. With the ongoing research on automated evaluation of structural dynamics and in the field of additive and robotic construction technologies, more and more of these designs are becoming structurally and financially possible. This may form a perfect storm leading to a new generation of architecture styles based on AI-generated designs.

Author contributions
JP: Ideation and manuscript, Midjourney study, writing Sects. 1, 3, 4. MB: Use case experiments, writing Sects. 2, 3, 5. Both authors reviewed and edited all sections as well as read and approved the final manuscript.

Funding
No funding was received for conducting this study.

Data availability
The Midjourney datasets generated and analysed during the current study are not publicly available for data privacy reasons. Code for retrieving the data can be made available from the corresponding author on reasonable request.

Declarations

Competing interests
The authors have no financial or proprietary interests in any material discussed in this article.
References
Borji, A. (2022). Generated faces in the wild: Quantitative comparison of Stable Diffusion, Midjourney and DALL-E 2. arXiv preprint http://arxiv.org/abs/2210.00586
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & Agarwal, S. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., & Fleet, D. J. (2022). Video diffusion models. ICLR Workshop on Deep Generative Models for Highly Structured Data.
Kawar, B., Zada, S., Lang, O., Tov, O., Chang, H., Dekel, T., Mosseri, I., & Irani, M. (2022). Imagic: Text-based real image editing with diffusion models. arXiv preprint http://arxiv.org/abs/2210.09276
Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., & Van Gool, L. (2022). RePaint: Inpainting using denoising diffusion probabilistic models. CVPR (pp. 11461–11471).
Luo, S., & Hu, W. (2021). Diffusion probabilistic models for 3D point cloud generation. CVPR (pp. 2837–2845).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (Vol. 26).
Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., & Chen, M. (2022). GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. ICML.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., & Krueger, G. (2021). Learning transferable visual models from natural language supervision. ICML (pp. 8748–8763).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint http://arxiv.org/abs/2204.06125
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. CVPR (pp. 10684–10695).
Saharia, C., Chan, W., Chang, H., Lee, C., Ho, J., Salimans, T., Fleet, D., & Norouzi, M. (2022). Palette: Image-to-image diffusion models. ACM SIGGRAPH (pp. 1–10).
Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., & Norouzi, M. (2022). Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Seneviratne, S., Senanayake, D., Rasnayaka, S., Vidanaarachchi, R., & Thompson, J. (2022). DALLE-URBAN: Capturing the urban design expertise of large text to image transformers. International Conference on Digital Image Computing: Techniques and Applications.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. ICML (pp. 2256–2265).
Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2022). Human motion diffusion model. arXiv preprint http://arxiv.org/abs/2209.14916
Zeng, X., Vahdat, A., Williams, F., Gojcic, Z., Litany, O., Fidler, S., & Kreis, K. (2022). LION: Latent point diffusion models for 3D shape generation.
Zhou, L., Du, Y., & Wu, J. (2021). 3D shape generation and completion through point-voxel diffusion. CVPR (pp. 5826–5835).