0% found this document useful (0 votes)

17 views38 pages

Towards AI-Architecture Liberty: A Comprehensive Survey On Design and Generation of Virtual Architecture by Deep Learning

This survey explores the integration of deep learning techniques in the design and generation of virtual architecture, highlighting the collaboration between designers and AI. It reviews 149 articles, identifies current challenges in virtual architectural design, and proposes four research agendas to enhance designer-AI interaction. The work aims to broaden access to virtual architecture creation, advocating for interdisciplinary efforts to improve content generation in immersive environments.

Uploaded by

bhanump777

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

17 views38 pages

Towards AI-Architecture Liberty: A Comprehensive Survey On Design and Generation of Virtual Architecture by Deep Learning

Uploaded by

bhanump777

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

Towards AI-Architecture Liberty: A Comprehensive Survey on Design and

Generation of Virtual Architecture by Deep Learning

ANQI WANG, Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong SAR
and Computational Media and Arts, Hong Kong University of Science and Technology (Guangzhou), China
JIAHUA DONG, The Chinese University of Hong Kong, Hong Kong SAR
LIK-HANG LEE, The Hong Kong Polytechnic University, Hong Kong SAR
JIACHUAN SHEN, The Bartlett School of Architecture, University College London, UK
PAN HUI, Computational Media and Arts, Hong Kong University of Science and Technology (Guangzhou), China
arXiv:2305.00510v4 [cs.HC] 18 Jul 2024

and Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong SAR

3D shape generation techniques leveraging deep learning have garnered significant interest from both the computer vision and
architectural design communities, promising to enrich the content in the virtual environment. However, research on virtual architectural
design remains limited, particularly regarding designer-AI collaboration and deep learning-assisted design. In our survey, we reviewed
149 related articles (81.2% of articles published between 2019 and 2023) covering architectural design, 3D shape techniques, and virtual
environments. Through scrutinizing the literature, we first identify the principles of virtual architecture and illuminate its current
production challenges, including datasets, multimodality, design intuition, and generative frameworks. We then introduce the latest
approaches to designing and generating virtual buildings leveraging 3D shape generation and summarize four characteristics of various
approaches to virtual architecture. Based on our analysis, we expound on four research agendas, including agency, communication,
user consideration, and integrating tools. Additionally, we highlight four important enablers of ubiquitous interaction with immersive
systems in deep learning-assisted architectural generation. Our work contributes to fostering understanding between designers and
deep learning techniques, broadening access to designer-AI collaboration. We advocate for interdisciplinary efforts to address this
timely research topic, facilitating content designing and generation in the virtual environment.

CCS Concepts: • Human-centered computing → Interaction design process and methods; Virtual reality; • Computing
methodologies → Machine learning; • Applied computing → Architecture (buildings).

Additional Key Words and Phrases: AI-assisted architectural design, 3D geometry generation, designer-AI collaboration, AIGC

ACM Reference Format:

Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui. 2018. Towards AI-Architecture Liberty: A Comprehensive
Survey on Design and Generation of Virtual Architecture by Deep Learning. J. ACM 37, 4, Article 111 (August 2018), 38 pages.
https://doi.org/XXXXXXX.XXXXXXX

Authors’ addresses: Anqi Wang, Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong SAR and Computational
Media and Arts, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China; Jiahua Dong, The Chinese University of Hong Kong,
Hong Kong SAR; Lik-Hang Lee, The Hong Kong Polytechnic University, Hong Kong SAR; Jiachuan Shen, The Bartlett School of Architecture, University
College London, London, UK; Pan Hui, Computational Media and Arts, Hong Kong University of Science and Technology (Guangzhou), Guangzhou,
China and Emerging Interdisciplinary Areas, Hong Kong University of Science and Technology, Hong Kong SAR.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2018 Association for Computing Machinery.
Manuscript submitted to ACM

Manuscript submitted to ACM 1

2 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Fig. 1. Virtual architecture provides rich use affordances, benefiting various aspects, including users, communities, and organizations.
(a) Hub Rooms in VRChat; (b) Simulated city scene with various buildings; (c) Interior of individual house in a virtual world
game, Second Life; (d) Saint Motel Fan Meet on Mozzila hub; (e) Sci-fi nightclub; (f) Center building in pride month event held on
Dencentraland; (g) Realism Sotheby’s virtual gallery; (h) The interactive digital experience in a virtual building for the brand Jose
Cuervo designed by Rojkind Arquitectos; (i) The office building of Vice Media Group was designed in metaverse by BIG Architect; (j)
"NFTism" is a virtual NFT gallery developed collaboratively by Zaha Hadid Architects (ZHA) and Journee. The architectural design
focuses on user experience, social interaction, and "dramaturgical" compositions, supporting MMO (massively multiplayer online)
technology and integrating audio-video interaction1 ; (k) The virtual city designed and developed by ZHA, filled with fluid style
virtual architecture; (l) Each piece of land and each item in the virtual land as a non-fungible token could trade in the virtual world of
Decentraland; (m) A mapping showing ownership of non-fungible token land in Sandbox; (n) A user-friendly tool for editing and
creating buildings and land in social VR, Mozzila hub.

1 INTRODUCTION
In recent decades, the exploration of architectural space has shifted from reinforced concrete to digital architecture
and further to information frameworks, with virtualization extending beyond the physical layer [15, 22]. Architecture
is no longer perceived as static, permanent objects but as integral components of a data network and evolving
communication between different architectural systems [31, 157]. Digital innovations and technological advancements
associated with spline, pixels, voxels, and bits have redefined architectural forms called digital architecture [82]. In
such context, two powerful forces have driven the birth of generated virtual architecture: AI-assisted design and
virtual reality (VR) technique. In recent decades, AI architecture has become a critical sub-genre of architectural
discourse advocated by scholars and practitioners [169], utilizing deep learning neural networks to concern social issues.
Moreover, the burgeoning innovates and techniques of virtual environments (e.g. virtual scenes with virtual reality (VR),
augmented reality (AR), immersive videogames, and social VR) have expanded the boundaries of production and design
in digital fabrication buildings, such as physical-virtual merging [30, 98, 102], bio-signal feedback [43, 70, 141, 142],
and participatory design in the virtual environment [88, 96]. Thus, deep learning facilitates the design and generation
of diverse virtual architectures with interactive features, supporting the promising vision of ubiquitous intelligent
environments in VR. It simulates real-world scenarios by generating and utilizing 3D building datasets, aligning design
objectives with human needs [199].
The designers creating virtual buildings with AI include broader users beyond conventional expert designers,
encompassing individuals with varying levels of expertise in problem understanding, subject matter, and machine
learning [184]. This innovative designer-AI collaboration, employed by diverse designers, could bring multi-dimensional
benefits to the virtual environment. Firstly, designers could utilize AI to create virtual buildings for efficient and
large-scale production, thereby expanding virtual environment scenarios with a variety of content [80]. Secondly,
designers could prioritize user experiences by considering spatial, social, and physical aspects of human-building

1Areport by Archdaily, source: https://www.archdaily.com/972886/zaha-hadid-architects-presents-virtual-gallery-exploring-architecture-nfts-and-the-

metaverse
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 3

interaction (HBI), such as evolving spatial forms and adaptive architectural elements [5]. Thirdly, designers could
generate virtual buildings that offer social sustainability and economic benefits by embedding computational frameworks
or workflows automatically [80], aligning with the development goals of the virtual environment. These benefits include
time and resource savings, as well as fostering continuous renewal and evolutionary capabilities. Additionally, deep
learning-generated virtual architecture provides convenient access for both experts and inexperienced users in content
production through designer-AI collaboration, contributing to the content ecosystem in the virtual environment. One
concrete example is that users could create a house or edit land using a user-friendly tool in social VR (Fig. 1n).
Deep learning’s powerful capability enables the fusion of virtual and physical realities, facilitating diverse lifestyles
and a wide range of user, community, and organizational uses (as shown in Fig. 1). Virtual buildings could enhance user
presence by immersing themselves in vivid, realistic scenes of the virtual environment [80] (Fig. 1a-c). Virtual buildings
also serve as expansive communication interfaces and containers, fostering social activities and goals in immersive
environments [157]. For instance, users require virtual buildings to hold various activities, events, and communities,
such as fan meets (Fig. 1d), club parties (Fig. 1e), and Pride Month parades (Fig. 1f). Organizations, including companies,
cultural institutions, and charity organizations, also require virtual buildings to maintain urban business premises
(Fig. 1g-i). Notably, renowned architectural firms have begun designing virtual cities and buildings in virtual worlds
(Fig. 1h-k). Furthermore, the interactive nature of virtual architecture facilitates diverse social systems. It delivers
sophisticated information regarding social interactive elements in multiple scenarios, as demonstrated by ZAHA’s
advanced virtual architectural projects incorporating interactive technology and architectural design (Fig. 1h-j). Lastly,
the virtual architecture enables trading, giveaways, and auctions of virtual land as non-fungible tokens (Fig. 1l-m),
enriching the metaverse 2 ecosystem in the future.
The early adoption of emerging technology often encounters accessibility issues, and knowledge from one field
may not be easily transferred to another [35]. In architecture, design research primarily focuses on leveraging neural
networks, such as StyleGAN and Pix2Pix, to generate images like floor plans [23, 24], exterior facades [48, 118], and
stylized images [36, 38]. However, there is a lack of effort in collaborating with deep generative models (DGMs) for direct
3D geometry generation. This oversight results in a deficiency of diversified design philosophies, efficient production
logic, and precise geometric control in current deep learning-generated architectural designs. Despite proposals for
computational and algorithmic approaches (e.g., [80, 97]), design dimensions of virtual architecture have yet to be
systematically explored. This gap impedes the advancement of virtual buildings within virtual environments. Our survey
aims to bridge this gap by investigating deep-learning-based 3D shape generation methods for virtual architecture. We
emphasize the importance of considering designer-AI collaboration perspectives, particularly inclusivity, to cater to
non-tech-savvy architects and layperson users seeking to create virtual buildings in the virtual environment.

1.1 Preamble: 3D Virtual Architecture Generated by Deep Learning

Generated virtual architecture is a vast domain spread over deep learning, computer graphics, and design. Therefore, it
is necessary to define these terms at the beginning of this survey.
Deep learning. Deep learning (DL), a subclass of machine learning (ML) and artificial intelligence (AI), has developed
rapidly with a boost in data process and computation. They rely on paradigms of unsupervised learning. As signature
deep learning architectures, neural networks such as ANN, CNN, and RNN have played essential roles in manipulating
the relationship between the input and output data. The definition of deep learning signifies the system masters the

2 The metaverse refers to a hypothetical, computer-generated world that integrates with reality through various techniques and concepts [99].
Manuscript submitted to ACM
4 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Fig. 2. Applications of deep learning impact our lives in all aspects. (a) Self-drive cars; (b) AlfaGO; (c) Segmentation in the city
recognition with computer vision; (d) Mirror World NFT, which shows AI dialogue character with personality and development
from learning; (e) OpenCV recognizing the object types in the camera view; (f) Apple watch paired with deep learning detect atrial
fibrillation with 97 % accuracy; (g) ChatGPT developed by OpenAI; (h) Recommendation system in the Tiktok; (i) Smart agriculture
implemented by deep learning with drones; (j) DALLE-2, one powerful painting tool empowered by machine learning; (k) D.O.U.G, a
collaborative robotic arms interacted with human, learning human behaviors and gestures, performance and created by artist Soug
Wen; (l) digital human body generation by 3D reconstruction technique; (m) An AI art movie created by GANs (Casey Reas); (n) BCI
(Brain-computer interface).

capability of self-learning and experience enhancement [156]. Deep learning has broad applications across various
domains of life. There are plenty of notable examples. Such as the first fully automatic self-driving car, Navlab5 (Fig. 2a);
Alpha GO, a computer program that can beat top human Go players (Fig. 2b); Mirror World NFT’s intelligent character 3
that can learn and grow up from human text conversations (Fig. 2d); ChatGPT 4 developed by OpenAI [135], the ever
first intelligent conversational machine model; One of the best recommendation systems in the world (Fig. 2h), which
made TikTok stand out from 13.47 million DAUs 5 ; DALLE-2, which was commented as the ever-best AI painting tool
(Fig. 2j); and the infinite potential BCI (Brain-computer interface) (Fig. 2n); Moreover, the intelligence revolution could
not be ongoing without DL, for instance, smart agriculture (Fig. 2i), and smart transportation. Computer Vision relies on
deep learning closely, such as notable OpenCV (Fig. 2e), segmentation with vision cognition (Fig. 2c), and 3D scanning
techniques (Fig. 2l). Additionally, loads of contemporary digital art were created through deep learning by inputting
and processing the data of human gestures and bio-signals (Fig. 2k). Figure 1m represents a cutting-edge example of
experimental AI-art films created using GAN.
3D Shape Generation Technique. It refers to creating digital representations of objects in three spatial dimensions
utilizing deep learning models (DGMs), which is a foundational task in computer graphics. Various DGMs like Generative
Adversarial Networks (GANs), Variational Autoencoders (VAEs), Flow models, and Diffusion models (DPMs) are
employed, each with distinct technical attributes. For example, 3D-GAN employs probability spaces for generating
3D objects with voxel grid representation [179]. 3D shape generation inspires downstream tasks such as object
classification, part segmentation, and scene semantic parsing [57]. Moreover, its various innovative approaches, such
as descriptive prompts [60, 131, 145], model optimization [149], multi-modal interaction [52, 145], and integrating
tools [145], collaboratively foster production and applications in design fields representing 3D models as output.
Deep Learning-Assisted Architectural Form Generation. It utilizes neural networks to generate building forms, integrat-
ing design features and corresponding purposes [199]. Its distinctive use of deep generative models (DGMs) sets it apart
from other areas in deep learning-assisted architectural design [199], such as environmental performance analysis,
and AI-assisted outcome evaluation. This approach addresses architectural, urban, and environmental challenges by
3 Mirror Wolrd’s official websites: https://link3.to/mirrorworld
4 Introducing ChatGPT, source: https://openai.com/blog/chatgpt
5 A report on TikTok, souce: https://www.statista.com/statistics/1090659/tiktok-dau-worldwide-android/

Manuscript submitted to ACM

A Survey on Deep Learning for Design and Generation of Virtual Architecture 5

domain in computer domain in computer-

vision aided design

Architecture design
generated by DGMs
DL-assited
3D generation
architectural
with DGMs
design
Virtual architecture design

3D generative model
considering virtual
environment
Rules of the
virtual world

(a) (b)

Fig. 3. (a) This survey investigates the intersection of various areas. (b) A profile of the number of cited works categorized by different
years and topics. The survey’s scope and profile of related articles: a – architectural studies on DGMs; c – computer vision studies; v –
works on rules in VWs.

employing 2D images or 3D form generation techniques. Research in this domain encompasses various aspects, such
as generative plans or sections based on spatial features [8, 150, 186, 192, 193], style transfer for preserving cultural
heritage preservation [105, 150], and the generation of diverse architectural typologies [63].

1.2 Methodology and Related Articles

As for the abovementioned problems, the intersection of 3D generation techniques and virtual architecture is still nearly
blank. Therefore, we anchored the three fields to conduct this survey to complement the key insights with each other:
3D shape generation techniques, DL-assisted architectural design, and the design considerations in VWs (see Fig. 3a).
This article collects data using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA)
framework [139].
Identification. We reviewed a sample of 149 articles and primarily focused on works published between 2019 and
2023 (five years, 81.2%) as follows: 2024: 2 (1.3%), 2023: 14 (9.4%), 2022: 25 (16.8%), 2021: 34 (22.8%), 2020: 25 (16.8%),
2019: 23 (15.4%), before 2018: 26 (17.5%) from mentioned three fields (see Fig. 3b). We found the articles primarily
through publication databases such as ACM Digital Library, IEEE Xplore, ScienceDirect, Springer Link, and CuminCAD.
These databases ensure a wide range of computer graphics, virtual environments, and architectural design scope. We
determined our search query, including the primary and second terms as searching keywords, after discussion with
inner researchers through multiple sessions of iteration and refinements. These keywords were inspired by previous
review papers. The primary keywords in the search query are straightforward, reflecting our survey topic, as follows:
"artificial intelligence (AI)" OR "machine learning (ML)" OR "deep learning (DL)" OR "Generative Adversarial Network
(GAN)" OR "3D-aware synthesis" OR "Variational Autoencoder (VAE)", "Diffusion Model (DPM)"
AND "architectural design" OR "virtual building" OR "virtual environment" OR "virtual rules" OR "design discipline"
OR "building generation" OR "virtual reality (VR)" OR "augmented reality (AR)" OR "metaverse"
We conduct these search queries aligning in publication title, abstract and keyword. To ensure the subtle and
comprehensive searching results in design fields, the second keywords include relating architectural design terms (i.e.,
AI architecture, design generation, digital architecture, form finding) or technical terms (i.e.,text-to-3D, multimodality,
semantic interpretation), as follows:
"text-to-3D" OR "image-to-3D" OR "image-to-form" OR "language-based" OR "semantics" OR "NLP" OR "multimodal"
OR "3D form" OR "form finding" OR "generative model" OR "neural network" OR "Style Transfer"
Manuscript submitted to ACM
6 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

AND "computation" OR "human-computer-interaction (HCI)" OR "evaluation metric" OR "perception" OR "bio-signal"

OR "emotion" OR "participatory design" OR "aesthetics" OR "real-time interaction"
The final database search was conducted by January 2023.
Screening, Eligibility, and Inclusion. We screened titles and keywords to ensure the inclusion of only full papers and
extended abstracts, applying the following exclusion criteria (EC):

EC1 Excluded articles solely discussing real-world architectural design aspects (e.g., natural environment, BIM,
construction, floor plans).
EC2 Excluded articles not primarily focused on 3D design generation (e.g., analysis of design cognition, limited 2D
generative drawing without 3D transition approaches).
EC3 Excluded articles primarily discussing generative approaches other than deep learning for architecture (e.g.,
reinforcement learning, genetic algorithms). Only articles discussing deep generative models for building
generation were considered.
EC4 Excluded articles discussing broad architecture design topics (e.g., the architecture of algorithms).
EC5 Excluded articles related to 3D generation focusing on the classification and retrieval of computer vision
approaches.

When the keywords and abstracts did not appear as the key information or elements within our investigative scope,
we read the entire publication to determine its inclusion. After screening, we selected 111 articles and 19 CV research
papers published on arXiv, totaling 130 authoritative articles. Additionally, we conducted direct searches through the
Google search engine, concluding with 19 articles and 2 relevant architectural projects from the perspectives of virtual
worlds, computational architecture, and architectural theory. Ultimately, this survey includes a total of 149 articles and
2 architectural projects.
Various other surveys further locate this scope, as follows: Category 1 (machine learning [95, 143] or deep learning [1,
4, 6, 6, 66, 87, 95, 146, 155]): 3D shape generation [21, 137, 165], scene synthesis [182, 195], applications [34, 49],
3D representation [63], 3D reconstruction from 2D [188], and generative models [3, 32, 33, 57, 57, 76, 140, 177];
Category 2 (artificial intelligence in the architectural design): machine learning-assisted architectural design [143, 171,
199], GAN-assisted architectural design [127], computer-aided architectural design [169], historical development [24],
infrastructure [171], intelligent construction [4, 10, 71]; and Category 3 (architecture in a virtual environment or
virtual worlds): design disciplinary in virtual reality (theories and applications) [13], human perspective in the virtual
space [70, 98], human-building interaction (HBI) [5], and metaverse or virtual worlds [14, 47, 99]. In contrast, this article
reviews the deep learning approaches in the virtual discipline of architecture regarding virtual environment rules,
techniques, design principles, and HCI methods. We advocate for an interdisciplinary approach that integrates these
research areas. Our survey article uniquely considers the prominent features of the above categories in this topic and
further paves the way for AI-architecture integration in virtual environments with the contributions below.

(1) We provide a comprehensive investigation for the inclusion of DL-assisted architectural generation and deep
generative models dedicated to developing a critical lens for computational architecture in virtual environments.
(2) We highlight four characteristics of various generation approaches to improving technical understanding and
prompting design applications by elaborating on their capabilities and challenges from the whole picture.
(3) We propose research agendas and future directions for generated virtual architecture, considering virtual
disciplines, deep learning, and architectural design, thus call for interdisciplinary designer-AI collaboration
research.
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 7

1.3 Scope and Structure

While the intersection of deep learning and architecture encompasses various aspects, including design cognition
(e.g., feature extraction, image recognition, design evaluation), design generation (e.g., floor plans, rendering, form
generation), and assisting tools (e.g., efficiency optimization, strategy prediction), our investigation focuses specifically
on articles where deep learning is applied to the generation of 3D virtual architecture, particularly utilizing 3D Deep
Generative Models (DGMs). In this context, two key scope notions warrant emphasis: Firstly, we underscore 3D
DGMs due to their potential to efficiently generate designs aligned with diverse goals, thereby enriching content
and affordances in the virtual environment. We also include architectural research that generates 2D images with 3D
transitions rather than stopping at 2D-style imaginary drawings. Secondly, our scope is confined to the generation
task, distinguishing it from similar reconstruction and completion tasks, as generation involves actively creating 3D
shapes using deep learning. This decision is informed by the prevalent application of generative tasks in architectural
design, closely aligning with human cognition. Furthermore, in line with the virtual context of our inquiry, we prioritize
articles where virtual environments can substitute physical ones rather than solely addressing real-world problems like
Building Information Modeling (BIM).
The paper begins by reviewing the current problem space in this field, which encompasses the rules of the virtual world,
3D generation techniques, and related architectural design works in Section 2. In the overview of related works, we also
emphasize four key focuses derived from the challenges identified in current literature, namely dataset, multimodality,
design intuition, and generative framework. Section 3 delves into innovative form generation in architecture, exploring
generation approaches concerning 3D form transposition and 3D solid form generation. This section discusses four
topics: GAN targeting specific training, VAE for specific information extraction, 3D-aware image synthesis, and diffusion
model based on conditional text. Subsequently, Section 4 provides a summary of the various characteristics of design
generation approaches from different DGMs and design purposes. Finally, we formulate four grand challenges and three
future directions by revisiting the key focuses from the perspective of HCI in Section 5. These challenges encompass
agency and communication between design and technique, user consideration, and adaptive tools. We emphasize that
innovative methods for automatically generated virtual architecture should prioritize human and social factors.

2 OVERVIEW OF GENERATED VIRTUAL ARCHITECTURE

We first define the virtual architecture and its design discipline to clarify our survey and key focus. Then, we briefly
introduce deep generative models and 3D representations. Last, we review existing literature on generated virtual
architecture.

2.1 Virtual Architecture and Design Discipline

Virtual worlds are enduring, computer-generated environments existing online where users interact with others in
remote physical locations in real-time, for both work or leisure [47]. Research on the virtual world has surged in recent
decades, encompassing theory, concept, and technology development as the early stage of the metaverse [47, 99, 126, 157].
Two layers form a complete, accessible, and livable metaverse. Firstly, software and hardware architecture technically
defines spatial functionality, constraints, and social interaction, which form the politics of virtual worlds [100]. Secondly,
immersion, presence, and interactivity as three pillars in virtual worlds further shape user experience [121], constituting
the essence of simulating real-life experiences and fostering self-presence, community social activities, and user
ownership.
Manuscript submitted to ACM
8 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

(a) (b) (c) (d) (e) (f)

Fig. 4. Virtual architecture projects exhibit various forms: (a) Joris Putteneers’ "Synesthesia" 6 (2016) creates a surreal and complex
architectural construction through algorithmic simulation in Houdini; (b) "E-motion" 7 (2020) by Fei Chen et al. features a digital
interface allowing real-time data interaction for rethinking co-living among species; (c) "ISOS" by Viviane Toraci Fiorella, Taza Celilia,
and Prandini Alvaro Campo in the "Volumetric Cinema" workshop by Current.CAM (2022) explores the interaction between virtual
avatars and dramatic environments; (d) Tane Moleta and Mizuho Nishioka’s collaborative project (2021), "Populating Virtual Worlds
Together," leads to an autonomous virtual space; (e) Utilizing BCI to capture affective-driven dynamic noise, Barsan et al. (2020)
demonstrate 3D volumetric architecture in virtual environments [12]; (f) Current.CAM’s VR gallery (2021) features continuous
partitioned spaces with fluid blue, reinforcing the shaping of digital interfaces on the human senses.

2.1.1 Virtual Architecture: Mission and Scope. Virtual architecture is a spatial instance within the virtual world. It
is characterized by interactive features related to social attributions and technology frameworks, including realism,
ubiquity, interoperability, and scalability [47]. The concept of virtual architecture encompasses various buildings and
scenes with immersive experiences, such as video games and social VR.
Our scrutiny reveals an overlapping scope between virtual architecture and human-building interaction (HBI),
referring to a consistent goal between the social acceptability of metaverse [99] and sustainability of HBI [5]. Specifically,
HBI focuses on understanding and facilitating human interaction, adaptation, and emotional connection within the built
environment [5], whose sustainability highly echos the social acceptability of the metaverse. Furthermore, researchers
have built a close correlation between these two notions relying on user perception in immersive virtual environments
aiming at HBI goal [68]. Additionally, according to conceptualizing HBI by Alavi et al., the built environment comprises
social, spatial, and physical aspects [5]. This work also emphasizes that significant technologies driving HBI include
ubiquitous computing, interactive architecture components, and sustainability. Notably, previous HBI surveys emphasize
the inherent attributes of buildings, distinguishing them from traditional HCI [5, 125]. Therefore, the HBI perspective
offers a new approach to understanding the multifunctional nature of virtual buildings [5], regarding encapsulating
ubiquitous computing, human-centric interaction, and sustainability goals. We outline these essentials to define the
mission and scope of virtual architecture in Table 1 (Appendix).

2.1.2 Design Discipline. Virtual architecture gives rise to new design disciplines distinct from traditional architecture
that align with the logic of virtual worlds. The below paragraphs elaborate on the design considerations, including
building form, production, and construction methods, as well as the pivotal role of deep learning approaches within
these design disciplines.

(1) Design Consideration. Deep generative models for architectural generation could easily replace the physical
context in current research with a virtual one in the following two aspects. First, virtual logic (rendering,
computation, graphics, society codes) replaces physical context (i.e., environment, material, economic cost),
where users emphasize social gathering activities with unrestricted time and space. Second, virtual logic takes
into account more aesthetic, cultural, and human-centered intentions than physical architecture [12]. Virtual

6 Source: https://putteneersjoris.xyz/projects/synesthesia/synesthesia.html
7 Source: https://bproautumn2020.bartlettarchucl.com/rc18/e-motion
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 9

Fig. 5. The frameworks of GANs, VAEs, and Diffusion Models.

architecture can take this grand responsibility, utilizing deep generative models to incorporate wide definitions
and creativity affordances.
(2) Building Form. First, deep generative models could form various formats, from solid mesh to the discrete
point cloud, to support virtual architecture with flexible and inclusive representations. Generation and virtual
techniques enable the construction and rendering of cutting-edge buildings in the virtual world by replying to
proper datasets, such as Paper Architecture 8 , and bionic architecture 9 , which previously not be implemented in
reality. Second, leveraging deep generative models could form a diversified and rich shared online space since
this technique not only incorporates the vast and various datasets contributed by users but also flexibly adjusts
its framework or workflow. Space geometry has been expanded to space intelligence, representing the generative
logic of forms transformatively. For instance, space form could be embedded in technical capabilities as assistance,
such as algorithmic fluid form (Fig. 4a, 4f), spatial data with point cloud (Fig. 4b), interactive narrative (Fig. 4c),
collaborative co-constructing (Fig. 4d), and creative voxel girds output (Fig. 4e).
(3) Production Mode. Undoubtedly, deep learning with 3D generation capability facilitates novel production modes.
First, the construction mode has changed into modeling form from 3D space with modeling software assisted
instead of the traditional floor plan of function layout in architectural design. Second, the process of designing
and building virtual architecture has changed. ZHVR group illustrates this new construction process as five
phases: concept design, developed design, technical design, construction, and post-construction 10 . Meanwhile,
the stakeholders have changed in this process; architects and programmers have become architects and engineers.
By highlighting these key points, we craft a virtual architecture guideline urging a broad understanding grounded in
knowledge and expertise for virtual buildings.

2.2 Deep Generative Models and 3D Representations

We briefly overview the progression of deep generative models for 3D representation, including 3D shape generation
and 3D aware image generation.

2.2.1 3D Shape Generation. Traditional deep generative models contribute to 3D shape generation alongside well-
known models such as Generative Adversarial Networks (GANs) and variational autoencoders (VAEs). Other models
like normalizing flows (N-Flows), recent diffusion probabilistic models (DDPMs), and energy-based models (EBMs)
have also made significant contributions by learning from the similarity of given data. These deep generative models
8 Visionary architecture that could not be built in reality, only as drawings, collages, or models.
9 Bionic architecture is usually computationally adapted to the structure or form of organic matter in nature. Design considerations for biomimetic
architecture include living organisms’ physiological, behavioral, and structural adaptations.
10 Reference the diagram named Redefinition of AEC project ecosystem by ZHVR

Manuscript submitted to ACM

10 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Fig. 6. The figure illustrates various 3D representations commonly used in generated architecture: (a) Voxel grids, (b) Meshes, (c)
Point clouds, and (d) Neural fields.

generate a tangible 3D object ready for rendering. It conveys a latent variable to a high-quality image. Although each
model has benefits and has shown remarkable progress in recent years, the architecture domain primarily relies on
GANs. In contrast, VAEs and the latest diffusion models are designed for a few research. Considering the relevance, we
thus introduce the GANs, VAEs, and diffusion models in detail rather than an exhaustive list of models included in
other CV survey articles (Fig. 5).
GANs. GANs are a type of semi-supervised learning relying on the noise value. The GANs rely on machine learning
algorithms to construct two neural networks: one is a generator, and another is a discriminator. It trains a large database
employing a zero-sum game between two neural networks to generate agnostic creative results.
Variational Autoencoders. Variational autoencoders are probabilistic generative models that employ neural networks,
utilizing both encoders and decoders alongside inputs and outputs. The latent space of a VAE learns data features and
simplifies data representations to aid model training for a specific purpose. During training, regularization ensures
that the latent space of a VAE possesses acceptable qualities and can generate new data [91]. Furthermore, the term
"variational" originates from the tight connection between regularization and the variational inference technique used
in statistical analysis.
Diffusion Models. By modeling the dispersion of data points in latent space, we can uncover the underlying structure
of image or volumetric datasets, exemplified by Denoising Diffusion Probabilistic Models [69]. This entails training a
neural network to eliminate the blurring effect of Gaussian noise on an image, resulting in the significant advantage of
producing sharp and detailed features.

2.2.2 3D-Aware Image Synthesis. This approach extracts latent vectors from the latent space and decodes them into a
target representation by using GAN. Generally, the generation pipeline is for an image with 3D awareness as a result,
and it also starts with an image as a generative source.

2.2.3 3D Representations. These two types of 3D generation yield diverse representations of 3D scenes in computer
vision and computer graphics. The 3D representation in 3D shape generally includes explicit representations, such
as voxel grids, point clouds, meshes, and implicit neural fields. A 3D-aware image includes depth or normal maps,
voxel grids, neural fields, and hybrid representations. The integration between them and the architecture generation
also varies. For example, a point cloud is often considered an input source for training generative models. The 3D
representation is articulated in existing survey research as a classification (Fig. 6). Below are brief descriptions.
Architectural design prefers explicit representations due to the controllability, familiarity, visualization, and availabil-
ity regarding modifying in 3D modeling software. Explicit geometric representations are more accessible to visualize
and interpret as they directly represent 3D space. The designers can precisely position and adjust each point or voxel,
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 11

allowing for more accurate control over the shape and form of the generated geometry. Nevertheless, implicit represen-
tations (neural fields) have considerable possibilities in architectural research regarding their benefits to offer more
flexible, continuous, and efficient representations of geometry.
Voxel grids. It refers to a three-dimensional grid of values organized into rows and columns. The grid contains rows,
columns, and layer intersections, referred to as a voxel, i.e., a miniature 3D cube [40].
Point clouds. A point cloud is a distinct collection of data points in space [151], which might indicate a three-
dimensional form or item through Cartesian coordinates (X, Y, Z) assigned to each point location.
Meshes. A 3D mesh is the polygonal framework upon which a 3D object is built [131]. Reference points along the X,
Y, and Z axes describe the height, breadth, and depth of a 3D mesh’s constituent forms. It is important to note that
creating a photorealistic 3D model sometimes requires many polygons.
Neural fields. It creates images by using traditional volume rendering methods to query 5D coordinates along camera
rays and projects the resulting colors and densities onto a 2D plane. Despite its use of depth data, the scene geometry is
rendered in exquisite detail, complete with intricate occlusions [116].
Hybrid representation. This refers to a hybrid pipeline of 3D representation for the pre-training in a 3D feature space
embedded in both the virtual and actual worlds. The hybrid pipeline can include multitudinous data sources and image
frame features [162], depending on the generation purposes of the 3D volumetry.

2.3 Generated Virtual Architecture with Deep Learning

Generating virtual buildings is an urgent and timely topic for virtual architecture since the traditional design process has
shifted due to AI intervention. As our scope mentioned, we only investigate design generation through deep learning
in the following sections. We summarize the current six approaches of 3D generation according to design process
modality from both 3D transposition and 3D solid form generation (see Fig. 7).

2.3.1 Related Works. Integrating deep learning into architectural design has significantly increased in recent years.
Initially, neural networks in deep learning were widely utilized for style transfer in 2D drawings of architectural
design (depicted as 3D transposition in Fig. 7), such as floor plans [23, 24], sections [186], and facades [48, 118]. Most
approaches employ a zero-sum game of these neural networks to blend architectural styles with design intentions and
then manually assemble these images into 3D buildings as a post-processing step [79]. However, this indirect method
poses significant challenges for designers due to the absence of interpretative embeddings in each image, such as
materiality, style, proportions, or program representation [60]. Additionally, a general lack of knowledge and expertise
in generation techniques regarding frameworks and algorithms in the architectural field leads to underexploration in
the design.
Recent works have shifted focus from 2D images to 3D shapes, utilizing techniques such as 3D GANs, VAEs, and 3D
diffusion models (depicted as 3D solid form generation in Fig. 7). Current approaches support architectural generation
from different modality inputs, such as text, image, and 3D, leveraging different model features. Generally, GAN utilize
labeled 3D datasets according to design meaning, such as architectural elements. VAEs generate 3D models utilizing
reduced 3D features recorded in latent space with an encoder-decoder. Although the application of 3D diffusion models in
architectural design is limited, their multi-modal integration capability and diverse methods suggest significant potential
for future use. Additionally, delving into 3D generation requires a higher technical threshold than simply applying
generative models. For instance, in the generation process, GANs require careful adjustment of hyperparameters for
stability, network depth, width, and kernel size to ensure performance and output quality. Additionally, considerations
Manuscript submitted to ACM
12 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

of computational resources, dataset characteristics, and random noise input to the generator are essential for effective
training [120].
Both 2D and 3D methodologies shed light on the evolving trend towards designer-assisted AI workflows rather
than mere AI-assisted design. Shi et al. underscored the importance of regulation and training by designers in such
designer-AI collaboration, alongside ideation, creation, visualization, and testing [163]. They outline four possibilities
regarding regulation and training for better aligning with design goals: curating training datasets, providing semantic
information, regulating AI’s roles and behaviors, and assisting AI in improving performance [163]. Architectural design
experiments have also propelled such pioneering endeavors. For example, Huang proposed an experimental approach in
semantic analytics through human interpretation [72], facilitating mutual learning between humans and AI to generate
cultural and architectural meaning. Overall, this shift enables designers to contribute to future architecture in virtual
and physical realms by transitioning from traditional design roles to "technical design."
We identified four key focuses by thoroughly examining the literature on architectural design employing deep
learning: dataset, multimodality, design intuition, and generative framework (see Table 2, Appendix). Notably, our focus
is specifically on architectural design and generation considering 3D form, aligning with the scope of our survey.

2.3.2 Dataset. Dataset preparation comprises data collection, data cleaning, and filtering, which is the first and foremost
step in deep learning-assisted architectural design. However, 3D building datasets regarding specialized design and
general applicability are currently scarce and underexplored. Despite being scrutinized, we found only three building
datasets that can aid in 3D form generation [120, 134, 144]. Selvaraju et al. first introduced BuildingNet, a large-scale
collection of 3D building models with labeled exteriors [160]. Building upon this work, Mueller developed the HouseNet
dataset, a selective collection of preprocessed models, which was subsequently utilized in her design work [120]. Besides,
Peters et al. introduced the 3DBAG dataset, containing 3D building models of the Netherlands [144], featuring multiple
levels of detail. The 3DBAG dataset was further leveraged by Vesely in deep learning-assisted architectural design
utilizing heatmap data for generating 3D geometry building [134], showcasing its potential in design.
Therefore, architectural design research has to prioritize building 3D datasets through manual pre-processing before
the generation process. This includes generating models using various tools, such as generating massive models
approach [158], scanning with specialized scanners [20], segmenting architectural elements [20], and formatting and
transforming data [120]. The lack of adequate and unified datasets hampers further exploration of broad user affordances
and democratized tool development.

2.3.3 Multimodality. Multimodality leverages natural human communication capabilities by integrating multiple
modalities, such as speech, gestures, and visuals, within a single channel. The current literature underscores the
significance of multimodality in deep learning methods, particularly in utilizing human understanding as input, such as
descriptive prompts and stylized images. Generally, these diverse design methods can be classified as follows, which
can also be divided into two groups: 2D and 3D generative approaches with deep learning (3D transposition and 3D solid
generation as shown in Fig. 7):
Text-to-Image with Understanding and Interpretation. This method utilizes generative images from text inputs,
such as keywords or descriptive languages, then manually translates images to 3D form by designers with design
understanding and interpretation. This process could generate more accurate and meaningful architecture synthesis
from images. It could also expand existing design cognition by style transfer, such as imaginary visual representation.
In addition, generative images in architectural design usually inspire to aid ideation, form finding, and result rendering.
However, advanced integrated models only rely on non-architectural datasets, such as DALLE2, Stable Diffusion, and
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 13

Fig. 7. 3D transposition leverages 2D generative images to transpose into 3D form in an additional manual or computational way. In
contrast, 3D solid form generation leverages 3D deep generative models, including 3D GANs, VAEs, and 3D diffusions, to generate 3D
form directly.

CLIP integrated with G. Diffusion, which leads to stiff outcomes lacking design intention. For instance, the generative
images fail to interpret further architectural meanings, such as "ecological architecture" or "parametricism" [60].
Text/Image-to-Image with Vision and Sights. This method combines visual assembly of inputting 3D models as
an image instance with the text-to-image method to render buildings. It only renders images as reference outcomes,
not implementing 3D models directly. This method enables a quantitative understanding of architecture and grants
designers greater agency [60]. Designers manually convert architectural context (e.g., massing areas, building code, site
constraints, and program distribution) into 3D form images as input. Although currently employed in real contexts, its
applicability in virtual architecture as 3D form input is acknowledged regarding context and norms. In this method,
different models offer designers varying flexibility in adjustments. For instance, Stable Diffusion permits adjusting
model parameters and image settings (e.g., image size, clip guidance scales, seed), while DALLE2 does not.
Text-to-Image-to-3D. This method conducts a pipeline combining text-to-image with high-automatic 3D computa-
tion by extracting 2D depth information from generative images. Such 3D computation approaches encompass various
approaches, such as NeRF, LeRes, and Grasshopper. Current works display the generation of building interiors and
exteriors considering various conditions. The latest research developed an integrating tool hosted in Grasshopper to
empower architectural design to integrate this mixed approach [60].
Text-to-3D with Language. Advanced deep learning models enable 3D shape generation through descriptive
language input, such as 3D Diffusion and Point-E. However, this method has not been widely utilized in architectural
design due to a lack of architectural element detail, as our investigation revealed. We delve deeper into this topic in
Section 3.

2.3.4 Design Intuition. Design intuition refers to processing generative techniques to cooperate with intended design
goals [20, 93], determining the architectural meaning of design outputs through extended interpretability. Although
many factors can affect the design results aiming to enhance design intuition, we list the two most mainstream ones:
Generative images could be tailored to the design purpose in 3D transposition approaches by incorporating the desired
Manuscript submitted to ACM
14 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Fig. 8. A systematic taxonomy for reviewing generation approaches on virtual architecture design with DGMs.

style. For instance, Ren et al. utilized a StyleGAN to blend the Gothic architectural style with an original plan by
inputting the real essence of a Gothic-style floor plan [150].
Preprocessing datasets could also facilitate the design intuition in both 3D transposition and 3D solid form approaches
[20, 93], which ensures the dataset contains the necessary and accurate architectural elements imbued with "design
meaning." However, the current approach often involves preprocessing data by designers according to specific design
requirements, demanding significant human and computational effort to drive interpretability toward design objectives.
For example, Çakmak conducted architectural element segmentation on scanned point clouds to generate meaningful
components with a GAN and an encoder-decoder architecture [20].

2.3.5 Generative Framework. This refers to adapting generative frameworks regarding workflows and generative
models to correspond to specific design objectives. Rather than simply applying existing models, some recent research
endeavors are integrating, adjusting, and developing frameworks. For example, Kahraman et al. investigated the post-
integration workflow of a flexible, customized auxiliary discriminator network to enable multi-objective control of
generation, catering to the combination of multiple abstract criteria for design purposes [83].
Furthermore, some studies have shown promise within generative frameworks in connecting 3D geometries and
algorithmic manipulation, as well as merging geometry utilizing latent space [114, 158]. This exploration involves
investigating hyperparameters and loss functions in GANs while variable manipulation in the latent space of Variational
Autoencoders (VAEs). For instance, Mueller examined various hyperparameter options and combinations in GANs to
understand their impact on generating meaningful single-family home buildings [120]. In contrast, Sebestyen et al.
delved a VAE deeper into manipulating geometry by adjusting variables in latent space and incorporating semantic
information to achieve intentional output results [158].

3 GENERATED 3D ARCHITECTURE: A PARADIGM SHIFT

To examine the gap between the two fields and the potential advantages of each approach for virtual architecture, we
investigate various generative approaches using different DGMs across the fields of CV (Fig. 8) (Table 3 and Table 4,
Appendix) and architecture (Table 5, Appendix). Since the existing related architectural works do not systematically form
solid design philosophies, we elaborate on these works following their emphasized generative approaches, including 3D
transposition (Section 3.1), generation by GANs (Section 3.2), VAEs (Section 3.3), 3D-aware image synthesis (Section 3.4),
and diffusion models (Section 3.5). The first approach relies on additional transposition from generative images, while
the others generate geometry directly.
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 15

3.1 Constrained Approaches Using 3D Form Transposition

In the past, DGMs have primarily focused on 2D image processing tasks, such as texture synthesis [54], style transfer [84],
photorealistic image generation [131], and text-to-image synthesis [131]. Architectural research has similarly centered
around 2D to 3D transposition using image-to-image translation networks like Style Transfer [105, 138, 150, 192],
StyleGAN [38, 84, 192, 194], Pix2Pix GAN [42, 186], and CycleGAN [48].

3.1.1 3D Form Transposition. This approach converts generative images into 3D representations through computational
calculations or manual adjustments. The process starts by segmenting a 3D model into discrete images, such as sections,
plans, and projections from various viewpoints. For example, Zhang and Blasetti combined two architectural forms
using generated 2D-pixel images through discrete section slicing and multiple views [192]. Compared to direct 3D
approaches, this method offers easier usability and higher computational efficiency as it is based on generative images.
Research in this approach utilize traditional architectural design principles to manipulate spatial features in 2D
images on the plan [8, 134], section [11], facade [37, 39], and perspective images [72, 89]. For instance, Asmar and
Sareen employed a StyleGAN to leverage latent space for image generation, followed by 3D voxelization using vector
arithmetic, with a specific focus on generating a single building [8]. Vesely also utilized Pix2Pix GANs while generating
building massings within urban environments by training on high-fidelity 3D building models 11 , and incorporating
multi-layered site context information [134]. Whereas Bank et al. experimentally developed building forms with 3D
point clouds from sampling generative images aiming for teaching methods [11]. In contrast, some research utilize pixel
projection to convert 2D-pixel values, generated by latent walks, into a corresponding 3D model. [37, 39]. In these works,
Del Campo et al. innovatively generated a house emancipating from a canonical AI-generative approach by intentionally
challenging curve-fitting techniques and embracing data scarcity or overfitting [37]. In contrast to prior approaches,
some researchers utilize latent space for downscaled perspective information storage, condensing 3D data into 2D images
for manipulation. For example, Huang et al. generated rendering results with various perspectives by interpolating
images generated from the latent space of GAN [72]. This approach extends prior research by incorporating a rotation
of latent space for 3D perspective drawing [89]. This reflects a critical focus on spatial perception in DL-assisted
generation, following architectural principles, and interpreting scale from various perspectives.
The 3D transposition approach exemplifies how freedom of computation and extensive manipulation shape the
future of virtual architecture with its focus on intricate variations. Given that generative images are lightweight and
easy to process, this method ensures high-resolution retention in both input and output. However, this method sacrifices
geometry controllability and production efficiency for a generation. Additionally, significant human labor is required to
establish an automatic workflow. These shortcomings have urged a paradigm shift in generative approaches towards
direct geometry production.

3.2 Approaches Using GANs

GANs generate 3D form through training distributions by playing zero-sum games with uncertain but high-quality out-
put. Thus, they lead to typical applications in testing the agencies of AI, opting for conceptual design, and democratizing
design.

3.2.1 3D Shape Generation with GANs. GANs have emerged as a 3D generation method from a probabilistic space in
convolution networks, capable of producing various explicit representations, including point clouds [19] or voxel grids
11 The dataset is 3D BAD. An overview of 3D BAG, source: https://docs.3dbag.nl/en/
Manuscript submitted to ACM
16 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

[45, 56, 62, 104, 110–113, 133], as well as implicit neural functions like occupancy fields and signed distance functions
(SDF). Wu et al. employed a generative adversarial network architecture to generate 3D voxel grids by capturing the
probability distribution of 3D shapes [179]. Although approaches like PLATONICGAN and IG GAN also produce 3D
voxel grid models from unstructured 2D image data using GAN [67, 111], the drawback of this method for voxel grids
is the inability to achieve fine-grained voxels due to the cubic increase in computational cost. Additionally, point cloud
representation, obtained as raw data through depth scanning, presents various challenges in GAN-based generation,
including convergence issues, utilization of local contexts, and high memory consumption [2, 7, 74, 148, 166]. Mesh
representation, commonly used in 3D modeling software, faces difficulties in applying deep generation models due
to non-Euclidean data and challenges in connecting mesh vertices to compose shapes [165]. While methods like the
multi-chart approach address irregular mesh structures [16], approaches like Get3D enable high-quality geometry and
texture from 2D image collections by incorporating differentiable surface modeling and rendering into GANs [53].

3.2.2 3D GAN Utilized in Architecture. This refers to generating 3D solid form through a direct 3D data acquisition,
evaluation, transformation, and rearrangement using deep generative models [168]. This approach, aligning training
datasets with spatial design parameters, usually achieves specific design objectives. Specifically, designers utilize datasets
with design meaning by labels or semantic interpretation. For instance, Immanuel Koh employed 3D GAN to generate
massive Singapore high-rise buildings by labeling building exteriors and interiors of training models [93]. Additionally,
researchers and designers have fully probed dataset forms related to architectural structures to address the deficiency
in generating precise geometries. Zheng and Yuan innovated training coordinate vectors of building surfaces in neural
networks to find forms based on NURBS 12 with different design features [196].
Innovating generative methods using 3D GANs for fidelity and accuracy remains a novel and timely topic in AI-
assisted architectural design. Some researchers integrate additional computational methods or generative models with
3D GANs to expand the range of design cases for 3D form generation. For instance, Puteneers combined Houdini
with 3D GANs as a form-finding tool to test the algorithms’ capabilities in creating artifacts as AI agency 13 . Similarly,
Cakmak added a pair of encoder-decoders to a GAN framework to process and save data for generating 3D models [20].
This approach aims to extend design cognition by incorporating AI as an agent to understand the mind in cognitive
science through embodied action.
Researchers have also increased their focus on training and deployment techniques for design purposes. For instance,
Mueller developed a new GAN architecture capable of generating building geometry with incorporated structural
features in the dataset [120]. This work emphasizes testing different inputs for the generator and discriminator,
analyzing the training process, and exploring various hyperparameters. Such progress demonstrates a commitment to
autonomously generating building geometry using 3D GANs, leveraging automation and advanced technologies to
address architectural and urban design challenges.
Although GANs are widely recognized as the primary tools for deep-learning-assisted architectural generation, they
exhibit limitations in architectural design applications, including issues with style variation, category singleness, design
unpredictability, and topological inconsistency. These challenges drive designers and researchers to explore diverse
alternative approaches in architectural design.

12 Non-uniform rational basis spline, a mathematical model using basis splines to represent complex curves, surfaces, and solid form.
13 Source: https://putteneersjoris.xyz/projects/Ugly%20Stupid%20Honest/ugly_stupid_honest.html

Manuscript submitted to ACM

A Survey on Deep Learning for Design and Generation of Virtual Architecture 17

3.3 Approaches Using Latent Space of VAEs

VAEs result in an accurate 3D generation but lower resolution, compared to GANs’ uncertain while high-quality output.
Thus, their applications are commonly generated for different criteria and scenarios against visual output. For instance,
Azizi et al. generated reliable and plausible architectural compositions by a VAE through encoding and decoding
information about people’s spatial movements and activities [9].

3.3.1 3D Shape Generation with VAEs. VAEs utilize pointwise loss to find a probability density by explicit representations
to obtain an optimal solution by minimizing a lower bound on the log-likelihood function. Brock et al. pioneered
the first VAE for processing 3D voxel grids to address the instability issues inherent in GAN approaches [18]. This
method employs a pair of encoder-decoder architecture: the encoder, comprising four 3D convolutional layers, maps
information to latent vectors, while the decoder reconstructs these vectors into 3D voxels. Subsequent work has focused
on enhancing the quality of voxel reconstructions, particularly in producing smooth rounded edges [117]. For point
cloud representation, the challenges associated with GANs’ instability have prompted the development of alternative
generative models based on encoders. Variational autoencoders (VAEs) and adversarial autoencoder models (AAEs)
have emerged as viable alternatives [189]. Similarly, generating meshes with VAEs presents challenges akin to those
faced with GANs. Given the complexity of processing topology, approaches like the multi-chart parameterization of
meshes have been proposed to handle irregular mesh structures [16]. Several methods have sought to simplify this
process, including SDM-Net-based approaches like TM-Net, which defines a textured space on a template cube mesh
[54, 55, 123].

3.3.2 VAEs Utilized in Architecture. For research on architectural generation, VAEs extract information through the
latent space with a pair of encoder-decoders, which is compatible with various design purposes but is seldom utilized
in architectural design due to high technical access. As one pioneering research of 3D generation, Miguel et al. used
labeled connectivity vectors extracted from rectangular cube structure "3D-canvas" as data representation because of
the usability and compact regarding 3D feature structure [114]. The outcome is achieved by 3D voxelized wireframes of
multiple architectural forms with different styles and strengths through learning continuous latent distributions of VAE.
Based on this work, Miguel et al. further tested parametric augmentation for a larger dataset of 3D geometries [115].
By contrast, some work concentrates on post-integration assisting in training according to the designer’s preference
after the VAE generation in the whole production. For instance, Kahraman et al. produced a series of chairs with 3D
voxelization while comparing multi-object VAE and GAN [83]. This application discussed pre-defined criteria ranging
from leisure to work scenarios.
Although VAEs are highly capable of incorporating 3D features, their current application is limited to generating
specific targets with similar features, such as single buildings. They also produce low-resolution outputs because they
use a variational loss to approximate the probability density by optimizing a lower bound on the log-likelihood function.
Additionally, the high technical threshold hinders research and applications for non-experts in AI.

3.4 Approaches Using 3D-Aware Image Synthesis

Most 3D-aware image synthesis methods utilize GAN-based models to sample latent vectors and decode them into target
3D representations. Thus, they offer 3D view-consistent rendering, efficient representation, and interactive editability,
as well as integration capabilities with other deep learning generation techniques [165], which could contribute to
virtual architectural generation.
Manuscript submitted to ACM
18 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

3.4.1 3D-Aware Image Synthesis and Its Editability. 3D-aware image synthesis achieves 3D view-consistent rendering
by relying solely on supervised 2D images and employing different neural rendering techniques. Recent research has
focused on integrating GAN-based models for generating 3D-aware images [27, 129]. For instance, HoloGAN is capable
of unsupervised learning from unlabeled 2D images, eliminating the need for pose labeling, 3D shape, or consistent
views [129]. It represents a significant advancement in unsupervised learning from natural images. Studies, such as
Pi-GAN [27], StyleSDF [136], and StyleNERF [59], have demonstrated improvements in two critical areas for 3D-aware
synthesis: resolution and multi-view consistency of synthetic images. For instance, StyleSDF, based on signed distance
functions (SDF), produces detailed 3D surfaces, yielding visually and geometrically superior results [136]. Furthermore,
cutting-edge techniques integrate 3D-aware images with CLIP models, enabling 3D geometry generation from natural
language descriptions [78, 175]. DreamFusion, for example, enhances generation efficiency through a loss derived from
the distillation of a 2D diffusion model [145]. Notably, advancements in 3D-aware synthesis extend beyond pre-trained
2D image-to-text models, with methods like DreamFusion incorporating diffusion models as strong image priors, further
improving generation efficiency. Additionally, 3D-aware synthesis can integrate other deep generative models, such as
recent diffusion models, which will be discussed in the following subsection.
The primary objective of 3D-aware image synthesis is to achieve explicit control over camera pose, enabling more
engaging and interactive user interactions with scenes. Some approaches also support object pose editability; for
example, GIRAFFE allows panning and rotating 3D objects in the scene [132]. Similarly, StyleNeRF enables alteration of
style attributes, supporting style blending, inversion, and semantic editing of the generated results [59]. This editability
offers various solutions for refining generated targets from different perspectives.

3.4.2 3D-Aware Image Utilized in Architecture. 3D-aware synthesis shows promise for virtual architecture regarding
representation conversion, controllability, and multi-modality with linguistic descriptions. As OrEl et al. demonstrated in
StyleSDF, converting implicit representations to explicit geometry enables designers to comprehend and edit architectural
elements with precision visually [136]. Similarly, Gu et al. showcased in StyleNeRF how explicit control over style
attributes allows for creating diverse architectural designs with semantic editing capabilities [59]. Recent advancements
include Meng et al.’s introduction of a Colab configuration for creating buildings based on DreamField, which supports
conditional text input, parameter adjustment, and style attributes 14 . Additionally, Guida et al. have generated buildings
with 3D representation by leveraging NeRF from generative images produced by Stable Diffusion [61]. This method
expands the potential for generating a wider variety of complex forms. This implementation underscores the practical
application of 3D-aware synthesis in architectural design, providing designers with intuitive tools to realize their
creative vision. However, concerns arise regarding its effectiveness, which is often limited to simple objects lacking high
resolution and precise internal structures due to non-design datasets. To address these limitations, Guida et al. utilized
integrated stable diffusion with text-to-image synthesis to supplement non-design datasets, effectively enhancing the
diversity of form-finding [61]. Nevertheless, the efficiency of generating complex architectural structures with both
internal and external details remains constrained by the characteristics of deep generative models.

3.5 Emerging Approaches Using Diffusion Models

3D diffusion models are capable of high quality with fine details in a robust and controlled way for 3D generation,
particularly complex shapes with intricate details and specific attributes. Thus, their controllability and descriptive
input will contribute to deep-learning-assisted virtual architectural generation in the future.
14 Source: https://github.com/shengyu-meng/dreamfields-3D
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 19

3.5.1 3D Diffusion and Its Manipulability. Recently, 3D diffusion has gained a growing interest in generating 3D
shapes due to its outperformance abovementioned. Despite the flexibility of conditional diffusion sampling, previous
diffusion models are limited to sampling pixels of images. In contrast, Ben et al. have innovatively applied diffusion
models to generate 3D models directly through applying denoising in high-quality 3D-aware image synthesis named
DreamFusion [145]. This work sparkes various follow-up approaches regarding text-to-3D (e.g., Magic3D [103], Pro-
lificDreamer [176]) and image-to-3D (e.g., PVD [198], Magic123 [147], MVDream [164]) with improved generation
efficiency, resolution of geometry and texture. For instance, LION concentrates on greater flexibility regarding operation
and application by leveraging conditional synthesis and shape interpolation within a combination of diffusion models
and VAE [191]. While Magic123 with image input focuses on reconstructing a high-resolution and textured 3D mesh
using joint 2D and 3D priors [147].
Diffusion models facilitate the generation of 3D shapes with outperformed controllability in two ways: modifying
style attributes and manipulating conditioned descriptive text. Firstly, most methods allow modification of specific
style attributes, such as shape and texture, during the generation process [59, 78, 132]. This progress suggests that
diffusion models offer a robust and controlled approach to 3D shape generation, particularly for complex shapes
with intricate details and specific attributes. Secondly, some burgeoning methods feature text-to-3D generation with
human interpretative intention by manipulating text prompts [28, 103, 109, 130, 145]. Earlier work utilized CLIP for 3D
generation and manipulated text prompts, such as PointE [130]. Comparably, current works have improved output
results by incorporating stronger synthesis priors, such as DreamFusion [145].

3.5.2 3D Diffusion Model Utilized in Architecture. Pioneering design works utilizing diffusion models have proliferated
recently. While most of these works only utilize diffusion models to generate 2D images coupled with additional
computational methods for 3D transposition, such as LoRA and NeRF [61]. For instance, AIG studio’s work 15 has
achieved a modeling framework through stable diffusion, generating plan drawings by relying on heatmaps of urban
morphology and additional LoRA models. Comparably, an alternative method employs diffusion models for nonlinear
shape interpolation to create a variety of 3D forms [94, 108, 159]. For example, one investigation utilized a 3D point
cloud probabilistic diffusion model to generate Taihu stone in specified target shapes [108]. Similarly, another study
adopted volumetric density grids as the 3D representation within denoising diffusion models, providing a range of
possibilities for architectural design [159].
Some initial research in design has also leveraged diffusion models regarding integrating with human interpretation,
interaction usability, and multimodality. An earlier work by Zhang developed a framework capable of compiling analytic
semantic information from input texts, then generating 3D architectural form according to descriptive language [193]. In
this framework, different usages of spaces were trained with adjacent matrices to understand the linguistic instructions.
The approaches of modifying attributions and conditioned descriptive text broaden the usage case for other layman
users by interpreting textual prompts. For instance, Magic 3D has developed a toolkit that offers advanced control over
image instances of 3D generation regarding stylization and content based on text prompts [103].
Additionally, recent work has contributed to integrating tools to improve common tedious workflows [60], such as
constantly transforming models between multiple software or methods. In this work, Guida developed a plug-in with
a user interface embedded in Rhino, integrating multiple approaches of 3D form generation for designers [60]. This
work underscores the significance of multimodality for designer demands. The cutting-edge diffusion model’s high

15 Source: https://www.bilibili.com/video/BV1Qb411Z7UP/
Manuscript submitted to ACM
20 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

resolution, denoising, and rapid generation democratizes 3D geometry generation, providing access for individuals
with varying levels of expertise to produce creatively.

4 CHARACTERISTICS OF GENERATED VIRTUAL ARCHITECTURE IN DESIGNER-AI COLLABORATION

We found some shared methodology characteristics of generative building utilizing deep learning, including scale, scope,
operability, and autonomy. Each characteristic has a different degree regarding different design goals and generative
models.

4.1 Scale
Scale refers to the incorporated model sizes regarding generative capability in architecture, encompassing both the
scales of the models themselves and the design context. Large scale pertains to generated city and street areas with
diverse design contexts, such as landscapes and river areas or functional planning. Conversely, small scale refers to a
generated single building with exterior or interior concerns while lacking environmental design context. The degree of
scale is influenced by two factors: the capability of incorporated composite workflows and the divergent features or
capabilities of generative models.
Divergent generative frameworks possess different capabilities in generating model scale. Firstly, the capability
of incorporated composite workflows enables the generation of larger-scale models in multiple phases or modules
in a generation. For example, Tencent proposed a large-scale automatic generation solution of 3D virtual scenes
incorporating various methods in three modules: city layout generation, building exterior generation, and interior
mapping generation [97]. GANs exhibit such compound capability. Secondly, visual processing capabilities in deep
generative models like GANs facilitate the generation of larger-scale models utilizing plan images. For instance, Kim et
al. adopted two subtasks of GAN and CNN (convolutional neural network) to construct a 3D city model through scene
parsing, city property vectors, and terrain maps. In this work, terrain maps of the city plans are crucial for generating a
large city model from the plan.
Furthermore, different models possess various tolerances regarding design context. Firstly, a more hybrid workflow
leads to a wider range of design contexts. For instance, the diffusion model can handle broader design contexts since
it can merge mixed methods for generation, such as vertical generation by depth map, 3D synthesis by NeRF, and
descriptive language. Secondly, the higher capability of accommodating data diversity and complexity results in a wider
range of design contexts, such as with GANs and diffusion models. In contrast, VAEs, while proficient in learning
latent representations and generating data, are limited in capturing complex spatial and contextual relationships due
to model complexity and technology limitations. For example, Miguel’s VAE model to generate a single wireframe
3D building worked with around 150 million parameters, accompanied by only 10% of the dataset’s samples [114].
Therefore, working with larger geometries requires a significantly increased amount of data to balance training and
validation. Additionally, the number of input parameters scales proportionally with the bounding volume of the input
geometry when scaling to large structures, including conditioned context, increasing exponentially.

4.2 Scope
Scope refers to the range of generation or collaboration involved in a design process with deep learning assistance.
A wide scope typically involves employing more complex processing techniques to delve into deeper relationships
between design and generative modeling. Conversely, a specific scope usually entails addressing a particular and tangible
objective. Recent work has concentrated on complex processing with a wide scope, such as semantic interpretation
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 21

[72, 158], and latent space understanding [73], in addition to incorporating additional workflows for design criteria
[83], particularly in 3D solid form generation approaches. This emphasis stems from the fact that design fields are in
the early stages of development and require a comprehensive understanding of deep learning.
The degree of scope depends on two aspects: the involvement of process scope in the design and deep learning-
assisted design or human-assisted design. Firstly, involvement in more design phases represents a wider scope, including
data training and processing, solution set generation, and evaluation (as illustrated in Section 2.3). For instance, Mueller
conducted the entire scope of design work involving pre-processing datasets, refining and fine-tuning models, and
evaluation based on GANs to test the impact of deep learning on design output [120]. By comparison, Zhang et al. merely
applied GANs to generate a set of section images for 3D transposition without involving other phases, representing a
specific scope [192].
Secondly, the work involving human-assisted design has a wider scope than deep learning-assisted design regarding
design generation since it accommodates additional processes beyond design generation, such as dataset augmenta-
tion [114] and semantic labeling [193]. For instance, Huang et al. extended architects’ capabilities by combining the
GAN generation with analytical interpretation processes of NLP (natural language processing) instead of solely focusing
on design generation [72]. Such research addresses the limitations of DL-assisted design generation and compensates
for them by incorporating a wider scope of design perspectives. In contrast, most studies relying on generative images
only investigate DL-assisted design without addressing other concerns [186, 192].

4.3 Operability
Operability refers to the extent of technical or design operation involved in design collaboration with DGMs. Low
operability indicates the straightforward application of DGMs to design tasks, while high operability signifies the
ability to operate within the technical space in a work with improved functionality and efficiency. Technical operations
encompass various types across different models, including fine-tuning model parameters [120], refining training
procedures [20, 120, 158], customizing architecture [83], and adapting methodology applicability [73].
The degree of operability is influenced by the complexity of technical operations and the applicability of methodology.
Firstly, the different focuses of design work contribute to varying levels of technical operations and the applicability
of the methodology. For instance, Veselý demonstrated high operability by unifying methodology applicability and
fine-tuning GAN models for different scales and scopes, including adjusting hyperparameters and optimizing the
training process to achieve better convergence and output quality [134]. Similarly, Huang et al. explored the latent
space potential by connecting structured segmentation of text and visual outputs in generative images of the diffusion
model, establishing robust analytical frameworks operation [73]. Secondly, different designs exhibit varying levels of
applicability due to model divergence. Generally, VAEs require relatively high operability from professionals since it
processes low-dimensional 3D feature data of buildings. In contrast, diffusion and GAN offer various levels of operability
with flexibility ranging from application to fine-tuning, while 3D-aware synthesis requires an intermediate level of
operability.

4.4 Autonomy
Autonomy refers to the level of automation in the generation process facilitated by deep learning techniques. High
autonomy indicates a largely automated process driven by a unified workflow, while low autonomy involves manual
intervention, requiring the connection of workflows and greater ease of use. In architectural design, most works exhibit
relatively low autonomy to ensure quality and design accuracy under the guidance of designers. For example, Mueller
Manuscript submitted to ACM
22 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Research Pre-process and training Adaptive integrating Explainability, Interpretation, Editability,

Agendas dataset for designer-assisted AI tool Participatory design in human-AI Explainable AI

Various formats
GAN Characteristic VAE Single format
(Architectural components) Dataset Dataset
(3D features)
Multiple Design intuition Design intuition Specific
Scale
3D form generation from 3D form features mapped from
Diverse Multimodality probabilistic spaces continious latent space Multimodality N/A
or 2D image sets
Generative Generative
Many research Scope All research
model model

Various formats
(Architectural components) Dataset DPM 3D-Aware Image Synthesis Dataset Non-specialized
Operability
Relatively Specific
Multiple Design intuition Design intuition (Scenario, stylization)

Flexible and 3D form generation 3D scene generation coherently

Multimodality Flexible and
diverse progressively and contextually with various Multimodality
Autonomy diverse
refined through diffusion steps attributes.
Generative Generative
Little research N/A
model model

Fig. 9. A summary diagram illustrating the connections among key sections, including the overview (Section 2), generative approaches
(Section 3), and characteristics of these approaches (Section 4). Accordingly, these high-level issues, identified within the four key
sections of our survey, serve as the basis for further research agendas (Section 5).

validated the methodology across diverse scenarios by selectively retrieving spatial data and comparing the performance
of proposed generative models [120]. However, recent advancements show a proliferation of high autonomy, especially
in 3D generation, focusing on task efficiency [90, 97]. Nevertheless, there is a growing trend towards higher autonomy in
design work to broaden access and enhance usability. For instance, Guida developed an integrated tool within the visual
scripts of Grasshopper with high autonomy, combining all processes and modules in architectural design, providing
convenient input and output modules across different approaches [60].
The degree of autonomy is influenced by the complexity of the design purpose and model processing requirements.
Firstly, researchers tackling challenging and complex design objectives, such as connecting generative models with
design output, tend to exhibit low autonomy. For instance, Mueller endeavored to design goals and each step of
generative models, including training datasets and key hyperparameters [120]. Secondly, the model processing with
frequent transposition across platforms and tools in design leads to low autonomy. For example, Özel incorporates
additional procedural steps after generating a 2D image, requiring high reliance on manual designer intervention [138].
Thirdly, the absence of unified criteria further leads to low autonomy. Özel’s work, for instance, involves experimenting
with various 3D modeling software and photogrammetry, indicating low autonomy and hindering further development
in architectural design towards [138].

5 RESEARCH AGENDAS
In the previous sections, we explored the urgent need for HCI and DL-assisted methods in generated virtual architecture.
We thoroughly examine the literature and elucidate four key focuses, as mentioned in Section 2.3. These key focuses
serve as the foundation for our research agendas, formulated from a perspective of high-level questions (Fig. 9). Our
investigation of current literature and works also reflects on these questions. Therefore, we advocate the following
research agendas to promote research in this area.
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 23

5.1 Agency Between Designer and Machine in Data Processing and Training
Agency between humans and machines emerges as a significant feature in AI-assisted design and designer-AI collab-
oration. On the one hand, a human-driven agency faces challenges due to low-resolution and incomplete datasets,
requiring human intervention [163]. On the other hand, an AI-driven agency raises concerns about intellectual property
issues but offering higher efficiency [60]. Our analysis of the survey identifies two main barriers to achieving a balance
between human and AI agencies: flexible datasets and human-assisted machine learning.
Flexible and comprehensive datasets are crucial for achieving balanced agency, supporting various design purposes
with precision and semantic information. However, limitations in dataset availability pose significant challenges in both
industries and academia. As noted by Regenwetter et al., three main issues result in this challenge: lack of available 3D
datasets, insufficient data size, and data sparsity and bias [149]. The limited availability of 3D building datasets restricts
massive computation in 3D solid generation, hindering designers’ ability to find appropriate datasets. Correspondingly,
our review expounds that only a few studies contributed to the 3D building dataset (N=3). While datasets like Point-E
and GET3D show promise in improving efficiency and reducing designer labor, architectural work highlights the
complexity of preprocessing datasets to suit various building scales and design purposes.
In addition to addressing dataset limitations, human-assisted deep learning could be a viable solution in design
work, including training (e.g., semantic analysis [72, 158, 193]) and regulation (e.g., post-processing tasks [83]). For
example, Liu et al. regulated AI’s behavior for screening target outputs with manual labels based on certain architectural
criteria of 3D form configurations from a generative script [106]. Our analysis of the survey reveals that human-assisted
machine learning is an emerging and underexplored method, with few studies focusing on this area (N=4). It holds
significant potential for specialized design and production pipelines in DL-generated architecture, such as building
semantic interpretation and regulating design criteria. Accordingly, some researchers question whether future designers
working with training data in established deep learning workflows may spend more time than on result output [158].
Additionally, data customization could offer a solution by tailoring datasets to specific formats, precision, and sizes
for multiple design purposes, requiring collaboration among enterprises, managers, platforms, and users. Therefore,
further research on agency between generation and design requires close collaboration with technicians, designers, and
producers through human-assisted machine learning.

5.2 Augmenting Communication Between Design and Artificial Intelligence

To benefit the design project’s fitting purpose, HCI researchers and designers should augment communication with
each other by promoting tools or enhancing workflows between design applications and deep learning. We found two
aspects closely related to this agenda: explainability in designer-AI collaboration and editability in the generative model.
Explainability. Explainability is one recognized principle of communication within designer-AI collaboration, essential
for fostering trust in AI systems and guiding the design process and outcomes [119]. However, a significant disparity
exists in the developmental progress between HCI and the design field. For instance, HCI researchers have made tremen-
dous strides in responsible AI for explainability, focusing on user experience, trustworthiness, and collaboration [152].
The design field lags due to existing challenges and a lack of expertise among designers in negotiating input, process,
and controlling output. According to our analysis of the survey, no one has addressed this critical issue. Regardless,
recent studies (N=2) have primarily investigated methods for controlling output by manipulating detail in algorithms
within generative models to achieve greater controllability in design outputs. Therefore, future research in HCI and
design should delve deeper into explainability in design and generation processes building upon design principles.
Manuscript submitted to ACM
24 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Editability. Editability is the other key to communication, relying on timely feedback and adjustments to model 3D
objects. Similarly, there is a big gap in development progress between the generation technique and design fields. Prior
work has demonstrated editing autonomy in 3D shape generation techniques available to users. For example, Liu et
al. proposed a user-centric 3D shape generative method by drawing a target model as 3D voxel grids [107]. While a
few works demonstrate some indirect approaches to editing 3D models with implicit representations [41, 65, 75, 197].
According to our survey, although none of the architectural works mainly contribute to editing in 3D generation, one
indicated a potential for variability in 3D form. This work generates infinite continuous forms by interpolating 3D
features among multiple wireframe representations with compact implicit functions [114]. Through this interpolation
with implicit representations, we could envision a broad use for virtual architecture in supporting itself with real-time
dynamic morphology. Thus, the editability potential of implicit representation incorporating building form launches
the future agenda for deep learning virtual architecture design.
In contrast to 3D shape generation, 3D-aware image synthesis precludes direct user manipulation in 3D space and
requires latent vector editing to control the composition, shape, and appearance. These latent vectors model all other
variables that are not captured by physical factors while being able to control small changes in the scene, such as lighting
and coloring [182]. Furthermore, several approaches allow additional inputs to alter the editing of the scene, such as
textual descriptions, semantic labels [52, 77, 109], images [153], and parameter controls. Therefore, this approach could
be the complementary editing layer to deep learning-assisted design. For instance, 3D representation from a user end in
a UI panel can be controlled according to architecture settings. Therefore, this editability is one of the key agendas for
the rapid popularity of virtual architecture with AIGC-driven 3D designs.

5.3 Enhancing User Consideration with Diversity and Collectives under Designer-AI Collaboration
Designs involving multiple users necessitate inclusivity by incorporating usage affordances and considering wide
needs for virtual architecture since the virtual environment pertains to shared environments for collectives and
individuals from diverse backgrounds and cultures through converging and engaging, leading to highly multi-social
experiences [13, 81]. Accordingly, we identify two focal conflicts regarding this agenda: the balance between collective
and individual, as well as culture and efficiency.
Efficiency and Diversity. Deep learning enables the rapid creation of large-scale architectural environments but often
lacks diversity in catering to multiple users and their objectives. Conversely, diverse buildings may exhibit non-uniform
quality criteria, such as inconsistencies between manually and automatically generated structures [86, 185]. Regarding
this, spatial accuracy and level of detail are significant criteria for such virtual building quality [185], including scale,
size, granularity, and simulation fidelity [47]. Additionally, diversity also leads to a high cost of human resources [185].
Reliable solutions entail unified pipelines and highly automated computing or provide user-friendly toolkits with
standardized governance by platforms. For example, Tencent AI Lab proposed a high-automated solution for generating
3D virtual scenes with a unified pipeline consisting of various modules, supporting layman with industry produc-
tion [97] 16 . Similarly, Mozilla Hubs’ Spoke offers a robust toolkit for 3D scene editing catering to a wide range of
users.
Individuals and Collectives. A significant challenge arises from the conflict between individual-centric ideologies of
new liberalism and the collective nature of the virtual environment [124]. While the exhibiting of ownership, identity,
and recommended content prominence the precise value of an individual in expansive virtual spaces under new

16 Source: https://gdcvault.com/play/1028921/Recorded-AI-Enhanced-Procedural-City
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 25

liberalism [124]. Within this dynamic environment, we are increasingly accused of proliferating products manifesting
our unique identities and personal needs [124]. This situation leads to an urgent rethinking of the collaboration with
virtual worlds regarding the relationship between collectivism and individualism.
One potential solution is public participation in the design process. Designers have conducted participatory design
experiments positively, addressing multifaceted problems and designing for minority groups. Participatory design
allows non-professionals to make design decisions collaboratively alongside professional designers. The emergence of
creative collectives in architecture underscores the importance of trends in public participation [50]. However, our
analysis of the survey found no research on participatory concerns for 3D generation or DL-assisted architectural design.
In contrast, there has been a significant achievement in participation topics within designer-AI collaboration [167].
Therefore, we urge designers and researchers to emphasize the public participatory potential for virtual building by
referencing existing research.

5.4 Designing Adaptive Tool by Integrating Multiple Tools with Multimodal User Input and Output
Integrating tools results in wide access with a user-friendly interface towards rich user-created content in the future
metaverse. Although this vision has been underscored in the virtual environment industry, the adaptive tool for
diverse users to design fields is unseasoned. The reason behind this agenda is the lack of applications that integrate
multimodality and consider user experience.
Multimodality. As aforementioned, the previous tremendous progress in DL-assisted architectural design has been
built upon "visual" representation with the generative image since "visual" modality as design essential directly links
design purpose and output representation [163]. Natural language has also been pivotal in multimodality for design. As
Markus emphasized, "language is at the core of making, using, and understanding buildings" [51]. Correspondingly, our
analysis of the survey underpins future trends from the latest boom in text-to-3D and 2D-to-3D approaches in the 3D
generation field. However, very few architectural designs concentrate on this method due to its novelty. Compared
to the industry, huge success has been applied to user-end industrial tools, such as Midjouney and DALLE2, which
have wide access to diverse designers with various purposes and alter the design workflows. Given this sense, the
other modality combined with visuals could boost emerging designs from tool and technical perspectives, especially for
virtual architecture from 3D generation. Therefore, bridging visuals and language with 3D form as one research agenda
could reform architectural design, particularly in a virtual environment.
Integrating Tool. User-friendly tools would also inspire broadened usage, facilitating designers’ work. In this regard,
we anchored significant positions to illuminate this agenda from the industry-research gap and the technique-application
gap according to the analysis of our surveyed paper. Firstly, current architectural work still focuses on architectural
forms and innovative design methods connecting designers and techniques. According to our analysis of the survey, only
one research developed integrated tools for architectural design generation (N=1). Secondly, most research in CV fields
contributes novel algorithms and frameworks, ignoring further utilizing technique innovation in the design application.
Yet integrating tools is crucial to bridge industries, and research facilitates future virtual architecture development.
Our investigation provides such a critical lens for understanding the gap between architecture and such models to
further utilize such generative models for virtual architectural design, including Dreamfusion3D [145], Point-E [130],
and Get3D [53]. Therefore, we appeal to interdisciplinary research to conduct these untilled topics to bridge the gap
between the two fields to follow the industry’s advanced progress, especially for HCI and design researchers.

Manuscript submitted to ACM

26 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

5.5 Research Directions and Future Opportunities

Our survey unveils that this field necessitates interdisciplinary collaboration at the intersection of 3D generation, virtual
reality, sensing technologies, and wearable computing. Thus, we address three future directions of generated virtual
architecture.

5.5.1 Designing User-Centric Adaptive Virtual Buildings. Recent advancements in wearable technology incorporating
sensors and actuators have significantly enhanced the study of virtual spaces within a design framework [17, 46]. Virtual
architecture research has positively conducted interaction with the virtual environment by collecting user bio-signal
data [43, 70, 122, 128, 142, 161]. This trend embodies that feedback and interaction between the built environment and
humans are crucial for creating livable and adaptive virtual environments. Two significant opportunities for future
exploration emerge within this context: Firstly, integrating emotional computing systems into 3D generation approaches
is crucial for designing emotionally resonant spaces [70]. For example, Sheehan et al. demonstrated the simulation of
atmospheric qualities in VR to explore space and directionality, using thermal perception as feedback [161]. Secondly,
utilizing real-time data captured by ubiquitous computing and wearable devices allows for personalized and responsive
user interactions in virtual environments [64, 190]. For instance, Tosello creatively translated physiological data in the
presence of virtual scenarios, including thoughts and emotions, into digital space [173]. This approach facilitates the
creation of interactive virtual architectures that dynamically adapt to user behavior and preferences in real-time, thus
expanding the scope of research in responsive design. For instance, users could customize virtual buildings in real-time
by adjusting parameters and weights in generative models based on ubiquitous data collected from wearable devices.
Therefore, we envision the future of virtual architecture to be centered on interactive, self-responsive forms generated
by deep learning approaches.

5.5.2 Engaging Users in Governance and Content Generation: A Role Shift as Consumers. Virtual buildings, serving as 3D
digital resources of content and space, play a pivotal role in fostering the content ecosystem in the virtual environment,
which involves consumers, creators, and platforms. In a recent study, Ding et al. demonstrated an innovative approach
to user-personalized architectural design in virtual space, utilizing a user-friendly tool for designer-AI collaboration [44].
This tool showcases an interface and features tailored for creating virtual architecture, aiming to ensure accessibility
and ease of use. Tools featuring user-friendly interfaces and customization options can empower consumers to assume
governance and content-generation roles. Therefore, we believe multiple approaches enabling user engagement in
various roles represent the prevailing trend within the virtual environment context.

5.5.3 Generating Innovative Virtual Forms and Dynamic Morphology with Social Sustainability. Deep learning is pro-
pelling virtual architecture towards innovative forms shaped by algorithmic principles, surpassing conventional visual
geometries. The algorithmic form encapsulates the relationship between computational processes and the resulting
object characteristics [25, 26]. For instance, recent research has proliferated in aesthetics assessment [85, 154, 172],
regarding visual character [170], and visual sense [178]. In this context, an increasing number of social algorithm
proposals integrate social factors like environmental and social sustainability into deep learning frameworks. For
instance, Koh et al. employed 3D GANs to design a series of housing buildings ensuring sustainable economic balance
and social resident privacy [93]. Moreover, the instantaneous feedback loop in environmentally intelligent virtual
architecture aids its management and governance, streamlining activities like building reviews and further adaptation.
This responsive capability fosters a collaborative environment.
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 27

5.5.4 Ethical implications of AI-generated architecture. In an era of increased collaboration with AI, it is essential to
examine the ethical implications of originality and authorship, the morality of machine creation, and human rights
within a pervasive data environment. In this context, user privacy and authorship are of utmost importance.
Designers must work closely with AI experts and developers to ensure data privacy, including confidentiality,
integrity, and availability. As designers become more involved in AI procedures such as deployment, training, and
decision-making as the party of developers, they have a responsibility to raise awareness and take action to protect
data privacy [58].
AI-generated content in virtual digital environments presents significant challenges related to authorship, ownership,
and copyright within a complex content ecosystem due to zero-cost reproduction and machine intelligence [174]. For
example, original creations can be utilized by AI models to produce suspected pirated content without informing
the authors. While the industry offers solutions such as uploading creations with copyright-level options, challenges
persist regarding evaluation and feasibility, particularly in specific areas [174]. Specifically, the reuse of creations should
involve informing the author and obtaining permission. The benefits of distribution or authorship should be considered,
especially when users may serve in multiple roles, including potential creators, re-creators, distributors, and regulators.
Therefore, designers with a robust understanding of the field and a commitment to authorship play a crucial role in
establishing specific criteria to regulate the generation approaches of AI-generated architecture.

6 CONCLUSION
In this survey, we explore various approaches deep neural networks utilize to generate virtual architectures au-
tonomously. Our investigation spans three main domains: architectural design, 3D generation techniques, and virtual
environments for designer-AI collaboration. Through an in-depth literature analysis, we focus on four key aspects:
datasets, multimodality, design intuition, and generative frameworks. We aim to bridge the current research gap from
an interdisciplinary perspective.
Our survey highlights the lack of systematic research, particularly concerning the virtual dimensionality of architec-
ture and the collaborative design aspects of human-AI interaction. We advocate for comprehensive and interdisciplinary
research efforts encompassing various themes such as agency, communication, user considerations, tool integration,
and the involvement of diverse stakeholders, including technical developers, HCI researchers, architects, and virtual
environment practitioners.

REFERENCES
[1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas
Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. 2021. A Review of Uncertainty Quantification in Deep Learning:
Techniques, Applications and Challenges. Information Fusion 76 (2021), 243–297. https://doi.org/10.1016/j.inffus.2021.05.008
[2] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. 2018. Learning Representations and Generative Models for 3D
Point Clouds. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80). PMLR,
Stockholm, Sweden, 40–49. https://proceedings.mlr.press/v80/achlioptas18a.html
[3] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni. 2021. Generative Adversarial Network: An Overview of Theory and Applications.
International Journal of Information Management Data Insights 1, 1 (2021), 100004. https://doi.org/10.1016/j.jjimei.2020.100004
[4] Taofeek D. Akinosho, Lukumon O. Oyedele, Muhammad Bilal, Anuoluwapo O. Ajayi, Manuel Davila Delgado, Olugbenga O. Akinade, and Ashraf A.
Ahmed. 2020. Deep learning in the Construction Industry: A Review of Present Status and Future Innovations. Journal of Building Engineering 32
(2020), 101827. https://doi.org/10.1016/j.jobe.2020.101827
[5] Hamed S. Alavi, Elizabeth F. Churchill, Mikael Wiberg, Denis Lalanne, Peter Dalsgaard, Ava Fatah gen Schieck, and Yvonne Rogers. 2019.
Introduction to Human-Building Interaction (HBI): Interfacing HCI with Architecture and Urban Design. ACM Trans. Comput.-Hum. Interact. 26, 2,
Article 6 (mar 2019), 10 pages. https://doi.org/10.1145/3309714

Manuscript submitted to ACM

28 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

[6] Md Zahangir Alom, Tarek M. Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Mahmudul Hasan, Brian C. Van Essen,
Abdul A. S. Awwal, and Vijayan K. Asari. 2019. A State-of-the-Art Survey on Deep Learning Theory and Architectures. Electronics 8, 3 (2019).
https://doi.org/10.3390/electronics8030292
[7] Mohammad Samiul Arshad and William J. Beksi. 2020. A Progressive Conditional Generative Adversarial Network for Generating Dense and
Colored 3D Point Clouds. In 2020 International Conference on 3D Vision (3DV). online, 712–722. https://doi.org/10.1109/3DV50981.2020.00081
[8] Karen El Asmar and Harpreet Sareen. 2020. Machinic Interpolations: A GAN Pipeline for Integrating Lateral Thinking in Computational Tools of
Architecture. In Blucher Design Proceedings. Editora Blucher, Medellín, Colombia, 60–66. https://doi.org/10.5151/sigradi2020-9
[9] Vahid Azizi, Muhammad Usman, Samarth Patel, Davide Schaumann, Honglu Zhou, Petros Faloutsos, and Mubbasir Kapadia. 2020. Floorplan
embedding with latent semantics and human behavior annotations. In Proceedings of the 11th Annual Symposium on Simulation for Architecture
and Urban Design (Virtual Event, Austria) (SimAUD ’20). Society for Computer Simulation International, San Diego, CA, USA, Article 11, 8 pages.
https://doi.org/10.5555/3465085.3465096
[10] Shanaka Kristombu Baduge, Sadeep Thilakarathna, Jude Shalitha Perera, Mehrdad Arashpour, Pejman Sharafi, Bertrand Teodosio, Ankit Shringi,
and Priyan Mendis. 2022. Artificial Intelligence and Smart Vision for Building and Construction 4.0: Machine and Deep Learning Methods and
Applications. Automation in Construction 141 (2022), 104440. https://doi.org/10.1016/j.autcon.2022.104440
[11] Mathias Bank, Viktoria Sandor, Kristina Schinegger, and Stefan Rutzinger. 2022. Learning Spatiality - A GAN method for designing architectural
models through labelled sections. In Proceedings of the 40st Conference on Education and Research in Computer Aided Architectural Design in Europe
(eCAADe). Ghent, Belgium, 611–619. https://doi.org/10.52842/conf.ecaade.2022.2.611
[12] Claudiu Barsan-Pipu, Nathalie Sleiman, Theodor Moldovan, and Kevin Hirth. 2020. Affective Computing for Generating Virtual Procedural
Environments Using Game Technologies. In Proceedings of the 40th Annual Conference of the Association for Computer Aided Design in Architecture
(ACADIA). Online and Global, 120–129. https://doi.org/10.52842/conf.acadia.2020.1.120
[13] Richard Bartle. 2003. Designing Virtual Worlds. New Riders Games. https://doi.org/10.5555/1196681
[14] Richard A. Bartle. 2010. From MUDs to MMORPGs: The History of Virtual Worlds. Springer Netherlands, Dordrecht, 23–39. https://doi.org/10.1007/978-
1-4020-9789-8_2
[15] Alessandro Bava. 2020. Computational Tendencies. https://www.e-flux.com/architecture/intelligence/310405/computational-tendencies/
[16] Heli Ben-Hamu, Haggai Maron, Itay Kezurer, Gal Avineri, and Yaron Lipman. 2018. Multi-chart Generative Surface Modeling. ACM Trans. Graph.
37, 6, Article 215 (dec 2018), 15 pages. https://doi.org/10.1145/3272127.3275052
[17] Isabella Bower, Richard Tucker, and Peter G. Enticott. 2019. Impact of Built Environment Design on Emotion Measured via Neurophysiological
Correlates and Subjective Indicators: A Systematic Review. Journal of Environmental Psychology 66 (2019), 101344. https://doi.org/10.1016/j.jenvp.
2019.101344
[18] Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. 2016. Generative and Discriminative Voxel Modeling with Convolutional Neural
Networks. https://doi.org/10.48550/arXiv.1608.04236 Publication Title: arXiv e-prints ADS Bibcode: 2016arXiv160804236B.
[19] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, and Bharath Hariharan. 2020. Learning Gradient
Fields for Shape Generation. In Computer Vision – ECCV 2020. Springer International Publishing, Cham, 364–381. https://doi.org/10.1007/978-3-
030-58580-8_22
[20] Başak Çakmak. 2022. Extending design cognition with computer vision and generative deep learning. Master’s thesis. Middle East Technical University.
[21] Wenming Cao, Zhiyue Yan, Zhiquan He, and Zhihai He. 2020. A Comprehensive Survey on Geometric Deep Learning. IEEE Access 8 (2020),
35929–35949. https://doi.org/10.1109/ACCESS.2020.2975067
[22] Mario Carpo. 2017. The Second Digital Turn: Design Beyond Intelligence. The MIT Press. https://doi.org/10.2307/j.ctt1w0db6f
[23] Stanislas Chaillou. 2020. ArchiGAN: Artificial Intelligence x Architecture. Springer Nature Singapore, Singapore, 117–127. https://doi.org/10.1007/978-
981-15-6568-7_8
[24] Stanislas Chaillou. 2022. The Advent of Architectural AI: A Historical Perspective. Birkhäuser, Berlin, Boston, 32–61. https://doi.org/10.1515/
9783035624045-005
[25] Gregory J. Chaitin. 1975. Randomness and Methematical Proof. Scientific American 232, 5 (1975), 47–53. http://www.jstor.org/stable/24949798
[26] Gregory J. Chaitin. 1975. A Theory of Program Size Formally Identical to Information Theory. J. ACM 22, 3 (jul 1975), 329–340. https:
//doi.org/10.1145/321892.321894
[27] Eric R. Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. 2021. Pi-GAN: Periodic Implicit Generative Adversarial Networks
for 3D-Aware Image Synthesis. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Nashville, TN, USA,
5795–5805. https://doi.org/10.1109/CVPR46437.2021.00574
[28] Kevin Chen, Christopher B. Choy, Manolis Savva, Angel X. Chang, Thomas Funkhouser, and Silvio Savarese. 2019. Text2Shape: Generating Shapes
from Natural Language by Learning Joint Embeddings. In Computer Vision – ACCV 2018. Springer International Publishing, Perth, Australia,
100–116. https://doi.org/10.1007/978-3-030-20893-6_7
[29] Z. Chen and H. Zhang. 2019. Learning Implicit Fields for Generative Shape Modeling. In 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 5932–5941. https://doi.org/10.1109/CVPR.2019.00609
[30] Lung-Pan Cheng, Eyal Ofek, Christian Holz, and Andrew D. Wilson. 2019. VRoamer: Generating On-The-Fly VR Experiences While Walking
inside Large, Unknown Real-World Building Environments. In 2019 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). Osaka, Japan,
359–366. https://doi.org/10.1109/VR.2019.8798074
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 29

[31] Mollie Claypool. 2019. Discrete automation - architecture - e-flux. https://www.e-flux.com/architecture/becoming-digital/248060/discrete-

automation/
[32] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. 2018. Generative Adversarial Networks:
An Overview. IEEE Signal Processing Magazine 35, 1 (2018), 53–65. https://doi.org/10.1109/MSP.2017.2765202
[33] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion Models in Vision: A Survey. IEEE Transactions on
Pattern Analysis and Machine Intelligence 45, 9 (2023), 10850–10869. https://doi.org/10.1109/TPAMI.2023.3261988
[34] Shaveta Dargan, Munish Kumar, Maruthi Rohit Ayyagari, and Gulshan Kumar. 2020. A Survey of Deep Learning and Its Applications: A New
Paradigm to Machine Learning. Archives of Computational Methods in Engineering 27, 4 (Sept. 2020), 1071–1092. https://doi.org/10.1007/s11831-
019-09344-w
[35] Andrew Davies, Stephan Manning, and Jonas Söderlund. 2018. When neighboring disciplines fail to learn from each other: The case of innovation
and project management research. Research Policy 47, 5 (2018), 965–979. https://doi.org/10.1016/j.respol.2018.03.002
[36] Matias Del Campo. 2021. Architecture, Language and AI-Language, Attentional Generative Adversarial Networks (AttnGAN) and Architecture
Design. In Proceedings of the 26th CAADRIA Conference. Hong Kong, 211–220. https://doi.org/10.52842/conf.caadria.2021.1.211
[37] Matias del Campo. 2022. Deep House - Datasets, Estrangement, and the Problem of the New. Architectural Intelligence 1, 1 (Aug. 2022), 12.
https://doi.org/10.1007/s44223-022-00013-w
[38] Matias del Campo, Sandra Manninger, M Sanche, and L Wang. 2019. The Church of AI—An examination of architecture in a posthuman design
ecology. In Proceedings of the 24th CAADRIA Conference. Wellington, New Zealand, 15–18. https://doi.org/10.52842/conf.caadria.2019.2.767
[39] Matias del Campo, Sandra Manninger, and Yining Yuan. 2022. Generali Center vienna austria. https://caadria2022.org/projects/generali-center-
vienna-austria/
[40] Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. 2021. Voxel R-CNN: Towards High Performance
Voxel-based 3D Object Detection. Proceedings of the AAAI Conference on Artificial Intelligence 35, 2 (May 2021), 1201–1209. https://doi.org/10.1609/
aaai.v35i2.16207
[41] Yu Deng, Jiaolong Yang, and Xin Tong. 2021. Deformed Implicit Field: Modeling 3D Shapes with Learned Dense Correspondence. In 2021 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA, 10281–10291. https://doi.org/10.1109/CVPR46437.2021.01015
[42] Raffaele Di Carlo, Divyae Mittal, and Ondrej Vesely. 2022. Generating 3D Building Volumes for a Given Urban Context using Pix2Pix GAN.
In Proceedings of the 40th Conference on Education and Research in Computer Aided Architectural Design in Europe. Ghent, Belgium, 287–295.
https://doi.org/10.52842/conf.ecaade.2022.2.287
[43] Xinyue Ding, Xiangmin Guo, Tian Tian Lo, and Ke Wang. 2022. The Spatial Environment Affects Human Emotion Perception-Using Physiological
Signal Modes. In Proceedings of the 27th CAADRIA Conference. Sydney, Australia, 425–434. https://doi.org/10.52842/conf.caadria.2022.2.425
[44] Xianglei Ding, Manting Sun, Wenhao Bai, and Xin Li. 2022. Exploration of Architectural Design in the Co-construction Mode of User and Designer
in the Metaverse Environment. In 2022 7th International Conference on Systems, Control and Communications. ACM, Chongqing China, 55–60.
https://doi.org/10.1145/3575828.3575838
[45] Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. NICE: Non-linear Independent Components Estimation. arXiv:1410.8516 [cs.LG]
[46] Nancy Diniz. 2019. Body Architectures - Real time data visualization and responsive immersive environments. In Sousa, JP, Xavier, JP and Castro
Henriques, G (eds.), Architecture in the Age of the 4th Industrial Revolution - Proceedings of the 37th eCAADe and 23rd SIGraDi Conference - Volume 2,
University of Porto, Porto, Portugal, 11-13 September 2019, pp. 739-746. CUMINCAD, Porto, Portugal. https://doi.org/10.52842/conf.ecaade.2019.2.739
[47] John David N. Dionisio, William G. Burns III, and Richard Gilbert. 2013. 3D Virtual worlds and the metaverse: Current status and future possibilities.
ACM Comput. Surv. 45, 3, Article 34 (jul 2013), 38 pages. https://doi.org/10.1145/2480741.2480751
[48] Jiahua Dong, Qingrui Jiang, Anqi Wang, and Yuankai Wang. 2023. Urban Cultural Inheritance: Generative Adversarial Networks (GANs) Assisted
Street Facade Design in Virtual Reality (VR) Environments Based on Hakka Settlements in Hong Kong. In Proceedings of the 28th CAADRIA
Conference. Hong Kong, 473—-482. https://doi.org/10.52842/conf.caadria.2023.1.473
[49] Shi Dong, Ping Wang, and Khushnood Abbas. 2021. A Survey on Deep Learning and Its Applications. Computer Science Review 40 (2021), 100379.
https://doi.org/10.1016/j.cosrev.2021.100379
[50] Sara Eloy, Anette Kreutzberg, and Ioanna Symeonidou. 2021. Virtual Aesthetics in Architecture: Designing in Mixed Realities (1 ed.). Routledge, New
York. https://doi.org/10.4324/9781003183105
[51] Adrian Forty and Adrian Forty. 2000. Words and buildings: A vocabulary of modern architecture. Vol. 268. Thames & Hudson London.
[52] Rao Fu, Xiao Zhan, Yiwen Chen, Daniel Ritchie, and Srinath Sridhar. 2023. ShapeCrafter: A Recursive Text-Conditioned 3D Shape Generation
Model. arXiv:2207.09446 [cs.CV]
[53] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. 2022. GET3D: A
Generative Model of High Quality 3D Textured Shapes Learned from Images. arXiv:2209.11163 [cs.CV]
[54] Lin Gao, Tong Wu, Yu-Jie Yuan, Ming-Xian Lin, Yu-Kun Lai, and Hao Zhang. 2021. TM-NET: Deep Generative Networks for Textured Meshes.
ACM Trans. Graph. 40, 6, Article 263 (dec 2021), 15 pages. https://doi.org/10.1145/3478513.3480503
[55] Lin Gao, Jie Yang, Tong Wu, Yu-Jie Yuan, Hongbo Fu, Yu-Kun Lai, and Hao Zhang. 2019. SDM-NET: Deep Generative Network for Structured
Deformable Mesh. ACM Trans. Graph. 38, 6, Article 243 (nov 2019), 15 pages. https://doi.org/10.1145/3355089.3356488
[56] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William Freeman, and Thomas Funkhouser. 2019. Learning Shape Templates With
Structured Implicit Functions. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, South Korea, 7153–7163. https:
Manuscript submitted to ACM
30 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

//doi.org/10.1109/ICCV.2019.00725
[57] Harshvardhan GM, Mahendra Kumar Gourisaria, Manjusha Pandey, and Siddharth Swarup Rautaray. 2020. A Comprehensive Survey and Analysis
of Generative Models in Machine Learning. Computer Science Review 38 (2020), 100285. https://doi.org/10.1016/j.cosrev.2020.100285
[58] Abenezer Golda, Kidus Mekonen, Amit Pandey, Anushka Singh, Vikas Hassija, Vinay Chamola, and Biplab Sikdar. 2024. Privacy and Security
Concerns in Generative AI: A Comprehensive Survey. IEEE Access 12 (2024), 48126–48144. https://doi.org/10.1109/ACCESS.2024.3381611
[59] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. 2021. StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image
Synthesis. arXiv:2110.08985 [cs.CV]
[60] George Guida. 2023. Multimodal Architecture: Applications of Language in a Machine Learning Aided Design Process. In Proceedings of the 28th
CAADRIA Conference. Ahmedabad, India, 561–570. https://doi.org/10.52842/conf.caadria.2023.2.561
[61] George Guida, Daniel Escobar, and Carlos Navarro. 2023. 3D Neural Synthesis: Gaining Control with Neural Radiance Fields. In Proceedings of the
43rd Annual Conference for the Association for Computer Aided Design in Architecture (ACADIA). https://papers.cumincad.org/cgi-bin/works/paper/
acadia23_v2_420
[62] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. Improved Training of Wasserstein GANs. In
Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates
Inc., Red Hook, NY, USA, 5769–5779. https://doi.org/10.5555/3295222.3295327
[63] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. 2021. Deep Learning for 3D Point Clouds: A Survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence 43, 12 (2021), 4338–4364. https://doi.org/10.1109/TPAMI.2020.3005434
[64] Zhe Guo, Ce Li, and Yifan Zhou. 2021. The method of responsive shape design based on real-time interaction process. In Proceedings of the 26th
CAADRIA Conference. Hong Kong, 345–354. https://doi.org/10.52842/conf.caadria.2021.2.345
[65] Zekun Hao, Hadar Averbuch-Elor, Noah Snavely, and Serge Belongie. 2020. DualSDF: Semantic Shape Manipulation Using a Two-Level
Representation. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, 7628–7638. https:
//doi.org/10.1109/CVPR42600.2020.00765
[66] William Grant Hatcher and Wei Yu. 2018. A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends. IEEE Access 6
(2018), 24411–24432. https://doi.org/10.1109/ACCESS.2018.2830661
[67] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. 2019. Escaping Plato’s Cave: 3D Shape From Adversarial Rendering. In 2019 IEEE/CVF International
Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 9983–9992. https://doi.org/10.1109/ICCV.2019.01008
[68] Arsalan Heydarian, Evangelos Pantazis, David Gerber, and Burcin Becerik-Gerber. 2015. Use of Immersive Virtual Environments to Understand
Human-Building Interactions and Improve Building Design. In HCI International 2015 - Posters’ Extended Abstracts. Springer International Publishing,
Cham, 180–184. https://doi.org/10.1007/978-3-319-21380-4_32
[69] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. In Proceedings of the 34th International Conference
on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 574, 12 pages.
https://doi.org/10.5555/3495724.3496298
[70] Mitra Homolja, Sayyed Amir Hossain Maghool, and Marc Aurel Schnabel. 2020. The Impact of Moving through the Built Environment on Emotional
and Neurophysiological State - A Systematic Literature Review. In Proceedings of the 25th CAADRIA Conference. Bangkok, Thailand, 641–650.
https://doi.org/10.52842/conf.caadria.2020.1.641
[71] Tianzhen Hong, Zhe Wang, Xuan Luo, and Wanni Zhang. 2020. State-Of-The-Art on Research and Applications of Machine Learning in the
Building Life Cycle. Energy and Buildings 212 (2020), 109831. https://doi.org/10.1016/j.enbuild.2020.109831
[72] Jeffrey Huang, Mikhael Johanes, Frederick Chando Kim, Christina Doumpioti, and Holz Georg-Christoph. 2021. On GANs, NLP and Architecture:
Combining Human and Machine Intelligences for the Generation and Evaluation of Meaningful Designs. Technology|Architecture + Design 5, 2
(2021), 207–224. https://doi.org/10.1080/24751448.2021.1967060 Publisher: Routledge _eprint: https://doi.org/10.1080/24751448.2021.1967060.
[73] Sheng-Yang Huang, Yuankai Wang, and Qingrui Drolma Jiang. 2023. (In)Visible Cities: Exploring generative artificial intelligence’s creativity
through the analysis of a conscious journey in latent space.
[74] Le Hui, Rui Xu, Jin Xie, Jianjun Qian, and Jian Yang. 2020. Progressive Point Cloud Deconvolution Generation Network. In Computer Vision –
ECCV 2020. Springer International Publishing, Cham, 397–413. https://doi.org/10.1007/978-3-030-58555-6_24
[75] Moritz Ibing, Isaak Lim, and Leif Kobbelt. 2021. 3D Shape Generation with Grid-based Implicit Functions. In 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). Nashville, TN, USA, 13554–13563. https://doi.org/10.1109/CVPR46437.2021.01335
[76] Abdul Jabbar, Xi Li, and Bourahla Omar. 2021. A Survey on Generative Adversarial Networks: Variants, Applications, and Training. ACM Comput.
Surv. 54, 8, Article 157 (oct 2021), 49 pages. https://doi.org/10.1145/3463475
[77] Tansin Jahan, Yanran Guan, and Oliver van Kaick. 2021. Semantics-Guided Latent Space Exploration for Shape Generation. Computer Graphics
Forum 40, 2 (2021), 115–126. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.142619
[78] Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. 2022. Zero-Shot Text-Guided Object Generation with Dream Fields. In
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans, Louisiana, USA, 857–866. https://doi.org/10.1109/
CVPR52688.2022.00094
[79] Jean Jaminet, Gabriel Esquivel, and Shane Bugni. 2022. Serlio and Artificial Intelligence: Problematizing the Image-to-Object Workflow. In
Proceedings of the 2021 DigitalFUTURES. Singapore, 3–12. https://doi.org/10.1007/978-981-16-5983-6_1

Manuscript submitted to ACM

A Survey on Deep Learning for Design and Generation of Virtual Architecture 31

[80] Sun-Young Jang and Sung-Ah Kim. 2024. Automatic generation of virtual architecture using user activities in metaverse. International Journal of
Human-Computer Studies 182 (2024), 103163. https://doi.org/10.1016/j.ijhcs.2023.103163
[81] David H. Jonassen and Lucia Rohrer-Murphy. 1999. Activity Theory as a Framework for Designing Constructivist Learning Environments.
Educational Technology Research and Development 47, 1 (March 1999), 61–79. https://doi.org/10.1007/BF02299477
[82] Damjan Jovanovic. 2022. Games and Worldmaking. https://journal.b-pro.org/article/p3-games-and-worldmaking/
[83] Ridvan Kahraman, Christoph Zechmeister, Zhetao Dong, Ozgur S. Oguz, Kurt Drachenberg, Achim Menges, and Katja Rinderspacher. 2021.
Augmenting Design. In Proceedings of the 41th Annual Conference of the Association for Computer Aided Design in Architecture (ACADIA). Online
and Global, 112–121. https://doi.org/10.52842/conf.acadia.2021.112
[84] T. Karras, S. Laine, and T. Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 4396–4405. https://doi.org/10.1109/CVPR.2019.
00453
[85] Aysegul Akcay Kavakoglu. 2021. Computational Aesthetics of Low Poly: [Re]Configuration of Form. In Blucher Design Proceedings. online, 17–28.
https://doi.org/10.5151/sigradi2021-235
[86] Julian Keil, Dennis Edler, Thomas Schmitt, and Frank Dickmann. 2021. Creating Immersive Virtual Environments Based on Open Geospatial Data
and Game Engines. KN - Journal of Cartography and Geographic Information 71, 1 (March 2021), 53–65. https://doi.org/10.1007/s42489-020-00069-6
[87] Asifullah Khan et al. 2020. A survey of the recent architectures of deep convolutional neural networks. Artificial intelligence review 53, 8 (dec 2020),
5455–5516. https://doi.org/10.1007/s10462-020-09825-6
[88] Dongyun Kim, George Guida, and Jose Luis García Del Castillo Y López. 2022. PlacemakingAI : Participatory Urban Design with Generative
Adversarial Networks. In Proceedings of the 27th CAADRIA Conference. Sydney, Australia, 485–494. https://doi.org/10.52842/conf.caadria.2022.2.485
[89] Frederick Chando Kim and Jeffrey Huang. 2022. Perspectival GAN - Architectural form-making through dimensional transformation. In Proceedings
of the 40st Conference on Education and Research in Computer Aided Architectural Design in Europe (eCAADe). Ghent, Belgium, 341–350. https:
//doi.org/10.52842/conf.ecaade.2022.1.341
[90] Suzi Kim, Dodam Kim, and Sunghee Choi. 2020. CityCraft: 3D Virtual City Creation from a Single Image. The Visual Computer 36, 5 (May 2020),
911–924. https://doi.org/10.1007/s00371-019-01701-x
[91] Diederik P. Kingma and Max Welling. 2019. An Introduction to Variational Autoencoders. Foundations and Trends® in Machine Learning 12, 4
(2019), 307–392. https://doi.org/10.1561/2200000056
[92] Marian Kleineberg, Matthias Fey, and Frank Weichert. 2020. Adversarial Generation of Continuous Implicit Shape Representations.
arXiv:2002.00349 [cs.CV]
[93] Immanuel Koh. 2022. 3D-Gan-Housing (neural sampling series). https://caadria2022.org/projects/3d-gan-housing-neural-sampling-series/
[94] Immanuel Koh. 2023. AI-Bewitched Architecture of Hansel and Gretel: Food-to-Architecture in 2D & 3D with GANs and Diffusion Models.
Ahmedabad, 9–18. https://doi.org/10.52842/conf.caadria.2023.1.009
[95] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. 2023. Machine Learning Operations (MLOps): Overview, Definition, and Architecture.
IEEE Access 11 (2023), 31866–31879. https://doi.org/10.1109/ACCESS.2023.3262138
[96] Krystian Kwiecinski, Jacek Markusiewicz, and Agata Pasternak. 2017. Participatory Design Supported with Design System and Augmented Reality.
In Proceedings of the 35th eCAADe Conference. Rome, Italy, 745–754. https://doi.org/10.52842/conf.ecaade.2017.2.745
[97] Tencent AI Lab. 2023. AI Enhanced Procedural City Generation. https://gdcvault.com/play/1028921/Recorded-AI-Enhanced-Procedural-City
[98] Lik-Hang Lee, Tristan Braud, Simo Hosio, and Pan Hui. 2021. Towards Augmented Reality Driven Human-City Interaction: Current Research on
Mobile Headsets and Future Challenges. ACM Comput. Surv. 54, 8, Article 165 (oct 2021), 38 pages. https://doi.org/10.1145/3467963
[99] Lik-Hang Lee, Tristan Braud, Pengyuan Zhou, Lin Wang, Dianlei Xu, Zijun Lin, Abhishek Kumar, Carlos Bermejo, and Pan Hui. 2021. All One Needs
to Know about Metaverse: A Complete Survey on Technological Singularity, Virtual Ecosystem, and Research Agenda. arXiv:2110.05352 [cs.CY]
[100] Lawrence Lessig. 1999. Code and Other Laws of Cyberspace. Basic Books, Inc., USA.
[101] Jun Li, Chengjie Niu, and Kai Xu. 2020. Learning Part Generation and Assembly for Structure-Aware Shape Synthesis. Proceedings of the AAAI
Conference on Artificial Intelligence 34, 07 (Apr. 2020), 11362–11369. https://doi.org/10.1609/aaai.v34i07.6798
[102] Chaohe Lin and Tian Tian Lo. 2021. Expanding the Methods of Human-VR Interaction (HVRI) for Architectural Design Process. In Proceedings of
the 26th CAADRIA Conference. Hong Kong, 173–182. https://doi.org/10.52842/conf.caadria.2021.2.173
[103] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin.
2023. Magic3D: High-Resolution Text-to-3D Content Creation. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Vancouver, Canada, 300–309.
[104] Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. 2018. Deformable Shape Completion with Graph Convolutional Autoencoders.
In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA, 1886–1895. https://doi.org/10.1109/CVPR.2018.
00202
[105] Chuan Liu, Jiaqi Shen, Yue Ren, and Hao Zheng. 2021. Pipes of AI – Machine Learning Assisted 3D Modeling Design. In Proceedings of the 2020
DigitalFUTURES. Springer Singapore, Singapore, 17–26. https://doi.org/10.1007/978-981-33-4400-6_2
[106] Henan Liu, Longtai Liao, and Akshay Srivastava. 2019. An Anonymous Ccomposition: Design Optimization Through Machine Learning Algorithm.
In Proceedings of the 39th Annual Conference of the Association for Computer Aided Design in Architecture (ACADIA). Austin (Texas), USA, 404–411.
https://doi.org/10.52842/conf.acadia.2019.404
Manuscript submitted to ACM
32 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

[107] Jerry Liu, Fisher Yu, and Thomas A. Funkhouser. 2017. Interactive 3D Modeling with a Generative Adversarial Network. 2017 International
Conference on 3D Vision (3DV) (2017), 126–134. https://api.semanticscholar.org/CorpusID:1360152
[108] Yubo Liu, Han Li, Qiaoming Deng, and Kai Hu. 2024. Diffusion Probabilistic Model Assisted 3D Form Finding and Design Latent Space Exploration:
A Case Study for Taihu Stone Spacial Transformation. In Phygital Intelligence, Chao Yan, Hua Chai, Tongyue Sun, and Philip F. Yuan (Eds.). Springer
Nature, Singapore, 11–23. https://doi.org/10.1007/978-981-99-8405-3_2
[109] Zhengzhe Liu, Yi Wang, Xiaojuan Qi, and Chi-Wing Fu. 2022. Towards Implicit Text-Guided 3D Shape Generation. In 2022 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). New Orleans, Louisiana, USA, 17875–17885. https://doi.org/10.1109/CVPR52688.2022.01737
[110] William E. Lorensen and Harvey E. Cline. 1987. Marching Cubes: A High Resolution 3D Surface Construction Algorithm. SIGGRAPH Comput.
Graph. 21, 4 (aug 1987), 163–169. https://doi.org/10.1145/37402.37422
[111] Sebastian Lunz, Yingzhen Li, Andrew Fitzgibbon, and Nate Kushman. 2020. Inverse Graphics GAN: Learning to Generate 3D Shapes from
Unstructured 2D Data. arXiv:2002.12674 [cs.CV]
[112] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. 2019. Occupancy Networks: Learning 3D Recon-
struction in Function Space. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos,
CA, USA, 4455–4465. https://doi.org/10.1109/CVPR.2019.00459
[113] Mateusz Michalkiewicz, Jhony K. Pontes, Dominic Jack, Mahsa Baktashmotlagh, and Anders Eriksson. 2019. Deep Level Sets: Implicit Surface
Representations for 3D Shape Inference. arXiv:1901.06802 [cs.CV]
[114] Jaime De Miguel, Maria Eugenia Villafañe, Luka Piškorec, and Fernando Sancho-Caparrini. 2019. Deep Form Finding Using Variational Autoencoders
for deep form finding of structural typologies. In Blucher Design Proceedings. Editora Blucher, Porto, Portugal, 71–80. https://doi.org/10.5151/
proceedings-caadesigradi2019_514
[115] Jaime De Miguel, Maria Eugenia Villafañe, Luka Piškorec, and Fernando Sancho Caparrini. 2020. Generation of Geometric Interpolations of
Building types with Deep Variational Autoencoders. Design Science 6 (2020), e34. https://doi.org/10.1017/dsj.2020.31
[116] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2021. NeRF: representing scenes as
neural radiance fields for view synthesis. Commun. ACM 65, 1 (dec 2021), 99–106. https://doi.org/10.1145/3503250
[117] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, and Shubham Tulsiani. 2022. AutoSDF: Shape Priors for 3D Completion, Reconstruction and
Generation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA,
306–315. https://doi.org/10.1109/CVPR52688.2022.00040
[118] Ali Mohammad, Christopher Beorkrem, and Jefferson Ellinger. 2019. Hybrid Elevations using GAN Networks. In Proceedings of the 39th Annual
Conference of the Association for Computer Aided Design in Architecture (ACADIA). Austin (Texas), USA, 370–379. https://doi.org/10.52842/conf.
acadia.2019.370
[119] Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2021. A Multidisciplinary Survey and Framework for Design and Evaluation of Explainable AI
Systems. ACM Trans. Interact. Intell. Syst. 11, 3–4, Article 24 (sep 2021), 45 pages. https://doi.org/10.1145/3387166
[120] Lisa-Marie Mueller. 2023. 3D Generative Adversarial Networks to Autonomously Generate Building Geometry. Master’s thesis. TU Delft. https:
//repository.tudelft.nl/islandora/object/uuid:b4a44f69-e3a4-4a29-85c4-77a7e899b81a?collection=education
[121] Joschka Mütterlein. 2018. The Three Pillars of Virtual Reality? Investigating the Roles of Immersion, Presence, and Interactivity. In Hawaii
International Conference on System Sciences. Hawaii. https://doi.org/10.24251/HICSS.2018.174
[122] Taro Narahara. 2022. Kurashiki Viewer: Qualitative Evaluations of Architectural Spaces inside Virtual Reality. In Proceedings of the 27th CAADRIA
Conference. Sydney, Australia, 11–18. https://doi.org/10.52842/conf.caadria.2022.1.011
[123] Charlie Nash, Yaroslav Ganin, S. M. Ali Eslami, and Peter Battaglia. 2020. PolyGen: An Autoregressive Generative Model of 3D Meshes. In
Proceedings of the 37th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 119). PMLR, online, 7220–7229.
https://proceedings.mlr.press/v119/nash20a.html
[124] Alina Nazmeeva. 2019. Constructing the virtual as a social form. Ph. D. Dissertation. Massachusetts Institute of Technology.
[125] Julien Nembrini and Denis Lalanne. 2017. Human-Building Interaction: When the Machine Becomes a Building. In Human-Computer Interaction -
INTERACT 2017. Springer International Publishing, Cham, 348–369. https://doi.org/10.1007/978-3-319-67684-5_21
[126] Kim J. L. Nevelsteen. 2018. Virtual World, Defined from a Technological Oerspective and Applied to Video Games, Mixed Reality, and the Metaverse.
Computer Animation and Virtual Worlds 29, 1 (2018), e1752. https://doi.org/10.1002/cav.1752
[127] David Newton. 2019. Generative Deep Learning in Architectural Design. Technology|Architecture + Design 3, 2 (2019), 176–189.
arXiv:https://doi.org/10.1080/24751448.2019.1640536
[128] Binh Vinh Duc Nguyen, Chengzhi Peng, and Tsung-Hsien Wang. 2019. KOALA - Developing a Generative House Design System with Agent-Based
Modelling of Social Spatial Processes. In Proceedings of the 24th CAADRIA Conference. Wellington, New Zealand, 235–244. https://doi.org/10.52842/
conf.caadria.2019.1.235
[129] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, and Yongliang Yang. 2019. HoloGAN: Unsupervised Learning of 3D Representations
From Natural Images. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA,
7587–7596. https://doi.org/10.1109/ICCV.2019.00768
[130] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. 2022. Point-E: A System for Generating 3D Point Clouds from
Complex Prompts. arXiv:2212.08751 [cs.CV]

Manuscript submitted to ACM

A Survey on Deep Learning for Design and Generation of Virtual Architecture 33

[131] Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the 39th International Conference
on Machine Learning (Proceedings of Machine Learning Research, Vol. 162). PMLR, Maryland, USA, 16784–16804. https://proceedings.mlr.press/
v162/nichol22a.html
[132] Michael Niemeyer and Andreas Geiger. 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In 2021
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 11448–11459. https:
//doi.org/10.1109/CVPR46437.2021.01129
[133] Erik Nijkamp, Mitch Hill, Tian Han, Song-Chun Zhu, and Ying Nian Wu. 2020. On the Anatomy of MCMC-Based Maximum Likelihood Learning
of Energy-Based Models. Proceedings of the AAAI Conference on Artificial Intelligence 34, 04 (Apr. 2020), 5272–5280. https://doi.org/10.1609/aaai.
v34i04.5973
[134] Ondrej Vesel‘y. 2022. Building massing generation using GAN trained on Dutch 3D city models. Master’s thesis. TU Delft. https://repository.tudelft.
nl/islandora/object/uuid:27085fd4-654a-4748-92d0-61563fe6040c?collection=education
[135] OpenAI. 2022. CHATGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/
[136] Roy OrEl, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. 2022. StyleSDF: High-Resolution 3D-
Consistent Image and Geometry Generation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer
Society, Los Alamitos, CA, USA, 13493–13503. https://doi.org/10.1109/CVPR52688.2022.01314
[137] Achraf Oussidi and Azeddine Elhassouny. 2018. Deep generative models: Survey. In 2018 International Conference on Intelligent Systems and
Computer Vision (ISCV). Fez, Morocco, 1–8. https://doi.org/10.1109/ISACV.2018.8354080
[138] Güvenç Özel. 2020. Interdisciplinary AI: A Machine Learning System for Streamlining External Aesthetic and Cultural Influences in Architecture.
Singapore, 103–116. https://doi.org/10.1007/978-981-15-6568-7_7
[139] Matthew J. Page, Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer, Jennifer M.
Tetzlaff, Elie A. Akl, Sue E. Brennan, Roger Chou, Julie Glanville, Jeremy M. Grimshaw, Asbjørn Hróbjartsson, Manoj M. Lalu, Tianjing Li,
Elizabeth W. Loder, Evan Mayo-Wilson, Steve McDonald, Luke A. McGuinness, Lesley A. Stewart, James Thomas, Andrea C. Tricco, Vivian A.
Welch, Penny Whiting, and David Moher. 2021. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. International
Journal of Surgery 88 (2021), 105906. https://doi.org/10.1016/j.ijsu.2021.105906
[140] M. R. Pavan Kumar and Prabhu Jayagopal. 2021. Generative adversarial networks: a survey on applications and challenges. International Journal of
Multimedia Information Retrieval 10, 1 (March 2021), 1–24. https://doi.org/10.1007/s13735-020-00196-w
[141] Wanyu Pei, Xiangmin Guo, and TianTian Lo. 2021. Detecting Virtual Perception Based on Multi-Dimensional Biofeedback - A Method to
Pre-Evaluate Architectural Design Objectives. In Proceedings of the 26th CAADRIA Conference. Hong Kong, 183–192. https://doi.org/10.52842/conf.
caadria.2021.2.183
[142] Wanyu Pei, TianTian Lo, and Xiangmin Guo. 2020. A Biofeedback Process: Detecting Architectural Space with the Integration of Emotion
Recognition and Eye-tracking Technology. In Proceedings of the 25th CAADRIA Conference. Bangkok, Thailand, 263–272. https://doi.org/10.52842/
conf.caadria.2020.2.263
[143] Drew D. Penney and Lizhong Chen. 2019. A Survey of Machine Learning Applied to Computer Architecture Design. arXiv:1909.12373 [cs.AR]
[144] Ravi Peters, Balázs Dukai, Stelios Vitalis, Jordi van Liempt, and Jantien Stoter. 2022. Automated 3D reconstruction of LoD2 and LoD1 models for all
10 million buildings of the Netherlands. , 165–170 pages. https://doi.org/10.14358/PERS.21-00032R2
[145] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. 2022. DreamFusion: Text-to-3D using 2D Diffusion. arXiv:2209.14988 [cs.CV]
[146] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and S. S. Iyengar.
2018. A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv. 51, 5, Article 92 (sep 2018), 36 pages.
https://doi.org/10.1145/3234150
[147] Guocheng Qian, Jinjie Mai, Abdullah Hamdi, Jian Ren, Aliaksandr Siarohin, Bing Li, Hsin-Ying Lee, Ivan Skorokhodov, Peter Wonka, Sergey
Tulyakov, and Bernard Ghanem. 2023. Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors.
arXiv:2306.17843 [cs.CV]
[148] Sameera Ramasinghe, Salman Khan, Nick Barnes, and Stephen Gould. 2020. Spectral-GANs for High-Resolution 3D Point-cloud Generation. In
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (Las Vegas, NV, USA). IEEE Press, Las Vegas, NV, USA, 8169–8176.
https://doi.org/10.1109/IROS45743.2020.9341265
[149] Lyle Regenwetter, Amin Heyrani Nobari, and Faez Ahmed. 2022. Deep Generative Models in Engineering Design: A Review. Journal of Mechanical
Design 144, 7 (03 2022), 071704. https://doi.org/10.1115/1.4053859
[150] Yue Ren and Hao Zheng. 2020. The Spire of AI - Voxel-based 3D Neural Style Transfer. In Proceedings of the 25th CAADRIA Conference. Bangkok,
Thailand, 619–628. https://doi.org/10.52842/conf.caadria.2020.2.619
[151] Radu Bogdan Rusu and Steve B. Cousins. 2011. 3D is here: Point Cloud Library (PCL). 2011 IEEE International Conference on Robotics and Automation
(2011), 1–4.
[152] Waddah Saeed and Christian Omlin. 2023. Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities.
Knowledge-Based Systems 263 (2023), 110273. https://doi.org/10.1016/j.knosys.2023.110273
[153] Aditya Sanghi, Hang Chu, Joseph G. Lambourne, Ye Wang, Chin-Yi Cheng, Marco Fumero, and Kamal Rahimi Malekshan. 2022. CLIP-Forge:
Towards Zero-Shot Text-to-Shape Generation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New Orleans,
Manuscript submitted to ACM
34 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Louisiana, USA, 18582–18592. https://doi.org/10.1109/CVPR52688.2022.01805

[154] Victor Sardenberg and Mirco Becker Theron Burger. 2019. Aesthetic Quantification as Search Criteria in Architectural Design. In Blucher Design
Proceedings. Editora Blucher, Porto, Portugal, 17–24. https://doi.org/10.5151/proceedings-ecaadesigradi2019_088
[155] Iqbal H. Sarker. 2021. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions. SN Computer
Science 2, 6 (Aug. 2021), 420. https://doi.org/10.1007/s42979-021-00815-1
[156] Iqbal H. Sarker. 2021. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Computer Science 2, 3 (March 2021),
160. https://doi.org/10.1007/s42979-021-00592-x
[157] Ralph Schroeder, Avon Huxor, and Andy Smith. 2001. Activeworlds: geography and social interaction in virtual reality. Futures 33, 7 (2001),
569–587. https://doi.org/10.1016/S0016-3287(01)00002-7
[158] Adam Sebestyen, Johanna Rock, and Urs Leonhard Hirschberg. 2021. Towards Abductive Reasoning-Based Computational Design Tools - Using
Machine Learning as a way to explore the combined design spaces of multiple parametric models. In Proceedings of the 39th eCAADe Conference.
Novi Sad, Serbia, 141–150. https://doi.org/10.52842/conf.ecaade.2021.1.141
[159] Adam Sebestyen, Ozan Özdenizci, Urs Hirschberg, and Robert Legenstein. 2023. Generating Conceptual Architectural 3D Geometries with
Denoising Diffusion Models. In Proceedings of the 41st Conference on Education and Research in Computer Aided Architectural Design in Europe
(eCAADe 2023). Graz Austria, 451–460. https://doi.org/10.52842/conf.ecaade.2023.2.451
[160] Pratheba Selvaraju, Mohamed Nabail, Marios Loizou, Maria Maslioukova, Melinos Averkiou, Andreas Andreou, Siddhartha Chaudhuri, and
Evangelos Kalogerakis. 2021. BuildingNet: Learning to Label 3D Buildings. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
online, 10377–10387. https://doi.org/10.1109/ICCV48922.2021.01023
[161] Liam Jordan Sheehan, Andre G.P. Brown, Marc Aurel Schnabel, and Tane Jacob Moleta. 2021. The Fourth Virtual Dimension - Stimulating
the Human Senses to Create Virtual Atmospheric Qualities. In Proceedings of the 26th CAADRIA Conference. Hong Kong, 213–222. https:
//doi.org/10.52842/conf.caadria.2021.2.213
[162] Tianchang Shen, Jun Gao, Kangxue Yin, Ming-Yu Liu, and Sanja Fidler. 2021. Deep Marching Tetrahedra: a Hybrid Representation for High-
Resolution 3D Shape Synthesis. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., online, 6087–6101. https:
//proceedings.neurips.cc/paper_files/paper/2021/file/30a237d18c50f563cba4531f1db44acf-Paper.pdf
[163] Yang Shi, Tian Gao, Xiaohan Jiao, and Nan Cao. 2023. Understanding Design Collaboration Between Designers and Artificial Intelligence: A
Systematic Literature Review. Proc. ACM Hum.-Comput. Interact. 7, CSCW2, Article 368 (oct 2023), 35 pages. https://doi.org/10.1145/3610217
[164] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. 2024. MVDream: Multi-view Diffusion for 3D Generation.
arXiv:2308.16512 [cs.CV]
[165] Zifan Shi, Sida Peng, Yinghao Xu, Andreas Geiger, Yiyi Liao, and Yujun Shen. 2023. Deep Generative Models on 3D Representations: A Survey.
arXiv:2210.15663 [cs.CV]
[166] Dongwook Shu, Woo Sung Park, and Junseok Kwon. 2019. 3D Point Cloud Generative Adversarial Network Based on Tree Structured Graph
Convolutions. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Los Alamitos, CA, USA, 3858–3867.
https://doi.org/10.1109/ICCV.2019.00396
[167] Jesper Simonsen, Helena Karasti, and Morten Hertzum. 2020. Infrastructuring and Participatory Design: Exploring Infrastructural Inversion as
Analytic, Empirical and Generative. Comput. Supported Coop. Work 29, 1–2 (apr 2020), 115–151. https://doi.org/10.1007/s10606-019-09365-w
[168] Kyle Steinfeld, Katherine Par, Adam Menges, and Samantha Walker. 2019. Fresh Eyes: A Framework for the Application of Machine Learning to
Generative Architectural Design, and a Report of Activities at Smartgeometry 2018. In Computer-Aided Architectural Design." Hello, Culture" 18th
International Conf., CAAD Futures 2019, Daejeon, S. Korea, June 26–28, 2019. Springer, Daejeon, South Korea, 32–46. https://doi.org/10.1007/978-
981-13-8410-3_3
[169] Todor Stojanovski et al. 2022. Rethinking Computer-Aided Architectural Design (CAAD) – From Generative Algorithms and Architectural
Intelligence to Environmental Design and Ambient Intelligence. In Computer-Aided Architectural Design. Design Imperatives: The Future is Now.
Springer Singapore, Singapore, 62–83. https://doi.org/10.1007/978-981-19-1280-1_5
[170] Robert Stuart-Smith et al. 2022. Visual Character Analysis within Algorithmic Design: Quantifying Aesthetics Relative to Structural and Geometric
Design Criteria. In Proceedings of the 27th CAADRIA Conference. Sydney, Australia, 131–140. https://doi.org/10.52842/conf.caadria.2022.1.131
[171] Martin Tamke, Paul Nicholas, and Mateusz Zwierzycki. 2018. Machine learning for architectural design: Practices and infrastructure. International
Journal of Architectural Computing 16, 2 (2018), 123–143. arXiv:https://doi.org/10.1177/1478077118778580
[172] Christian Tonn. 2017. Designing Colour in Virtual Reality - Comparing a Virtual Reality based and a Screen based Colour Design Method. In
Proceedings of the 35th eCAADe Conference. Rome, Italy. https://doi.org/10.52842/conf.ecaade.2017.2.721
[173] Maria E. Tosello. 2003. Performing Cyberspace: Dance, Technology and Virtual Architecture. International Journal of Architectural Computing 1, 3
(2003), 393–413. arXiv:https://doi.org/10.1260/147807703322987129
[174] Bhuman Vyas. 2022. Ethical Implications of Generative AI in Art and the Media. International Journal for Multidisciplinary Research (IJFMR), E-ISSN
4, 4 (2022), 2582–2160. https://doi.org/10.36948/ijfmr.2022.v04i04.9392
[175] Can Wang et al. 2022. CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields. In 2022 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR). New Orleans, Louisiana, USA, 3825–3834. https://doi.org/10.1109/CVPR52688.2022.00381
[176] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. ProlificDreamer: High-Fidelity and Diverse Text-to-3D
Generation with Variational Score Distillation. arXiv:2305.16213 [cs.CV]
Manuscript submitted to ACM
A Survey on Deep Learning for Design and Generation of Virtual Architecture 35

[177] Zhengwei Wang, Qi She, and Tomás E. Ward. 2021. Generative Adversarial Networks in Computer Vision: A Survey and Taxonomy. ACM Comput.
Surv. 54, 2, Article 37 (feb 2021), 38 pages. https://doi.org/10.1145/3439723
[178] Cameron Wells, Marc Aurel Schnabel, Tane Jacob Moleta, and Andre G.P. Brown. 2021. Beauty is in the Eye of the Beholder - Improving the
Human-Computer Interface within VRAD by the Active and Two-Way Employment of Our Visual Senses. In Proceedings of the 26th CAADRIA
Conference. Hong Kong, 355–364. https://doi.org/10.52842/conf.caadria.2021.2.355
[179] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T. Freeman, and Joshua B. Tenenbaum. 2016. Learning a Probabilistic Latent Space of Object
Shapes via 3D Generative-Adversarial Modeling. In Proceedings of the 30th International Conference on Neural Information Processing Systems
(Barcelona, Spain) (NIPS’16). Curran Associates Inc., Red Hook, NY, USA, 82–90. https://doi.org/10.5555/3157096.3157106
[180] Rundi Wu, Yixin Zhuang, Kai Xu, Hao Zhang, and Baoquan Chen. 2020. PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes. In 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA, 826–835. https://doi.org/10.1109/CVPR42600.2020.00091
[181] Zhijie Wu, Xiang Wang, Di Lin, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. 2019. SAGNet: Structure-Aware Generative Network for
3D-Shape Modeling. ACM Trans. Graph. 38, 4, Article 91 (jul 2019), 14 pages. https://doi.org/10.1145/3306346.3322956
[182] Weihao Xia and Jing-Hao Xue. 2023. A Survey on Deep Generative 3D-aware Image Synthesis. ACM Comput. Surv. 56, 4, Article 90 (nov 2023),
34 pages. https://doi.org/10.1145/3626193
[183] Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. 2019. PointFlow: 3D Point Cloud Generation With
Continuous Normalizing Flows. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE Computer Society, Seoul, South Korea,
4540–4549. https://doi.org/10.1109/ICCV.2019.00464
[184] Qian Yang, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. 2018. Grounding interactive machine learning tool design in how non-experts actually
build models. In Proceedings of the 2018 Designing Interactive Systems Conference. 573–584. https://doi.org/doi/10.1145/3196709.3196729
[185] Zhihang Yao et al. 2018. 3DCityDB - a 3D Geodatabase Solution for the Management, Analysis, and Visualization of Semantic 3D City Models
based on CityGML. Open Geospatial Data, Software and Standards 3, 1 (May 2018), 5. https://doi.org/10.1186/s40965-018-0046-7
[186] De Yu. 2020. Reprogramming Urban Block by Machine Creativity - How to use neural networks as generative tools to design space. In Proceedings
of the 38st Conference on Education and Research in Computer Aided Architectural Design in Europe (eCAADe). Berlin, Germany, 249–258. https:
//doi.org/10.52842/conf.ecaade.2020.1.249
[187] Wangbo Yu et al. 2024. HiFi-123: Towards High-fidelity One Image to 3D Content Generation. arXiv:2310.06744 [cs.CV]
[188] Anny Yuniarti and Nanik Suciati. 2019. A Review of Deep Learning Techniques for 3D Reconstruction of 2D Images. In 2019 12th International
Conference on Information & Communication Technology and System (ICTS). Surabaya, Indonesia, 327–331. https://doi.org/10.1109/ICTS.2019.8850991
[189] Maciej Zamorski, Maciej Ziundefinedba, Piotr Klukowski, Rafał Nowak, Karol Kurach, Wojciech Stokowiec, and Tomasz Trzciński. 2020. Adversarial
Autoencoders for Compact Representations of 3D Point Clouds. Comput. Vis. Image Underst. 193, C (apr 2020), 8 pages. https://doi.org/10.1016/j.
cviu.2020.102921
[190] Maryam Zarei et al. 2021. Design and Development of Interactive Systems for Integration of Comparative Visual Analytics in Design Workflow. In
Proceedings of the 26th CAADRIA Conference. Hong Kong, 121–130. https://doi.org/10.52842/conf.caadria.2021.2.121
[191] Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, and Karsten Kreis. 2022. LION: Latent Point Diffusion Models
for 3D Shape Generation. arXiv:2210.06978 [cs.CV]
[192] Hang Zhang. 2019. 3D Model Generation on Architectural Plan and Section Training through Machine Learning. Technologies 7, 4 (2019).
https://doi.org/10.3390/technologies7040082
[193] Hang Zhang. 2020. Text-to-Form: 3D Prediction by Linguistic Descriptions. In Proceedings of the 40th Annual Conference of the Association for
Computer Aided Design in Architecture (ACADIA). Online and Global, 238–247. https://doi.org/10.52842/conf.acadia.2020.1.238
[194] Hang Zhang and Ye Huang. 2021. Machine Learning Aided 2D-3D Architectural Form Finding at High Resolution. Springer, Singapore, 159–168.
https://doi.org/10.1007/978-981-33-4400-6_15
[195] Song-Hai Zhang, Shao-Kui Zhang, Yuan Liang, and Peter Hall. 2019. A Survey of 3D Indoor Scene Synthesis. Journal of Computer Science and
Technology 34, 3 (May 2019), 594–608. https://doi.org/10.1007/s11390-019-1929-5
[196] Hao Zheng and Philip F Yuan. 2021. A Generative Architectural and Urban Design Method through Artificial Neural Networks. Building and
Environment 205 (2021), 108178. https://doi.org/10.1016/j.buildenv.2021.108178
[197] Zerong Zheng et al. 2021. Deep Implicit Templates for 3D Shape Representation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). Nashville, TN, USA, 1429–1439. https://doi.org/10.1109/CVPR46437.2021.00148
[198] Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3D Shape Generation and Completion through Point-Voxel Diffusion. In 2021 IEEE/CVF International
Conference on Computer Vision (ICCV). online, 5806–5815. https://doi.org/10.1109/ICCV48922.2021.00577
[199] Gizem Özerol and Semra Arslan Selçuk. 2023. Machine Learning in the Discipline of Architecture: A Review on the Research Trends between 2014
and 2020. International Journal of Architectural Computing 21, 1 (2023), 23–41. https://doi.org/10.1177/14780771221100102

Manuscript submitted to ACM

36 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

APPENDIX
A SUPPLEMENTARY MATERIAL (E-PUB ONLY)
This material is supplementary and should be included in the electronic version only.

Table 1. Overview of the mission and scope of virtual architecture, focusing on five key aspects: spatial forms, user presence,
socialization, interaction, and interoperability and scalability.

Discipline Description
Spatial forms Incorporating diverse forms in terms of function, style, and purpose to ensure efficient production to achieve large-scale
outcomes.
User presence, Ensuring self-presence with a pleasant, understandable user experience, considering human perception (e.g., comfort, safety),
experience behavioral demands (e.g., navigation, gathering activities), and environmental factors (e.g., lighting, sounds, atmosphere) in
immersive environments.
Socialization, Supporting diverse and multiple social interactions, including activities, engagements, and goals for live, entertainment,
engagement and work. Providing collaboration opportunities with integrated tools for editing, altering, and customizing tasks.
Interactive el- Designing flexible, dynamic, and real-time interactive human-centric buildings by utilizing ubiquitous data from environ-
ements mental intelligence, such as responsive or interactive building elements.
Interoperability, Building interoperability and scalability across multiple platforms focusing on data security, user privacy, and resource
scalability management for sustainability.

Table 2. This article elucidates four key focuses derived from the research questions reviewed for deep learning-assisted architectural
design papers: datasets, multimodality, design intuition, and generative models (depicted as four columns on the right). A selection of
the reviewed papers is provided in the table as a reference.

3D Repre- Deep Key Focus

Reference Category Research Question
sentation Generative Dataset Multi- Design Generative
Model Modality Intu- Model
ition
[20] 3D Solid Extend design cognition Point cloud/Mesh GAN (with Autoencoder) ✓ ✓
[120] 3D Solid Generate building massing Point cloud/Voxel Grid GAN ✓ ✓ ✓
[114] 3D Solid Generation, manipulation and form finding of struc- Wireframe VAE ✓ ✓
tural typologies
[83] 3D Solid Solve design problems incorporating deep learning Voxel grid Based on VAE & GAN ✓ ✓ ✓
[158] 3D Solid New way to design models Voxel grid VAE ✓ ✓ ✓
[60] Text-3D Solid Multimodal generation with text input - Diffusion model ✓ ✓

Manuscript submitted to ACM

A Survey on Deep Learning for Design and Generation of Virtual Architecture 37

Table 3. An overview of 3D generative approaches for shape generation, each allowing the creation of editable models for explicit
representations.

Method Names Publication & Year 3D Representation Deep Generative Model

3D GAN [179] NIPS 2016 Voxel grid GAN
Text2Shape [28] ACCV 2018 Voxel grid GAN
PLATONICGAN [67] CVPR 2019 Voxel grid GAN
IG GAN [111] arXiv 2020 Voxel grid GAN
Achlioptas et al. [2] ICML 2018 Point cloud GAN
Shu et al. [166] CVPR 2019 Point cloud GAN
Get3d [53] NeurIPS 2022 Mesh GAN
IM-Net [29] CVPR 2019 Neural field GAN
Kleineberg et al. [92] arXiv 2020 Neural field GAN
Brock et al. [18] arXiv 2016 Voxel grid VAE
Autosdf [117] CVPR 2022 Voxel gird VAE
Sagnet [181] TOG 2019 Voxel gird VAE
Li et al. [101] AAAI 2020 Voxel gird VAE
Pq-net [180] CVPR 2020 Voxel gird VAE
AdversarialAE [189] CVPR 2020 Voxel gird VAE
Multi-Chart [16] TOG 2018 Mesh VAE
SDM-NET [55] TOG 2019 Mesh VAE
Tm-net [54] TOG 2021 Mesh VAE
Polygen [123] ICML 2020 Mesh VAE
PointFLow [183] CVPR 2019 Point cloud Normaliz. flow model
CLIP-Forge [153] CVPR 2022 Voxel grid Normaliz. flow model
DreamFusion [145] arXiv 2022 Mesh Diffusion model
PVD [198] CVPR 2021 Hybrid: point cloud-voxel Diffusion model
Magic3D [103] arXiv 2022 Neural field-Mesh Diffusion model
LION [191] arXiv 2022 Mesh Diffusion model
Point-E [130] arXiv 2022 Neural field-Point cloud Diffusion model
Magic123 [147] arXiv 2023 Mesh Diffusion model
Prolificdreamer [176] arXiv 2023 Mesh Diffusion model
Hifi-123 [187] arXiv 2024 Mesh Diffusion model
MVDream [164] arXiv 2024 Mesh Diffusion model

Table 4. An overview of 3D generative approaches of 3D-aware image synthesis. ‘Single/multiple’ represents the result generated by
a single image adopting a sample of single-view or multiple images adopting multiple-view images. ‘Geometry’ indicates whether
this method allows the export to mesh. ‘Editability’ refers to the ability to directly manipulate or modify the synthesized output,
such as adding or removing objects or adjusting their properties. ‘Controllability’ refers to the ability to adjust variables during the
generation process, such as camera pose, object position, and lighting.

Publication 3D Repre- Single/multi- Controllability

Method Names Geometry Editability Highlight
& Year sentation ple-view Camera Object
NeRF [116] CVPR 2019 Neural field multiple - - ✓ ✓- 3D Novel View
Synthesis
HoloGAN [129] CVPR 2019 Voxel grid single ✓ - ✓ - -
Pi-GAN [27] CVPR 2021 Neural field single ✓ - - - -
Giraffe [132] CVPR 2021 Neural field single - ✓ ✓ ✓ -
StyleSDF [136] CVPR 2022 Neural field single ✓ - ✓ - -
StyleNeRF [59] arXiv 2021 Neural field single ✓ - ✓ - -
DreamField [78] CVPR 2022 Neural field multiple ✓ ✓ - Text input (CLIP)
DreamFusion [145] arXiv 2022 Neural field single ✓ ✓ ✓ ✓ Text input
CLIP-NeRF [175] CVPR 2022 Neural field single - ✓ ✓ ✓ Text input (CLIP)

Manuscript submitted to ACM

38 Anqi Wang, Jiahua Dong, Lik-Hang Lee, Jiachuan Shen, and Pan Hui

Table 5. The related works in architecture are classified into three categories: 2D to 3D transposition, 3D solid generation, and
text-based 3D form generation. The ‘Category’ column denotes the generation approaches: ‘2D to 3D’ refers to generating 3D forms
from 2D images; ‘3D Solid’ involves directly generating 3D forms using DGMs such as GANs, VAEs, 3D-aware image synthesis, and
diffusion models; ‘Text-3D Solid’ indicates that the 3D form generation process incorporates text input to control the output. The
‘Objective’ column describes the research directions and goals. The ‘Methodology’ column outlines the proposed workflow for the
generation process. ‘3D Representation’ indicates the formats of the generated data, including point cloud, voxel grid, and mesh.
‘Generative Model’ specifies the type of deep generative model used in each work.

Reference Category Obejctive Methodology 3D Representa- Generative

tion Model
[38] 2D to 3D Test AI agency in design Utilizing Style Transfer to train two datasets Baroque and Modern im- - -
ages as a basis to form a 3D model
[138] 2D to 3D HCI in urban design Utilizing Style Transfer to generate different stylized images and gener- - -
ate 3D geometry through procedural modeling
[150] 2D to 3D Test AI agency in design Utilizing Style Transfer to replace pixels with voxelization units to - -
generate 3D forms
[105] 2D to 3D Toolkits for 3D generation Utilizing Style Transfer to assist the generation of 3D structure from 2D - -
images
[186] 2D to 3D Generate building massing Utilizing Pix2Pix GAN to generate plan pattern and section pattern, - -
then converted to 3D massing
[42] 2D to 3D Generate building massing Utilizing Pix2Pix GAN to generate urban morphology to create building - -
massing
[192] 2D to 3D Form finding to assist de- Slicing a 3D model and trained with different combinations of 2D Style- - -
sign GANnetworks, and finally stitching into a 3D model
[192] 2D to 3D Form finding to assist de- 3D model generation based on 2D plan and section using Style Transfer - -
sign
[194] 2D to 3D Form finding to assist de- Combining the spatial sequence information to generate 3D form from - -
sign 2D images through multi-level deep generative networks such as Style-
GAN
[11] 2D to 3D Human and neural net- Utilizing 3D solid for training to map spatial semantics to a latent space Point cloud GAN
work interface assembled using point cloud representations
[8] 2D to 3D Integrate latent space in de- GAN allows for navigation in the latent space to create digital designs Voxel grid GAN
sign using vector arithmetic and interpolation techniques, then converting
resulting images to 3D voxel structures
[72] 2D to 3D Recognize 2D pattern to 3D Utilizing Latent space rotation and perspective projection to generate Voxel grid GAN
form 3D model
[89] 2D to 3D Recognize 2D pattern to 3D Utilizing Latent space rotation and perspective projection to generate Voxel grid GAN
form 3D model
[134] 2D to 3D Generate building massing Employing voxelization and vectorization techniques on images gener- Voxel grid GAN
ated by the Pix2Pix GAN model to generate 3D building massing
[20] 3D Solid Extend design cognition Utilizing a GAN with a pair of encoder-decoder to process datasets and Point cloud/Mesh GAN (with Au-
generate new 3D models, resulting in different 3D representations like toencoder)
point cloud and mesh
[120] 3D Solid Generate building massing Utilizing BuildingNet dataset [160] as a basis to train urban morphology - GAN
data to generate 3D building massing
[114] 3D Solid Generation, manipulation Utilizing VAE to learn continuous latent space to generate new geome- Voxel grid VAE
and form finding of struc- tries
tural typologies
[83] 3D Solid Solve design problems in- Develop post-integration in the generation workflow using VAE and Voxel grid Base on VAE &
corporating deep learning GAN to solve complex multi-objective with abstract design criteria GAN
[158] 3D Solid New way to design models Utilizing VAE to encode and decode geometries through dimensionality Voxel grid VAE
manipulation
[108] 3D Solid Form finding to assist in de- Developing a tool with a user interface (UI) to generate 3D point clouds Point cloud Diffusion model
sign using a diffusion model.
[159] 3D Solid Form finding to assist de- Utilizing diffusion model to generate 3D forms with voxel grids through Voxel grid Diffusion model
sign parametric design as dataset
[61] Text-3D Solid Multimodal generation Utilizing diffusion model and NeRF to generate 3D forms from text input Mesh Diffusion model
with text input with multimodality & NeRF
[94] Text-3D Solid Test the connection be- Employing a specialized prompt to manipulate DreamField to generate Mesh Diffusion model
tween text and 3D form multiple design alternatives for architects
[60] Text-3D Solid Multimodal generation Utilizing diffusion model to generate 3D forms from text input with - Diffusion model
with text input multimodality