
Data Modeling and Simulation

Synthetic Data Generation

Dr. Iman Abu Hashish


Department of Data Science and Artificial Intelligence
e-mail: [email protected]
Phone Ext.: 5104
Office: SB-G04

Fall Semester
Academic Year 2024/2025
Table of Contents

1. Objectives

2. What is Synthetic Data?


Synthesis from Real Data
Synthesis without Real Data
Synthesis and Utility

3. Why Synthetic Data?

4. Synthetic Case Studies

5. The Synthetic Customer Churn Use Case

6. References

Synthetic Data Generation I. Abu Hashish 2


Objectives

• Introduce the concept of synthetic data.
• Differentiate between the types of synthetic data.
• Understand the concept of utility in synthetic data generation.
• Explain and explore the synthetic customer churn data use case.



What is Synthetic Data?

Synthetic data is not real data, but we want it to behave like real data.

• It is data that has been generated from real data and that has
similar statistical properties.
• The degree to which a synthetic dataset is an accurate proxy for
real data is a measure of its utility – can we really use it?
• Synthesis is the process of generating synthetic data.
• Synthetic data can take different forms – structured,
semi-structured, or unstructured.
• There are three types of synthetic data – generated from real
data, generated without real data, and a hybrid of the two.
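
The notion of utility above can be made concrete with a distance between the real and synthetic distributions. The sketch below uses the two-sample Kolmogorov-Smirnov statistic as one possible utility score for a single numeric column; this choice of metric is an assumption for illustration, and the "real" column is a randomly generated placeholder, not actual data.

```python
import random
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to compare them at those values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


rng = random.Random(0)
# Placeholder standing in for a real column (NOT real data).
real_column = [rng.gauss(50, 10) for _ in range(500)]
good_synthetic = [rng.gauss(50, 10) for _ in range(500)]  # same distribution
poor_synthetic = [rng.gauss(70, 10) for _ in range(500)]  # shifted distribution

ks_good = ks_statistic(real_column, good_synthetic)
ks_poor = ks_statistic(real_column, poor_synthetic)
print(f"KS statistic, similar distribution: {ks_good:.3f}")
print(f"KS statistic, shifted distribution: {ks_poor:.3f}")
```

A smaller statistic suggests higher utility for that column. How small is small enough depends on the use case.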



Synthesis from Real Data

Suppose a data science group specializing in understanding customer
behavior needs large amounts of data to build its models, but
the process for accessing that customer data is slow and does not
provide good enough data. What to do?

• Some real datasets are available – but we need more!
• The data is explored to identify its nature – number of variables,
data types, distributions, etc.
• A model is built to capture the nature of the data, and synthetic
data is then sampled or generated from that model.
• The model is considered a good representation if the synthetic
data has statistical properties similar to those of the real data.
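
The steps above can be sketched for a single numeric variable as follows. This is a minimal illustration, assuming a normal distribution is an adequate model; the `real_charges` list is a randomly generated stand-in for a real dataset, and all names are hypothetical.

```python
import random
import statistics

rng = random.Random(42)

# Stand-in for a small real dataset (NOT real data): one numeric column.
real_charges = [rng.gauss(65.0, 15.0) for _ in range(200)]

# Step 1: explore the data to identify its nature.
real_mean = statistics.mean(real_charges)
real_std = statistics.stdev(real_charges)

# Step 2: build a model that captures it.
# Assumption: the column is roughly normally distributed.
model = statistics.NormalDist(mu=real_mean, sigma=real_std)

# Step 3: sample as many synthetic observations as needed.
synthetic_charges = model.samples(1000, seed=7)

# Step 4: check that the statistical properties are similar.
synth_mean = statistics.mean(synthetic_charges)
synth_std = statistics.stdev(synthetic_charges)
print(f"real:      mean={real_mean:.2f}, std={real_std:.2f}")
print(f"synthetic: mean={synth_mean:.2f}, std={synth_std:.2f}")
```

With several variables, the model would also need to capture the relationships between them, not only each marginal distribution.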

Figure: Conceptual process of synthesis from real data.


Synthesis without Real Data

The second type of synthetic data is not generated from real data,
but rather from existing models or domain experience.

• Existing models can be statistical models of a process or
simulations – a model of customer arrivals, or a simulation engine
that simulates customer characteristics.
• Background knowledge enables a data scientist to create a
model and sample from it – movement of stock prices, financial
market behavior, etc.
• If the knowledge is accurate, the synthetic data will behave in a
manner that is consistent with the real world.
• The phenomenon of interest must be well understood – what if the
phenomenon were new?
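
As an illustration of synthesis from an existing statistical model, the sketch below generates customer arrival times without any real data. It assumes arrivals follow a Poisson process, so the gaps between arrivals are exponentially distributed; the rate of 12 customers per hour and the function name are hypothetical choices.

```python
import random


def simulate_arrivals(rate_per_hour, hours, seed=None):
    """Return sorted arrival times (in hours) for a Poisson arrival process."""
    rng = random.Random(seed)
    arrivals = []
    # Gaps between consecutive arrivals are exponential with the given rate.
    t = rng.expovariate(rate_per_hour)
    while t < hours:
        arrivals.append(t)
        t += rng.expovariate(rate_per_hour)
    return arrivals


# Assumed domain knowledge: about 12 customers per hour over an 8-hour day.
arrival_times = simulate_arrivals(rate_per_hour=12, hours=8, seed=1)
print(len(arrival_times), "synthetic arrivals in 8 hours")
```

If the assumed rate or the Poisson assumption is wrong, the synthetic arrivals will not match the real process, which is why the phenomenon has to be well understood.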



Synthesis and Utility

For some use cases, having high utility will matter quite a bit – in
other cases, medium or even low utility may be acceptable.

Can you think of possible use cases?



Why Synthetic Data?

Data synthesis has two important benefits – providing more efficient
access to data and enabling better analytics.

• Data access is critical to artificial intelligence and machine
learning projects – training, testing, and validating models,
evaluating technologies, testing software, etc.
• Privacy concerns and consent are always a challenge – data
synthesis helps address such issues.
• If a phenomenon is new, data may be unavailable, costly, or
impractical to collect – data synthesis comes to the rescue!
• Analysts can use synthetic data models to validate their
assumptions – data synthesis for exploration.



Synthetic Case Studies

Illustrative application examples:

• Manufacturing and distribution – the use of industrial robots,
sensors, IoT devices, etc.
• Healthcare – data access is the problem!
• Financial services – large amounts of historical data are costly
to obtain.
• Transportation – the need to make very specific planning and
policy decisions about infrastructure in a data-limited
environment.



The Synthetic Customer Churn Use Case

Customer churn refers to the loss of customers or clients by a
business over a specific period of time. The synthetic data to be
generated consists of 8 features and 1000 observations.

• Age: age of the customer.
• MonthlyCharges: monthly charges paid by the customer.
• ServiceCalls: number of customer service calls.
• Gender: gender of the customer.
• InternetService: internet service provided to the customer.
• Contract: contract type signed by the customer.
• PaymentMethod: payment method used by the customer.
• Churn: binary churn value for the customer.
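
A dataset with these eight features and 1000 observations might be generated along the following lines, using only the Python standard library. The value ranges, category labels, and the rule linking churn to service calls and contract type are all illustrative assumptions, not taken from the course material.

```python
import random

# Hypothetical category labels, chosen only for illustration.
GENDERS = ["Male", "Female"]
INTERNET_SERVICES = ["DSL", "Fiber optic", "None"]
CONTRACTS = ["Month-to-month", "One year", "Two year"]
PAYMENT_METHODS = ["Credit card", "Bank transfer", "Electronic check"]


def generate_churn_data(n=1000, seed=0):
    """Return a list of n synthetic customer records as dictionaries."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        service_calls = rng.randint(0, 10)   # assumed range
        contract = rng.choice(CONTRACTS)
        # Assumed rule: more service calls and a month-to-month contract
        # raise the churn probability (at most 0.1 + 0.5 + 0.2 = 0.8).
        churn_prob = 0.1 + 0.05 * service_calls
        if contract == "Month-to-month":
            churn_prob += 0.2
        rows.append({
            "Age": rng.randint(18, 80),                              # assumed range
            "MonthlyCharges": round(rng.uniform(20.0, 120.0), 2),    # assumed range
            "ServiceCalls": service_calls,
            "Gender": rng.choice(GENDERS),
            "InternetService": rng.choice(INTERNET_SERVICES),
            "Contract": contract,
            "PaymentMethod": rng.choice(PAYMENT_METHODS),
            "Churn": 1 if rng.random() < churn_prob else 0,
        })
    return rows


churn_data = generate_churn_data()
print(len(churn_data), "observations,", len(churn_data[0]), "features")
print(churn_data[0])
```

Fixing the seed makes the dataset reproducible. Because no real data is involved, this is an instance of synthesis without real data, and its utility depends entirely on how realistic the assumed ranges and churn rule are.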



References

• Giuseppe Ciaburro. 2022. Hands-On Simulation Modeling with
Python. Packt Publishing.
• Sheldon Ross. 2022. Simulation. Elsevier Academic Press.
• Khaled El Emam, Lucy Mosquera & Richard Hoptroff. 2020.
Practical Synthetic Data Generation. O’Reilly.
