SDK Documentation | Platform Documentation | Usage Examples
The official SDK of MOSTLY AI, a Python toolkit for high-fidelity, privacy-safe Synthetic Data.
- Client mode connects to a remote MOSTLY AI platform for training & generating synthetic data there.
- Local mode trains and generates synthetic data locally on your own compute resources.
- Generators, that were trained locally, can be easily imported to a platform for further sharing.
The SDK allows you to programmatically create, browse and manage 3 key resources:
- Generators - Train a synthetic data generator on your existing tabular or language data assets
- Synthetic Datasets - Use a generator to create any number of synthetic samples to your needs
- Connectors - Connect to any data source within your organization, for reading and writing data
| Intent | Primitive | Documentation |
|---|---|---|
| Train a Generator on tabular or language data | g = mostly.train(config) |
see mostly.train |
| Generate any number of synthetic data records | sd = mostly.generate(g, config) |
see mostly.generate |
| Live probe the generator on demand | df = mostly.probe(g, config) |
see mostly.probe |
| Connect to any data source within your org | c = mostly.connect(config) |
see mostly.connect |
Client mode only
pip install -U mostlyaiClient + Local mode
pip install -U 'mostlyai[local]' # for CPU
#pip install -U 'mostlyai[local-gpu]' # for GPUNOTE: installing mostlyai[local] on Linux requires --extra-index-url https://download.pytorch.org/whl/cpu to be specified.
Optional Connectors
Add any of the following extras for further data connectors support: databricks, googlebigquery, hive, mssql, mysql, oracle, postgres, snowflake.
E.g.
pip install -U 'mostlyai[local, databricks, snowflake]'For client mode, initialize with base_url and api_key obtained from your account settings page. For local mode, initialize the client simply with local=True.
import pandas as pd
from mostlyai.sdk import MostlyAI
# 1) Initialize the SDK in local or client mode
mostly = MostlyAI(local=True)
# mostly = MostlyAI(base_url='https://app.mostly.ai', api_key='YOUR_API_KEY')
# 2) Load your original data
trn_df = pd.read_csv('https://github.com/mostly-ai/public-demo-data/raw/dev/census/census.csv.gz')
# 3) Train a synthetic data generator
g = mostly.train(name='census', data=trn_df) # shorthand syntax for 1-table config
# 4) Live probe small synthetic samples
df_samples = mostly.probe(g, size=10)
# 5) Generate a full synthetic dataset
sd = mostly.generate(g, size=100_000)
syn_df = sd.data()- Broad Data Support
- Mixed-type data (categorical, numerical, geospatial, text, etc.)
- Single-table, multi-table, and time-series
- Multiple Model Types
- TabularARGN for SOTA tabular performance
- Fine-tune HuggingFace-based language models
- Efficient LSTM for text synthesis from scratch
- Advanced Training Options
- GPU/CPU support
- Differential Privacy
- Progress Monitoring
- Automated Quality Assurance
- Quality metrics for fidelity and privacy
- In-depth HTML reports for visual analysis
- Flexible Sampling
- Up-sample to any data volumes
- Conditional generation by any columns
- Re-balance underrepresented segments
- Context-aware data imputation
- Statistical fairness controls
- Rule-adherence via temperature
- Seamless Integration
- Connect to external data sources (DBs, cloud storages)
- Fully permissive open-source license
Please consider citing our project if you find it useful:
@software{mostlyai,
author = {{MOSTLY AI}},
title = {{MOSTLY AI SDK}},
url = {https://github.com/mostly-ai/mostlyai},
year = {2025}
}