ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks


Submitted to IEEE Transactions on Big Data

Overview

This repository provides the code and data for "ASH: A Robust Automated Evaluation Framework for Creative Text Generation in Cuisine Transfer Tasks". This research extends our previous work ("Culinary Class Wars") by introducing a rigorous meta-evaluation of the ASH (Authenticity, Sensitivity, Harmony) framework.

In this extended study, we not only evaluate Large Language Models (LLMs) on cuisine transfer tasks but also rigorously validate the "LLM-as-a-judge" paradigm itself. We implement and compare eight distinct prompt engineering strategies (ranging from simple scoring to Chain-of-Thought) to identify the most human-aligned evaluation methodology.

Table of Contents

  • Project Structure
  • Setup
  • Data Description
  • How to Run: Generation & Evaluation
  • How to Run: Prompt Engineering Experiments
  • Results
  • Contributors
  • Acknowledgements

Project Structure

The repository is organized into two main parts, data and code; the code directory includes the new prompt_engineering module.

.
├── data
│   ├── generation
│   │   └── v0_recipes.csv                 # Recipes generated by LLMs (4,800 recipes)
│   └── evaluation
│       ├── 5-round                        # Five evaluations per recipe (Baseline)
│       │   ├── v0_recipes_eval_5_ollama.csv
│       │   └── ...
│       └── human                          # Human annotated ground truth
│           └── human_ground_truth_200.csv # 200 recipes evaluated by diverse humans
└── code
    ├── generation                         # Recipe generation scripts
    │   └── generate_recipes.py
    ├── evaluation                         # Standard ASH evaluation scripts
    │   ├── evaluate_recipes_5_ollama.py
    │   └── ...
    └── prompt_engineering                 # [NEW] Prompt Optimization Experiments
        └── evaluate_recipes_prompt_check_ollama.py # Script for evaluating recipes with 8 prompt strategies

Setup

  1. Clone the repository:
git clone https://github.com/dmis-lab/ASH.git
cd ASH
  2. Install dependencies:
pip install -r requirements.txt
  3. API keys: Ensure you have the necessary API keys configured (a minimal loading sketch follows this list):
  • OpenAI API key: ../API_KEY/API_KEY_openai.txt
  • Google Gemini API key: ../API_KEY/API_KEY_gemini.txt
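
The hosted-model scripts read these keys from plain-text files. Below is a minimal sketch of such a loader, assuming the relative paths above; the helper name load_api_key is illustrative and not the repository's actual code.

from pathlib import Path

def load_api_key(path: str) -> str:
    # Read an API key stored as a plain-text file, stripping trailing newlines.
    return Path(path).read_text().strip()

# Hypothetical usage; adjust the paths to your own layout.
openai_key = load_api_key("../API_KEY/API_KEY_openai.txt")
gemini_key = load_api_key("../API_KEY/API_KEY_gemini.txt")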

Data Description

  • data/generation: Contains 4,800 recipes generated by 6 models across 40 cuisines.
  • data/evaluation: Contains baseline ASH evaluation results.
  • data/prompt_experiments: Contains the outputs of the meta-evaluation, where different prompt strategies (e.g., CoT, Role-Playing) were tested against human ground truth to find the optimal evaluator.
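
A quick way to inspect the released CSVs is with pandas; the exact column schemas are not documented here, so this sketch only loads the files and prints their shapes and column names.

import pandas as pd

# Generated recipes and the human-annotated ground truth.
recipes = pd.read_csv("data/generation/v0_recipes.csv")
human = pd.read_csv("data/evaluation/human/human_ground_truth_200.csv")

# Check shapes and column names before any schema-specific processing.
print(recipes.shape, list(recipes.columns))
print(human.shape, list(human.columns))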

How to Run: Generation & Evaluation

1. Recipe Generation

Generate recipes using the standardized prompt template.

python code/generation/generate_recipes.py --model mistral:7b --output data/generation/recipes_mistral.csv
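
For orientation, here is a minimal sketch of a single Ollama-backed generation call, assuming the ollama Python client and a running Ollama server; the prompt wording and the source_dish / target_cuisine placeholders are illustrative and do not reproduce the paper's standardized template.

import ollama  # assumes the `ollama` Python client and a local Ollama server

def generate_recipe(model: str, source_dish: str, target_cuisine: str) -> str:
    # Placeholder prompt; the standardized template used in the paper differs.
    prompt = (
        f"Rewrite the recipe for {source_dish} so that it reflects "
        f"{target_cuisine} cuisine. List the ingredients and the steps."
    )
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(generate_recipe("mistral:7b", "bibimbap", "Italian"))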

2. Standard ASH Evaluation (Baseline)

Evaluate the generated recipes using the default scoring prompt.

python code/evaluation/evaluate_recipes_5_ollama.py data/generation/v0_recipes.csv
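
Conceptually, this step asks a judge model to score each recipe on the three ASH axes. The sketch below illustrates that idea, assuming an Ollama judge and a 1-5 scale; the prompt text and JSON parsing are illustrative, not the repository's default scoring prompt.

import json
import ollama

ASH_AXES = ("Authenticity", "Sensitivity", "Harmony")

def ash_scores(judge_model: str, recipe: str, target_cuisine: str) -> dict:
    # Illustrative judging prompt; the repository's default prompt differs.
    prompt = (
        f"Score the following {target_cuisine}-style recipe from 1 to 5 on "
        f"{', '.join(ASH_AXES)}. Reply with a JSON object only.\n\n{recipe}"
    )
    reply = ollama.chat(model=judge_model, messages=[{"role": "user", "content": prompt}])
    return json.loads(reply["message"]["content"])  # real runs need more robust parsing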

How to Run: Prompt Engineering Experiments

This section reproduces the meta-evaluation experiments (Table III in the paper) to identify the optimal prompt strategy.

Evaluate with 8 Prompt Strategies

Run the comprehensive evaluation script. This script utilizes multiprocessing to distribute tasks across available GPUs and evaluates recipes using 8 distinct prompt strategies (Default, Role-Playing, Scoring Scale, CoT, etc.) and multiple evaluator models.

Usage: Ensure your Ollama server is running and the required models (e.g., gemma2:9b, mistral:7b, llama3.1:8b) are pulled.

# Run the prompt check script
python code/prompt_engineering/evaluate_recipes_prompt_check_ollama.py
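
Below is a rough sketch of the fan-out the script performs, distributing (recipe, strategy, evaluator-model) combinations over a multiprocessing pool; the strategy names, model list, and worker count are placeholders, and GPU assignment is handled by the actual script and the Ollama server.

from itertools import product
from multiprocessing import Pool

STRATEGIES = ["default", "role_playing", "scoring_scale", "cot"]  # placeholders for the 8 strategies
MODELS = ["gemma2:9b", "mistral:7b", "llama3.1:8b"]

def evaluate_one(task):
    recipe_id, strategy, model = task
    # The real script builds the strategy-specific prompt here and queries Ollama;
    # this stub just echoes the task it was given.
    return recipe_id, strategy, model

if __name__ == "__main__":
    tasks = list(product(range(10), STRATEGIES, MODELS))  # toy recipe ids
    with Pool(processes=4) as pool:
        results = pool.map(evaluate_one, tasks)
    print(len(results), "evaluations dispatched")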

Expected Output: A ranking of the prompt strategies by MSE against the human ground truth (e.g., Strategy 3, Scoring Scale Specification, typically yields the lowest MSE).
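
To produce that ranking, each strategy's automated scores are compared with the human annotations; a lower mean squared error means closer human alignment. The sketch below assumes the judge and human CSVs share a recipe identifier column; the column names recipe_id and score are placeholders.

import pandas as pd

def strategy_mse(judge_csv: str, human_csv: str,
                 score_col: str = "score", key: str = "recipe_id") -> float:
    # Mean squared error between one prompt strategy's scores and the human scores.
    judge = pd.read_csv(judge_csv)
    human = pd.read_csv(human_csv)
    merged = judge.merge(human, on=key, suffixes=("_judge", "_human"))
    diff = merged[f"{score_col}_judge"] - merged[f"{score_col}_human"]
    return float((diff ** 2).mean())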

Results

  • Generative Capability: Comparison of 6 LLMs showing the trade-off between Sensitivity (Style) and Authenticity (Substance).
  • Evaluator Reliability: The "Scoring Scale Specification" strategy was found to be more robust (MSE 1.087) than complex Chain-of-Thought prompts, highlighting a "Complexity Paradox" in automated evaluation.

Contributors

Name | Affiliation | Email
Hoonick Lee (First Author) | Dept. of Computer Science & Engineering, Korea University | [email protected]
Mogan Gim | Dept. of Biomedical Engineering, Hankuk University of Foreign Studies | [email protected]
Donghyeon Park | Dept. of AI and Data Science, Sejong University | [email protected]
Donghee Choi† | School of Computer Science & Engineering, Pusan National University | [email protected]
Jaewoo Kang† | Dept. of Computer Science & Engineering, Korea University | [email protected]

† Corresponding Authors

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) under grant No. NRF-2023R1A2C3004176. It was also supported by the Hankuk University of Foreign Studies Research Fund of 2025 and a New Faculty Research Grant of Pusan National University, 2025.
