JAX-CORL

This repository provides a JAX version of CORL: clean, single-file implementations of offline RL algorithms with solid performance reports.

🌬️ Pursuing fast training: speed up via JAX functions such as jit and vmap.
🔪 As simple as possible: implement only the minimum requirements.
💠 Focus on a few battle-tested algorithms: Refer here.
📈 Solid performance report (README, Wiki).
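As a rough illustration of the first point, the sketch below combines `jax.vmap` and `jax.jit` on a toy per-transition loss. The function `td_error` and the array values are invented for this example and are not taken from the repository.

```python
import jax
import jax.numpy as jnp


def td_error(q_value, reward, next_q_value, discount=0.99):
    """Squared TD error for a single transition (toy example)."""
    target = reward + discount * next_q_value
    return (q_value - target) ** 2


# vmap vectorizes the per-transition function over the leading batch axis;
# jit compiles the resulting batched computation with XLA.
batched_td_error = jax.jit(jax.vmap(td_error))

q = jnp.array([1.0, 2.0, 3.0])
r = jnp.array([0.5, 0.5, 0.5])
next_q = jnp.array([1.0, 1.0, 1.0])
errors = batched_td_error(q, r, next_q)  # one error per transition, shape (3,)
```

Writing the loss for a single transition and letting `vmap` add the batch dimension keeps the code short, which fits the single-file style.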

JAX-CORL complements the single-file RL ecosystem by covering the offline x JAX combination.

CleanRL: Online x PyTorch
purejaxrl: Online x JAX
CORL: Offline x PyTorch
JAX-CORL (ours): Offline x JAX

Algorithms

Algorithm	Implementation	Training time (CORL)	Training time (ours)	wandb
AWAC	algos/awac.py	4.46h	11m (24x faster)	link
IQL	algos/iql.py	4.08h	9m (28x faster)	link
TD3+BC	algos/td3_bc.py	2.47h	9m (16x faster)	link
CQL	algos/cql.py	11.52h	56m (12x faster)	link
DT	algos/dt.py	42m	11m (4x faster)	link

Training time is measured for 1,000,000 update steps without evaluation on halfcheetah-medium-expert-v2 (the difference between D4RL MuJoCo environments is small). Our training time includes the jit compile time. The computations were performed using four GeForce GTX 1080 Ti GPUs. The PyTorch time is measured with the CORL implementations.
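To see why the compile time matters for such measurements, here is a minimal sketch of timing a jitted function. The `update` function is a toy stand-in, not one of the repository's update steps; the first call traces and compiles, later calls reuse the compiled code.

```python
import time

import jax
import jax.numpy as jnp


@jax.jit
def update(params, batch):
    """Toy gradient step on a least-squares loss (stand-in for a real update)."""
    def loss_fn(p):
        return jnp.mean((batch @ p) ** 2)

    grads = jax.grad(loss_fn)(params)
    return params - 1e-2 * grads


params = jnp.ones((8,))
batch = jnp.ones((32, 8))

start = time.perf_counter()
# First call: traces and compiles, then runs.
params = update(params, batch).block_until_ready()
first_call = time.perf_counter() - start

start = time.perf_counter()
# Second call: reuses the compiled function.
params = update(params, batch).block_until_ready()
second_call = time.perf_counter() - start

print(f"first call (with compile): {first_call:.4f}s, second call: {second_call:.4f}s")
```

`block_until_ready()` is needed because JAX dispatches work asynchronously; without it the timer would stop before the computation has finished.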

Reports for D4RL mujoco

Normalized Score

Here, we use the D4RL MuJoCo control tasks as the benchmark. We report the mean and standard deviation, over 5 seeds, of the average normalized score of 5 episodes. We plan to extend the verification to other D4RL benchmarks such as AntMaze. For the source of the hyperparameters and the validity of the performance, please refer to the Wiki.
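The aggregation described above can be sketched as follows. The raw returns and the two reference returns are placeholders invented for this example (real reference values come from the D4RL package), and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import numpy as np


def normalized_score(raw_return, random_return, expert_return):
    """D4RL-style normalization: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical raw returns: 5 seeds (rows) x 5 evaluation episodes (columns).
rng = np.random.default_rng(0)
raw_returns = rng.normal(loc=5000.0, scale=200.0, size=(5, 5))

# Placeholder reference returns, NOT the real D4RL values.
RANDOM_RETURN, EXPERT_RETURN = 0.0, 10000.0

scores = normalized_score(raw_returns, RANDOM_RETURN, EXPERT_RETURN)
per_seed = scores.mean(axis=1)  # average over the 5 episodes of each seed
mean, std = per_seed.mean(), per_seed.std()  # statistics across the 5 seeds
print(f"{mean:.2f} ± {std:.2f}")
```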

env	AWAC	IQL	TD3+BC	CQL	DT
halfcheetah-medium-v2	$41.56\pm0.79$	$43.28\pm0.51$	$48.12\pm0.42$	$48.65\pm0.49$	$42.63\pm0.53$
halfcheetah-medium-expert-v2	$76.61\pm9.60$	$92.87\pm0.61$	$92.99\pm0.11$	$53.76\pm14.53$	$70.63\pm14.70$
hopper-medium-v2	$51.45\pm5.40$	$52.17\pm2.88$	$46.51\pm4.57$	$77.56\pm7.12$	$60.85\pm6.78$
hopper-medium-expert-v2	$51.89\pm2.11$	$53.35\pm5.63$	$105.47\pm5.03$	$90.37\pm31.29$	$109.07\pm4.56$
walker2d-medium-v2	$68.12\pm12.08$	$75.33\pm5.2$	$72.73\pm4.66$	$80.16\pm4.19$	$71.04\pm5.64$
walker2d-medium-expert-v2	$91.36\pm23.13$	$109.07\pm0.32$	$109.17\pm0.71$	$110.03\pm0.72$	$99.81\pm17.73$

How to use this codebase for your research

This codebase can be used independently as a baseline for D4RL projects. It is also designed to be flexible, allowing users to develop new algorithms or adapt them for datasets other than D4RL.

For researchers interested in using this code for their projects, we provide a detailed explanation of the code's shared structure:

Data structure

from typing import NamedTuple

import jax.numpy as jnp


class Transition(NamedTuple):
    observations: jnp.ndarray
    actions: jnp.ndarray
    rewards: jnp.ndarray
    next_observations: jnp.ndarray
    dones: jnp.ndarray


def get_dataset(...) -> Transition:
    ...
    return dataset

The code includes a Transition class, defined as a NamedTuple, which contains fields for observations, actions, rewards, next observations, and done flags. The get_dataset function is expected to output data in the Transition format, making it adaptable to any dataset that conforms to this structure.
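As a sketch of how a custom dataset might be adapted to this structure, the example below wraps a dictionary of NumPy arrays into a `Transition` and draws a mini-batch. The helper names `get_dataset_from_arrays` and `sample_batch`, the dictionary keys, and the synthetic data are assumptions made for illustration; they are not the repository's API.

```python
from typing import NamedTuple

import jax
import jax.numpy as jnp
import numpy as np


class Transition(NamedTuple):
    observations: jnp.ndarray
    actions: jnp.ndarray
    rewards: jnp.ndarray
    next_observations: jnp.ndarray
    dones: jnp.ndarray


def get_dataset_from_arrays(raw: dict) -> Transition:
    """Hypothetical loader: wraps a dict of NumPy arrays into a Transition."""
    return Transition(
        observations=jnp.asarray(raw["observations"], dtype=jnp.float32),
        actions=jnp.asarray(raw["actions"], dtype=jnp.float32),
        rewards=jnp.asarray(raw["rewards"], dtype=jnp.float32),
        next_observations=jnp.asarray(raw["next_observations"], dtype=jnp.float32),
        dones=jnp.asarray(raw["terminals"], dtype=jnp.float32),
    )


def sample_batch(dataset: Transition, rng: jnp.ndarray, batch_size: int) -> Transition:
    """Draws a random mini-batch by indexing every field with the same indices."""
    idx = jax.random.randint(rng, (batch_size,), 0, dataset.observations.shape[0])
    return jax.tree_util.tree_map(lambda x: x[idx], dataset)


# Synthetic data standing in for a real dataset:
# 100 transitions, 3-dimensional observations, 2-dimensional actions.
raw = {
    "observations": np.zeros((100, 3)),
    "actions": np.zeros((100, 2)),
    "rewards": np.zeros((100,)),
    "next_observations": np.zeros((100, 3)),
    "terminals": np.zeros((100,)),
}
dataset = get_dataset_from_arrays(raw)
batch = sample_batch(dataset, jax.random.PRNGKey(0), batch_size=16)
```

Because a `NamedTuple` is a JAX pytree, `jax.tree_util.tree_map` applies the same indexing to every field and returns another `Transition`.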

Trainer class

from functools import partial
from typing import NamedTuple

import jax
from flax.training.train_state import TrainState  # assumed: Flax's TrainState


class Trainer(NamedTuple):
    actor: TrainState
    critic: TrainState
    # hyperparameters
    discount: float = 0.99
    ...

    def update_actor(agent, batch: Transition):
        ...
        return agent

    def update_critic(agent, batch: Transition):
        ...
        return agent

    # n_jitted_updates must be static because it sets the Python loop length
    @partial(jax.jit, static_argnames=("n_jitted_updates",))
    def update_n_times(agent, data, n_jitted_updates):
        for _ in range(n_jitted_updates):
            batch = data.sample()
            agent = agent.update_actor(batch)
            agent = agent.update_critic(batch)
        return agent


def create_trainer(...):
    # initialize models...
    return Trainer(
        actor=actor,
        critic=critic,
    )

Each algorithm has a Trainer class (e.g. TD3BCTrainer for TD3+BC) that encompasses all necessary components of the algorithm: models, hyperparameters, and update logic. The Trainer class can be used outside of the provided files as long as the create_trainer function is implemented to meet the specifications of the Trainer class. Note: CQL does not yet follow this structure due to technical issues. This will be addressed in the near future.
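To make the pattern concrete, here is a minimal runnable sketch of a Trainer-style `NamedTuple` with a jitted multi-step update. It is a hypothetical behavior-cloning trainer, not one of the repository's algorithms: the names `Batch` and `BCTrainer`, the linear policy, the batch size of 32, and the plain SGD step on a raw parameter array (used here in place of `TrainState`) are all assumptions chosen to keep the example self-contained.

```python
from functools import partial
from typing import NamedTuple

import jax
import jax.numpy as jnp


class Batch(NamedTuple):
    observations: jnp.ndarray
    actions: jnp.ndarray


class BCTrainer(NamedTuple):
    """Hypothetical trainer: a linear behavior-cloning policy trained with SGD."""
    actor_params: jnp.ndarray  # weight matrix of shape (obs_dim, act_dim)
    lr: float = 0.1

    def update_actor(agent, batch: Batch):
        def loss_fn(params):
            pred = batch.observations @ params
            return jnp.mean((pred - batch.actions) ** 2)

        loss, grads = jax.value_and_grad(loss_fn)(agent.actor_params)
        new_params = agent.actor_params - agent.lr * grads
        # NamedTuples are immutable, so return an updated copy.
        return agent._replace(actor_params=new_params), loss

    @partial(jax.jit, static_argnames=("n_jitted_updates",))
    def update_n_times(agent, data: Batch, rng, n_jitted_updates: int):
        # The Python loop is unrolled at trace time, so its length must be static.
        for _ in range(n_jitted_updates):
            rng, key = jax.random.split(rng)
            idx = jax.random.randint(key, (32,), 0, data.observations.shape[0])
            batch = jax.tree_util.tree_map(lambda x: x[idx], data)
            agent, loss = agent.update_actor(batch)
        return agent, loss


def create_trainer(obs_dim: int, act_dim: int) -> BCTrainer:
    return BCTrainer(actor_params=jnp.zeros((obs_dim, act_dim)))


# Synthetic data: actions are a fixed linear function of the observations.
rng = jax.random.PRNGKey(0)
obs = jax.random.normal(rng, (256, 3))
true_w = jnp.array([[1.0], [-2.0], [0.5]])
data = Batch(observations=obs, actions=obs @ true_w)

trainer = create_trainer(obs_dim=3, act_dim=1)
for _ in range(20):
    rng, key = jax.random.split(rng)
    trainer, loss = trainer.update_n_times(data, key, n_jitted_updates=8)
```

Running several updates inside one jitted call reduces the Python overhead per update, at the cost of a longer compile as `n_jitted_updates` grows.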

Cite JAX-CORL

@article{nishimori2024jaxcorl,
  title={JAX-CORL: Clean Single-file Implementations of Offline RL Algorithms in JAX},
  author={Soichiro Nishimori},
  year={2024},
  url={https://github.com/nissymori/JAX-CORL}
}

Credits

This project is inspired by CORL, clean single-file implementations of offline RL algorithms in PyTorch.
I would like to thank @JohannesAck for his TD3-BC codebase and helpful advice.
The IQL implementation is based on implicit_q_learning.
The AWAC implementation is based on jaxrl.
The CQL implementation is based on JaxCQL.
The DT implementation is based on min-decision-transformer.
