BlendRL Appendix

A APPENDIX

A.1 DETAILED BACKGROUND OF REINFORCEMENT LEARNING
We provide more detailed background on reinforcement learning. Policy-based methods directly optimize πθ using the noisy return signal, leading to potentially unstable learning. Value-based methods learn to approximate the value functions V̂ϕ or Q̂ϕ, and implicitly encode the policy, e.g., by selecting the actions with the highest Q-value with high probability (Mnih et al., 2015). To reduce the variance of the estimated Q-value function, one can learn the advantage function Âϕ(st, at) = Q̂ϕ(st, at) − V̂ϕ(st). An estimate of the advantage function can be computed as Âϕ(st, at) = Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V̂ϕ(s_{t+k}) − V̂ϕ(st) (Mnih et al., 2016). Advantage Actor-Critic (A2C) methods encode both the policy πθ (i.e., the actor) and the advantage function Âϕ (i.e., the critic), and use the critic to provide feedback to the actor, as in (Konda & Tsitsiklis, 1999). To push πθ to take actions that lead to higher returns, gradient ascent can be applied to L^PG(θ) = Ê[log πθ(a | s) Âϕ]. Proximal Policy Optimization (PPO) algorithms ensure minor policy updates that avoid catastrophic drops (Schulman et al., 2017), and can be applied to actor-critic methods. To do so, the main objective constrains the policy ratio r(θ) = πθ(a | s) / πθ_old(a | s), following L^PR(θ) = Ê[min(r(θ) Âϕ, clip(r(θ), 1 − ϵ, 1 + ϵ) Âϕ)], where clip constrains the input within [1 − ϵ, 1 + ϵ]. The global objective of the PPO actor-critic algorithm is L(θ, ϕ) = Ê[L^PR(θ) − c1 L^VF(ϕ)], with L^VF(ϕ) = (V̂ϕ(st) − V(st))² being the value-function loss. An entropy term can also be added to this objective to encourage exploration.
A.2 DETAILS OF DIFFERENTIABLE FORWARD REASONING

We provide the details of differentiable forward reasoning.
Definition A.1 A Forward Reasoning Graph is a bipartite directed graph (VG, V∧, EG→∧, E∧→G), where VG is a set of nodes representing ground atoms (atom nodes), V∧ is a set of nodes representing conjunctions (conjunction nodes), EG→∧ is a set of edges from atom nodes to conjunction nodes, and E∧→G is a set of edges from conjunction nodes to atom nodes.

BlendRL performs forward-chaining reasoning by passing messages on the reasoning graph. Essentially, forward reasoning consists of two steps: (1) computing conjunctions of body atoms for each rule, and (2) computing disjunctions over head atoms deduced by different rules. These two steps can be efficiently computed by bi-directional message passing on the forward reasoning graph. We now describe each step in detail.
(Direction →) From Atom to Conjunction. First, messages are passed from atom nodes to conjunction nodes. For conjunction node vi ∈ V∧, the node features are updated as:

    v_i^(t+1) = ∨( v_i^(t), ∧_{j∈N(i)} v_j^(t) ),    (4)

where ∧ is a soft implementation of conjunction and ∨ is a soft implementation of disjunction. Intuitively, probabilistic truth values for the bodies of all ground rules are computed softly by Eq. 4.
(Direction ←) From Conjunction to Atom. Following the first message passing, the atom nodes are then updated using the messages from conjunction nodes. For atom node vi ∈ VG, the node features are updated as:

    v_i^(t+1) = ∨( v_i^(t), ∨_{j∈N(i)} w_ji · v_j^(t) ),    (5)
where w_ji is the weight of edge e_{j→i}. We assume that each rule Ck ∈ C has a weight θk, and w_ji = θk if edge e_{j→i} on the reasoning graph is produced by rule Ck. Intuitively, in Eq. 5, new atoms are deduced by gathering values from different ground rules and from the previous step.

Figure 8: Forward reasoning graph for the rules in Listing 1. A reasoning graph consists of atom nodes and conjunction nodes, and is obtained by grounding rules, i.e., removing variables by, e.g., X ← obj1, Y ← obj2. By performing bi-directional message passing on the reasoning graph using soft-logic operations, BlendRL computes logical consequences in a differentiable manner. Only relevant nodes are shown (best viewed in color).
We use the product for conjunction and the log-sum-exp function for disjunction:

    softor_γ(x1, . . . , xn) = γ log Σ_{1≤i≤n} exp(xi / γ),    (6)

where γ > 0 is a smoothing parameter. Eq. 6 approximates the maximum value of the inputs x1, . . . , xn.
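As an illustration, a one-line tensor version of Eq. 6 could look as follows; the name softor follows the text, while the batched signature is our assumption.

import torch

def softor(x, gamma=0.01, dim=-1):
    # Smooth maximum via log-sum-exp: gamma * log(sum_i exp(x_i / gamma)).
    # As gamma -> 0 this approaches max_i x_i; e.g., softor(torch.tensor([0.1, 0.9, 0.3]))
    # is approximately 0.9. The result can slightly exceed the true maximum, so
    # implementations may clamp it back to [0, 1].
    return gamma * torch.logsumexp(x / gamma, dim=dim)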
Prediction. The probabilistic logical entailment is computed by the bi-directional message passing. Let x_atoms^(0) ∈ [0, 1]^|G| be the input node features, which map each fact to a scalar value, RG be the reasoning graph, w be the rule weights, B be the background knowledge, and T ∈ N be the number of inference steps. For fact Gi ∈ G, BlendRL computes the probability as:

    p(Gi | x_atoms^(0), RG, w, B, T) = x_atoms^(T)[i],    (7)

where x_atoms^(T) ∈ [0, 1]^|G| are the node features of the atom nodes after T steps of the bi-directional message passing.
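To make Eqs. 4-7 concrete, here is a small, self-contained sketch of the bi-directional message passing on a dense encoding of the reasoning graph. The incidence-matrix encoding, the tensor shapes, and the simplification of recomputing conjunction nodes from the current atom values at every step are our assumptions for illustration, not the actual BlendRL implementation.

import torch

def softor(x, gamma=0.01, dim=-1):
    # Soft disjunction (Eq. 6): smooth approximation of the maximum.
    return gamma * torch.logsumexp(x / gamma, dim=dim)

def forward_reasoning(x_atoms, body, head, rule_weights, T):
    # x_atoms:      (num_atoms,) initial truth values x_atoms^(0)
    # body:         (num_conj, num_atoms) 0/1 float mask; body[j, k] = 1 if atom k is in conjunction j
    # head:         (num_atoms, num_conj) 0/1 float mask; head[i, j] = 1 if conjunction j deduces atom i
    # rule_weights: (num_conj,) weight theta_k of the rule producing each edge e_{j->i}
    x = x_atoms
    for _ in range(T):
        # (Direction ->) atom -> conjunction: product as soft conjunction (cf. Eq. 4).
        ones = torch.ones_like(x).expand_as(body)
        masked = torch.where(body.bool(), x.unsqueeze(0), ones)
        conj = masked.prod(dim=-1)                               # (num_conj,)
        # (Direction <-) conjunction -> atom: weighted soft disjunction (Eq. 5).
        # Non-neighbor entries are zero and barely affect the soft maximum.
        gathered = head * (rule_weights * conj).unsqueeze(0)     # (num_atoms, num_conj)
        x = softor(torch.cat([x.unsqueeze(-1), gathered], dim=-1), dim=-1)
        x = x.clamp(0.0, 1.0)                                    # keep values in [0, 1]
    return x  # x_atoms^(T); p(G_i | ...) = x[i]  (Eq. 7)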
A.3 BODY PREDICATES AND THEIR VALUATIONS

We here provide examples of valuation functions (in Python) for evaluating state predicates (e.g., closeby, left_of, etc.) generated by LLMs.
# A subset of valuation functions for Kangaroo and DonkeyKong (generated by LLMs)
def left_of(player: th.Tensor, obj: th.Tensor) -> th.Tensor:
    player_x = player[..., 1]
    obj_x = obj[..., 1]
    obj_prob = obj[:, 0]  # objectness
    return sigmoid(alpha < obj_x - player_x) * obj_prob * same_level_ladder(player, obj)

def _close_by(player: th.Tensor, obj: th.Tensor) -> th.Tensor:
    player_x = player[:, 1]
    player_y = player[:, 2]
    obj_x = obj[:, 1]
    obj_y = obj[:, 2]
    obj_prob = obj[:, 0]  # objectness
    x_dist = (player_x - obj_x).pow(2)
    y_dist = (player_y - obj_y).pow(2)
    dist = (x_dist + y_dist).sqrt()
    return sigmoid(dist) * obj_prob

def on_ladder(player: th.Tensor, obj: th.Tensor) -> th.Tensor:
    player_x = player[..., 1]
    obj_x = obj[..., 1]
    return sigmoid(abs(player_x - obj_x) < gamma)
...

# A subset of valuation functions for Seaquest (generated by LLMs)
def full_divers(objs: th.Tensor) -> th.Tensor:
    divers_vs = objs[:, -6:]
    num_collected_divers = th.sum(divers_vs[:, :, 0], dim=1)
    diff = 6 - num_collected_divers
    return sigmoid(1 / diff)

def not_full_divers(objs: th.Tensor) -> th.Tensor:
    return 1 - full_divers(objs)

def above(player: th.Tensor, obj: th.Tensor) -> th.Tensor:
    player_y = player[..., 2]
    obj_y = obj[..., 2]
    obj_prob = obj[:, 0]
    return sigmoid((player_y - obj_y) / gamma) * obj_prob
...
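These functions assume an object-centric state in which each object is encoded as a vector whose first entry is an objectness probability, followed by its x and y coordinates. The following hypothetical usage sketch illustrates that layout; the tensor values, the explicit smoothing constant, and the simplified rewrite of above are our assumptions, not part of the generated code.

import torch as th

# Hypothetical object layout inferred from the listing above: [objectness, x, y, ...].
player = th.tensor([[1.0, 40.0, 120.0]])   # one player object
diver = th.tensor([[0.9, 80.0, 100.0]])    # one candidate diver object

def above(player: th.Tensor, obj: th.Tensor, gamma: float = 10.0) -> th.Tensor:
    # Same computation as the generated `above` valuation shown earlier, with an
    # explicit smoothing constant `gamma` (its value here is an assumption).
    player_y = player[..., 2]
    obj_y = obj[..., 2]
    obj_prob = obj[:, 0]
    return th.sigmoid((player_y - obj_y) / gamma) * obj_prob

print(above(player, diver))  # soft truth value, scaled by the diver's objectness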
A.4 BLENDRL RULES

We here provide the blending and action rules obtained by BlendRL on Seaquest and DonkeyKong.
% Blending ruleset (Seaquest)
0.98 neural_agent(X):-close_by_enemy(P,E).
0.74 neural_agent(X):-close_by_missile(P,M).
0.02 logic_agent(X):-visible_diver(D).
0.02 logic_agent(X):-oxygen_low(B).
1.0 logic_agent(X):-full_divers(X).

% Policy ruleset (Seaquest)
0.21 up_air(X):-oxygen_low(B).
0.22 up_rescue(X):-full_divers(X).
0.21 left_to_diver(X):-right_of_diver(P,D),visible_diver(D).
0.24 right_to_diver(X):-left_of_diver(P,D),visible_diver(D).
0.23 up_to_diver(X):-deeper_than_diver(P,D),visible_diver(D).
0.22 down_to_diver(X):-higher_than_diver(P,D),visible_diver(D).

% Blending ruleset (DonkeyKong)
0.92 neural_agent(X):-close_by_barrel(P,B).
0.28 logic_agent(X):-nothing_around(X).

% Policy ruleset (DonkeyKong)
0.88 up_ladder(X):-on_ladder(P,L),same_floor(P,L).
0.47 right_ladder(X):-left_of(P,L),same_floor(P,L).
0.18 left_ladder(X):-right_of(P,L),same_floor(P,L).
A.5 EXPERIMENTAL DETAILS

We here provide more details about our implementation. We also release code together with the paper, available at https://anonymous.4open.science/r/anon-blendrl-BA06.

Hardware. All experiments were performed on one NVIDIA A100-SXM4-40GB GPU with a Xeon(R) 8174 CPU @ 3.10GHz and 100 GB of RAM.
Value function update. In the following, we consider a (potentially pretrained) actor-critic neural agent, with vϕ its differentiable state-value function parameterized by ϕ (the critic). Given a set of action rules C, let π(C,W) be a differentiable logic policy. BlendRL learns the weights of the action rules as follows. For each non-terminal state st of each episode, we store the action sampled from the policy (at ∼ π(C,W)(st)) and the next state st+1. We then update the value function and the policy as:

    δ = r + γ vϕ(st+1) − vϕ(st)    (8)
    ϕ = ϕ + δ ∇ϕ vϕ(st)    (9)
    W = W + δ ∇W ln π(C,W)(at | st).    (10)

The logic policy π(C,W) thus learns to maximize the expected return.
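For illustration, a compact autograd-based sketch of this update (Eqs. 8-10) is shown below; the function interfaces, the flat parameter tensors, and the explicit learning rate (omitted in the equations above) are our assumptions.

import torch

def actor_critic_step(value_fn, log_policy_fn, phi, W, s_t, a_t, r, s_next,
                      gamma=0.99, lr=2.5e-4):
    # phi, W: leaf tensors with requires_grad=True holding the critic parameters and rule weights.
    # value_fn(s, phi):       differentiable state value v_phi(s) (critic), a scalar tensor
    # log_policy_fn(s, a, W): log pi_{(C,W)}(a | s), differentiable in the rule weights W
    with torch.no_grad():
        delta = r + gamma * value_fn(s_next, phi) - value_fn(s_t, phi)   # TD error (Eq. 8)
    # Critic update: phi <- phi + delta * grad_phi v_phi(s_t)            # (Eq. 9)
    g_phi, = torch.autograd.grad(value_fn(s_t, phi), phi)
    # Actor update:  W <- W + delta * grad_W ln pi_{(C,W)}(a_t | s_t)    # (Eq. 10)
    g_W, = torch.autograd.grad(log_policy_fn(s_t, a_t, W), W)
    with torch.no_grad():
        phi += lr * delta * g_phi
        W += lr * delta * g_W
    return phi, W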
A.6 TRAINING DETAILS

We hereby provide further details about the training. Details regarding the environments are provided in the next section (A.7). We used the Adam optimizer (Kingma & Ba, 2015) for all baselines.
BlendRL. We adopted an implementation of the PPO algorithm from the CleanRL project (Huang et al., 2022). Hyperparameters are shown in Table 2. The object-centric critic is described in Table 3. We provide pseudocode for BlendRL policy reasoning in Algorithm 1.
Parameter               Value      Explanation
blend_ent_coef          0.01       Entropy coefficient for blending regularization (Eq. 3)
blender_learning_rate   0.00025    Learning rate for the blending module
clip_coef               0.1        Surrogate clipping coefficient ϵ
ent_coef                0.01       Entropy coefficient for policy optimization
γ                       0.99       Discount factor for future rewards
learning_rate           0.00025    Learning rate for neural modules
logic_learning_rate     0.00025    Learning rate for logic modules
max_grad_norm           0.5        Maximum norm for gradient clipping
num_envs                512        Number of parallel environments
num_steps               128        Number of steps per policy rollout
total_timesteps         20000000   Total number of training timesteps

Table 2: Hyperparameters for BlendRL training.
Layer                     Configuration
Fully Connected Layer 1   Linear(N_in, 120)
Activation 1              ReLU()
Fully Connected Layer 2   Linear(120, 60)
Activation 2              ReLU()
Fully Connected Layer 3   Linear(60, N_out)

Table 3: Object-centric critic networks for BlendRL.
NUDGE. We used the public code2 to perform experiments with the CleanRL training script. All hyperparameters are shared with the BlendRL agents, as described in Table 2. We used the same ruleset for NUDGE agents as for BlendRL agents. The critic network on object-centric states is described in Table 3.

2 https://github.com/k4ntz/NUDGE
Algorithm 1 BlendRL Policy Reasoning
Input: π_θ^neural, π_ϕ^logic, V_µ^CNN, V_ω^OC, blending function B, state (x, z)
1: β = B(x, z)                                            # Compute the blending weight β
2: action ∼ β · π_θ^neural(x) + (1 − β) · π_ϕ^logic(z)    # Sample an action from the mixed policy
3: value = β · V_µ^CNN(x) + (1 − β) · V_ω^OC(z)           # Compute the state value using β
4: return action, value
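A minimal Python rendering of Algorithm 1 could look as follows; the module interfaces (policies returning probability vectors over a shared action space, value functions returning scalars) are our assumptions.

import torch

def blendrl_act(pi_neural, pi_logic, v_cnn, v_oc, blender, x, z):
    # x: raw pixel observation, z: object-centric state.
    beta = blender(x, z)                                          # blending weight in [0, 1]
    probs = beta * pi_neural(x) + (1.0 - beta) * pi_logic(z)      # mixed policy (line 2)
    action = torch.distributions.Categorical(probs=probs).sample()
    value = beta * v_cnn(x) + (1.0 - beta) * v_oc(z)              # blended state value (line 3)
    return action, value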
Neural PPO. We used an implementation of the neural PPO algorithm3 from the CleanRL project. The agent consists of an actor network and a critic network that share their weights except for the last layer. The base network shared by the actor and the critic is shown in Table 4. A separate linear layer with non-shared weights follows on top of the base network for the actor and for the critic.

3 https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo_atari.py
Layer                    Configuration
Convolutional Layer 1    Conv2d(4, 32, 8, stride=4)
Activation 1             ReLU()
Convolutional Layer 2    Conv2d(32, 64, 4, stride=2)
Activation 2             ReLU()
Convolutional Layer 3    Conv2d(64, 64, 3, stride=1)
Activation 3             ReLU()
Flatten Layer            Flatten()
Fully Connected Layer    Linear(64 * 7 * 7, 512)
Activation 4             ReLU()

Table 4: Layer configuration of the neural PPO agent.
A.7 ENVIRONMENT DETAILS

We hereby provide details of the environments used in our experiments. We used HackAtari4, a framework that offers modifications of Atari environments to simplify them or to alter them for robustness tests. The modifications we used in our experiments are shown in Table 5.

4 https://github.com/k4ntz/HackAtari
Environment   Option                    Explanation
Kangaroo      disable falling coconut   Disable the falling coconut
              change level0             The first stage is repeated
              random position           Randomize the starting position
Seaquest      No option
DonkeyKong    change level0             The first stage is repeated
              random position           Randomize the starting position

Table 5: Options and explanations for the different environments.
A.8 ABLATION STUDY: NEURAL VS. LOGIC BLENDING MODULE

We here provide an ablation study on the logic blending module, rerunning BlendRL agents with a neural blending module on Kangaroo and Seaquest. As shown in Table 6, the agents equipped with logic-based blending modules outperform the neural ones.
Figure 9 presents the entropies of the blending weights. An entropy value of 1.0 signifies that neural and logic policies are equally prioritized (with each receiving a weight of 0.5), while an entropy value of 0.0 indicates that only one policy is active, with the other being completely inactive. In both environments, the logic blending module consistently produced higher entropy values for the blending weights, indicating effective utilization of both neural and symbolic policies without overfitting to either one.
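For reference, the reported entropy can be read as the binary entropy of the blending weight β, normalized so that β = 0.5 yields 1.0 (i.e., a base-2 logarithm); this normalization is our reading of the plot, sketched below.

import torch

def blending_entropy(beta, eps=1e-8):
    # Binary entropy H(beta) = -beta*log2(beta) - (1 - beta)*log2(1 - beta):
    # 1.0 when both policies receive weight 0.5, 0.0 when one policy is fully inactive.
    beta = beta.clamp(eps, 1.0 - eps)
    return -(beta * torch.log2(beta) + (1.0 - beta) * torch.log2(1.0 - beta))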
Episodic Return    Kangaroo       Seaquest
Neural Blending    91.6 ± 43.6    17.8 ± 0.55
Logic Blending     186 ± 12       39.9 ± 7.12

Table 6: Comparison: neural vs. logic policy blending. Average returns of trained agents over 10 different random seeds are shown. The logic blending module consistently outperforms the neural one.

Figure 9: Logic blending can keep both policies effective. Entropies over the blending weights (y-axis: Entropy, x-axis: Environment) are shown for logic and neural blending on Kangaroo and Seaquest.
A.9 PROGRESSIVE ENVIRONMENT ILLUSTRATION

As explained by Delfosse et al. (2024c), Seaquest is a progressive environment in which the agent first needs to master easy tasks before being provided with more complex ones in newly unlocked parts of the environment, reflected here by the number of enemies to be shot.
Figure 10: Seaquest is a progressive environment. The more points/reward an agent collects (depicted at the top), the more enemies spawn.