A Reinforcement Learning Framework For Container Selection and Ship Load Sequencing in Ports
ABSTRACT
We describe a reinforcement learning (RL) framework for selecting and sequencing containers to load onto ships in ports. The goal is to minimize an approximation of the number of crane movements required to load a given ship, known as the shuffle count. It can be viewed as a version of the assignment problem in which the sequence of assignment is of importance and the task rewards are order dependent. The proposed methodology is developed specifically to be usable on ship and yard layouts of arbitrary scale, by dividing the full problem into fixed future horizon segments and through a redefinition of the action space into a binary choice framework. Using data from real-world yard and ship layouts, we show that our approach solves the single crane version of the loading problem for entire ships with better objective values than those computed using standard metaheuristics.

[Figure 1: Graphical illustration of the problem. Each slot is associated with a bay, row, and tier. The figure shows the ship and the yard, with the x, y, and z axes of the yard marked.]
KEYWORDS
Reinforcement learning; Single agent planning & scheduling
ACM Reference Format:
Richa Verma, Sarmimala Saikia, Harshad Khadilkar, Puneet Agarwal, Gautam Shroff and Ashwin Srinivasan. 2019. A Reinforcement Learning Framework for Container Selection and Ship Load Sequencing in Ports. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 3 pages.

Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), N. Agmon, M. E. Taylor, E. Elkind, M. Veloso (eds.), May 13–17, 2019, Montreal, Canada. © 2019 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.
1 INTRODUCTION
Busy ports can handle more than 30 million containers per year, or over 80 thousand containers per day [4]. Container loading and unloading operations in the storage yard are among the most complex in the industry [1], and reducing the time it takes to load a ship can have a significant effect on the yard efficiency. Containers are stored in the form of vertical pillars in the yard, with each pillar possibly a dozen containers high. Whenever a container in a lower position in a stack needs to be accessed, the containers above it have to be moved away first, increasing non-productive moves referred to as shuffles. The algorithm described in this paper aims to minimise the total number of shuffles required to load an outbound ship, given a known layout of containers in the yard. This is a combinatorial optimisation problem, which is difficult to solve using traditional approaches. The key contributions are: (1) formulation of the container loading task as an RL problem, (2) keeping the size of the state-action space independent of the problem instance for scalability, (3) reward shaping for fast convergence, and (4) demonstration of the generalisation and transfer learning ability of the trained policy network on real-world problem instances.
2 PROBLEM DESCRIPTION
An illustration of the problem is shown in Figure 1. The goal is to minimize the total number of rearrangements (shuffles) required for loading the whole ship. While the problem structure and linear objective function are similar to the classical assignment problem [2], the number of agents (containers) and tasks (slots) is not equal, and the cost is dependent on the loading order. We describe a simple version of the loading problem where there is a single loading crane and a fixed set of required container characteristics for each slot on the ship. Let ψ be the set of all slots on a ship, with |ψ| = N. The slots have to be filled in a predefined order {ψ_1, ..., ψ_N} by exactly one container per slot. The requirements for each slot are known, and we assume that there are M_ψ ≤ N unique combinations (which we call mask IDs) among the slots in ψ. The set of all containers in the yard is denoted by κ_0 with |κ_0| = K ≥ N. Based on physical characteristics and the cargo carried, each container C_j ∈ κ_0 (j ∈ [0, K]) is also mapped to a mask ID in {1, ..., M_ψ, ..., M_κ}.
We define the operator µ, which returns the mask ID of a slot or a container as a scalar integer value. A container C_j is eligible for a slot ψ_i if and only if 1 ≤ µ(C_j) = µ(ψ_i) ≤ M_ψ. The position of a container C_j in the yard is given by P(C_j) = {x_j, y_j, z_j}, with the axes marked in Figure 1. Instead of physical dimensions, the values of x, y, and z are the indices of containers along the relevant axes. A set of containers with the same x and y coordinates but different z coordinates is known as a pillar. The load planning problem starts with the initial yard state κ_0, where the position P(C_j) of each container is given. The initial state of the ship is empty, and the first slot to be filled is ψ_1. We denote the combined ship and yard initial state by loading step s_0. One or more containers may need to be moved away by the loading equipment in order to access the chosen container for slot ψ_ℓ, depending on its position in the pillar. These unproductive operations are known as shuffles. The number of shuffles on choosing C_j at loading step s_ℓ is denoted by Q(C_j, ℓ). When loading step s_ℓ is completed, we assume that all containers C_j* that were moved away are returned to their original pillar, with updated z coordinates z_j* ← z_j* − 1. This reflects the departure of the loaded container from the yard. The new yard layout with one fewer container is designated κ_{ℓ+1} and the algorithm proceeds to loading step s_{ℓ+1}, terminating at step s_N. The objective of load planning is to minimise the total shuffle count, J = Σ_{ℓ=0}^{N−1} Q(C_j, ℓ).
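To make these definitions concrete, the following sketch (our illustration, not the authors' implementation) represents the yard as a set of container positions, computes Q(C_j, ℓ) as the number of containers stacked above the chosen container in its pillar, and accumulates the objective J; the class and function names are hypothetical.

```python
# Illustrative sketch of the yard model, the per-step shuffle count Q(C_j, l),
# and the objective J. Class and function names are our assumptions, not the
# authors' implementation.
class Yard:
    def __init__(self, positions):
        # positions: {container_id: (x, y, z)} yard indices, as in P(C_j) = {x_j, y_j, z_j}
        self.positions = dict(positions)

    def shuffle_count(self, container_id):
        """Q(C_j, l): containers above C_j in its pillar that must be moved away first."""
        x, y, z = self.positions[container_id]
        return sum(1 for (px, py, pz) in self.positions.values()
                   if px == x and py == y and pz > z)

    def load(self, container_id):
        """Remove the loaded container; containers above it drop by one tier (z -> z - 1)."""
        x, y, z = self.positions.pop(container_id)
        for cid, (px, py, pz) in list(self.positions.items()):
            if px == x and py == y and pz > z:
                self.positions[cid] = (px, py, pz - 1)


def eligible(container_mask, slot_mask, m_psi):
    """A container fits a slot iff 1 <= mu(C_j) = mu(psi_i) <= M_psi."""
    return 1 <= container_mask == slot_mask <= m_psi


def total_shuffles(yard, load_plan):
    """J = sum over loading steps of Q(C_j, l) for the container chosen at each step."""
    J = 0
    for container_id in load_plan:   # load_plan: one container per slot, in slot order
        J += yard.shuffle_count(container_id)
        yard.load(container_id)
    return J
```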
3 METHODOLOGY
We model the problem as a Markov Decision Process (S, A, R, P, γ) [3], where S and A are the sets of states and actions respectively, P represents the transition function, R denotes the reward function, and γ is the discount factor. A standard reinforcement learning setup consists of an agent interacting with its environment through a series of actions. At every time-step t, the agent receives the current state s_t of a fully-observed environment, takes an action a_t based on a policy π : s → a, and receives a scalar reward r_t = R(s_t, a_t). The goal of learning is to maximize the expected cumulative discounted reward, E[Σ_{i=t}^{∞} γ^i r_i]. We represent the policy π by a multilayer perceptron (MLP) network with parameters θ_P. We use policy gradients to train the RL agent.
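The paper does not specify the network architecture, framework, or training hyperparameters; the sketch below is one plausible realisation of an MLP policy over the binary action space trained with a REINFORCE-style policy gradient (the use of PyTorch, the layer sizes, and the discounting details are our assumptions).

```python
# Illustrative MLP policy over the binary action space and a REINFORCE-style
# policy gradient update. Framework (PyTorch), layer sizes, and the discounting
# scheme are assumptions, not the paper's reported settings.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden=64, n_actions=2):  # actions: pick / move
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # returns an action distribution; sampling from it gives pick (0) or move (1)
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce_update(policy, optimizer, episode, gamma=0.99):
    """episode: list of (state, action, reward) tuples from one played episode."""
    returns, G = [], 0.0
    for _, _, r in reversed(episode):            # discounted return-to-go
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    loss = 0.0
    for (state, action, _), G in zip(episode, returns):
        log_prob = policy(state).log_prob(torch.tensor(action))
        loss = loss - log_prob * G               # REINFORCE objective: -log pi(a|s) * return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```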
We first divide the problem of computing the entire load plan into single loading steps s_ℓ. Each loading step is composed of multiple time steps s_t in which suggested containers are rejected, until a container is selected for loading into slot ψ_ℓ. The state space is restricted to look ahead to a fixed number of slots on the ship, and to a fixed number of containers in the yard. This standardises the size of the state space for all ship and yard layouts, facilitating scalability. The actual mask IDs of slots and containers are represented in the state as relative numbers, based on how imminently a given mask ID needs to be loaded. This makes the RL formulation agnostic towards actual mask IDs, facilitating generalisability. Finally, the reward does not depend on absolute shuffle counts, but on improvement over previous episodes. This property allows the RL to work in instances with different ‘optimal’ shuffle counts.
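The relative re-labelling can be made concrete with a small look-up table: the current slot's mask ID maps to 0, the next slot's to 1, and so on up to the horizon h, with every other mask ID collapsing to a single "not imminent" value (the mapping itself is detailed in the next paragraph; the helper functions below are our illustrative assumption, not the paper's code).

```python
# Sketch of the relative mask-ID look-up table over a fixed horizon of h slots.
# The current slot maps to 0 (most imminent), the next to 1, and so on; mask IDs
# not needed within the horizon share the value h. Feature layout is an assumption.
def relative_mask_table(slot_mask_ids, current_index, h):
    """Map the actual mask IDs of slots [current_index, current_index + h) to 0..h-1."""
    table = {}
    for offset in range(h):
        i = current_index + offset
        if i < len(slot_mask_ids):
            table.setdefault(slot_mask_ids[i], offset)  # keep the most imminent offset
    return table

def relative_mask(mask_id, table, h):
    """Relative mask ID of a slot, container, or pillar; h means 'not in the horizon'."""
    return table.get(mask_id, h)
```

With this re-labelling, two instances with entirely different absolute mask IDs produce the same encoding whenever their upcoming-slot structure matches, which is what allows the trained policy to transfer across ships.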
The environment uses a scouting procedure to identify pillars in the yard that hold at least one eligible container for the current slot, and presents them to the RL agent one at a time. The environment chooses one of the eligible containers in this set as the current suggestion to the RL agent. The agent computes a binary decision based on the current suggestion and the remaining pillars: whether to pick the presented container for loading in the current slot, or to move to the next option (A = {pick, move}). The decision is based on the spatio-temporal context provided for the container, as encoded in the state. For each container loading decision (slot ψ_i, yard state κ_{i−1}, loading step s_{i−1}), we only include the current slot and a fixed horizon length h of upcoming slots in a look-up table which is used to encode a portion of the state. The mask IDs associated with this slot sequence are mapped to relative mask IDs for each container selection decision. Through the look-up table, the current slot's mask ID is mapped to a value of 0, indicating imminent loading (µ(ψ_i) → 0). The mask ID of the next slot gets a value of 1 (µ(ψ_{i+1}) → 1), and so on, up to h upcoming slots. The pillars chosen by the environment are also encoded using relative mask IDs and presented to the RL agent as part of the state.
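A possible shape of one loading step, with the environment scouting eligible pillars and the agent taking a sequence of binary pick/move decisions, is sketched below; the helper callables and the forced pick when the last option is reached are our assumptions rather than details given in the paper.

```python
# Sketch of one loading step s_l as repeated binary pick/move decisions.
# `candidates` would come from the environment's scouting procedure and
# `build_state` from the relative mask-ID encoding; both, and the forced pick
# on the final option, are illustrative assumptions.
def run_loading_step(policy, candidates, build_state):
    """candidates: ordered (pillar, suggested_container) pairs eligible for the current slot."""
    for k, (pillar, suggestion) in enumerate(candidates):
        state = build_state(suggestion, candidates[k + 1:])  # suggestion + remaining pillars
        action = policy(state).sample().item()               # 0 = pick, 1 = move to next option
        if action == 0 or k == len(candidates) - 1:
            return suggestion
```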
We use a variable threshold τ as the target shuffle count to be achieved by the RL agent at the end of an episode. Prior to training, τ is set to an arbitrarily large value. A terminal (final) reward +R_fin is provided if the total shuffle count J ≤ τ. If the latest J is better than the previous best observed value, a further reward of +R_fin is given. If J > τ, a reward of −R_fin is given. The value of τ is updated to equal the average shuffle count after every E episodes during training, if this is less than its previous value. Thus the reward adapts to the desired value for a specific problem. We train the policy network in an episodic setting. The RL agent is trained on multiple instances, allowing it to generalize its knowledge. In each training iteration, we play E episodes for each ship to explore the probabilistic space of possible actions using the current policy.
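A minimal sketch of this adaptive terminal reward is given below, assuming R_fin is a fixed positive constant and that the bonus for improving on the best episode is simply added to the threshold-based reward; these implementation details are not stated in the paper.

```python
# Sketch of the adaptive terminal reward. R_FIN and the bookkeeping are assumptions.
R_FIN = 10.0  # illustrative value of R_fin

class TerminalReward:
    def __init__(self):
        self.tau = float("inf")      # target shuffle count; arbitrarily large before training
        self.best_J = float("inf")   # best total shuffle count observed so far
        self.recent = []             # shuffle counts since the last threshold update

    def episode_reward(self, J):
        """Terminal reward for an episode that ends with total shuffle count J."""
        reward = R_FIN if J <= self.tau else -R_FIN
        if J < self.best_J:          # extra +R_fin for beating the best episode so far
            reward += R_FIN
            self.best_J = J
        self.recent.append(J)
        return reward

    def maybe_update_tau(self, E):
        """Every E episodes, lower tau to the recent average if that is an improvement."""
        if len(self.recent) >= E:
            avg = sum(self.recent) / len(self.recent)
            if avg < self.tau:
                self.tau = avg
            self.recent = []
```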
4 EXPERIMENTS AND RESULTS
We compare the performance of the proposed RL methodology with two metaheuristic approaches, based on simulated annealing and genetic algorithms. Three independent datasets are used for training, testing, and comparison. These are obtained from real-world operational data for three ships and their respective yard layouts. The number of slots varies from 130 to 1391, the mask IDs from 25 to 97, and the number of containers in the yard from 25,000 to 700,000 across the three instances. The results are compiled in Table 1. Apart from simulated annealing (SA) and genetic algorithms (GA), we also use a greedy heuristic (always pick the topmost matching container) as a baseline. Among the RL approaches, RL-S1 is trained only on Ship 1 data, while RL-S2 is trained only on Ship 2 data. RL-MUL is trained on Ship 1 followed by Ship 2. None of the algorithms are trained on Ship 3. In the test data, we note that the RL approaches outperform the baselines on all three sets. RL-MUL performs best in 2 of the 3 cases, demonstrating generalisation. Finally, all three RL approaches perform well on the unseen Ship 3 data, which demonstrates transfer learning.

Algo.     Ship 1 Test Run      Ship 2 Test Run      Ship 3 Test Run
Greedy    273 (0.17)           282 (0.25)           43 (0.02)
SA        243, 261.0 (177)     269, 287.7 (296)     31, 39.4 (2.1)
GA        283 (398)            278 (557)            30 (6.2)
RL-S1     235, 252.5 (515)     269, 274.4 (377)     30, 35.1 (6.7)
RL-S2     238, 252.0 (515)     266, 273.6 (377)     29, 35.3 (6.7)
RL-MUL    236, 251.5 (515)     264, 271.3 (377)     28, 35.0 (6.7)

Table 1: Shuffle counts of algorithms. Numbers in parentheses are computation times in seconds. The first value is the best value, while the second value (if any) is the average.

REFERENCES
[1] Héctor J Carlo, Iris FA Vis, and Kees Jan Roodbergen. 2014. Storage yard operations in container terminals: Literature overview, trends, and research directions. European Journal of Operational Research 235, 2 (2014), 412–430.
[2] J Munkres. 1957. Algorithms for the Assignment and Transportation Problems. J. Soc. Indust. Appl. Math. 5, 1 (March 1957), 32–38.
[3] Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
[4] World Shipping Council. 2018. Ports: About the Industry. (Accessed: March 2018). http://www.worldshipping.org/about-the-industry/global-trade/ports.