A Deep Q-Learning Network for Ship Stowage Planning Problems
10.1515/pomr-2017-0111
Yifan Shen¹, Ning Zhao²*, Mengjue Xia¹, Xueqiang Du²
¹ Scientific Research Academy, Shanghai Maritime University, China
² Logistics Engineering College, Shanghai Maritime University, China
* corresponding author
ABSTRACT
The ship stowage plan is the management link between quay crane scheduling and yard crane scheduling, and its quality greatly affects terminal productivity. Previous studies mainly focus on solving the stowage planning problem with online search algorithms, whose efficiency is significantly affected by case size. In this study, a Deep Q-Learning Network (DQN) is proposed to solve the ship stowage planning problem. With DQN, the massive calculation and training is done in a pre-training stage, while in the application stage a stowage plan can be made in seconds. To formulate the network input, decision factors are analyzed to compose the feature vector of the stowage plan. States subject to constraints, available actions, and the reward function of the Q-value are designed. With this information and design, an 8-layer DQN with a mean-square-error evaluation function is formulated to learn stowage planning. At the end of this study, several production cases are solved with the proposed DQN to validate its effectiveness and generalization ability. The results show that DQN is well suited to the ship stowage planning problem.
Keywords: Deep Q-Learning Network (DQN); Container terminal; Ship stowage plan; Markov decision process; Value function approximation; Generalization
$$\varphi_i = P_i - S_i,\ \text{or } 0 \text{ otherwise} \qquad (11)$$

(8) Represents the normalized weight of the selected container.
(9) Represents the normalized tier number of the selected slot.
(10) Represents the normalized weight gap between the selected container and the container located right under it on the ship.
(11) Represents the potential of the selected match of container and slot (or action), i.e. the number of remaining lighter containers minus the number of available ship slots above the selected slot, which expresses the influence of the selected action on later stowage planning.
(12) Represents the normalized number of reshuffles caused by this action.
(13) Represents the normalized sequential gap between the selected container and the containers located to the left of the selected slot.
(14) Represents the normalized sequential gap between the selected container and the container located right under the selected container.
(15) Represents the normalized sum of the sequential gap in (13) and the sequential gap in (14).
(16) Represents whether this container is located in the same yard bay as the previous one.
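As an illustration of how such a feature vector might be assembled for one candidate container-slot match, a minimal sketch is given below. The field names, normalization constants, and helper counts (remaining_lighter, reshuffles, the sequence gaps) are assumptions for illustration, not the paper's exact definitions.

```python
from dataclasses import dataclass

@dataclass
class Container:
    weight: float        # gross weight of the container (t)
    sequence: int        # loading sequence number in the yard
    yard_bay: int        # yard bay where the container is stacked

@dataclass
class Slot:
    tier: int              # tier number of the ship slot
    below_weight: float    # weight of the container right under this slot (0 if none)
    free_slots_above: int  # available ship slots above this slot

def match_features(c: Container, s: Slot, remaining_lighter: int,
                   reshuffles: float, left_seq_gap: float, below_seq_gap: float,
                   same_bay_as_previous: bool,
                   max_weight: float = 30.0, max_tier: int = 8) -> list:
    """Assemble features (8)-(16) for one container-slot match (illustrative only)."""
    f8 = c.weight / max_weight                      # (8) normalized container weight
    f9 = s.tier / max_tier                          # (9) normalized tier of the slot
    f10 = (c.weight - s.below_weight) / max_weight  # (10) weight gap to the container below
    f11 = remaining_lighter - s.free_slots_above    # (11) potential of the match
    f12 = reshuffles                                # (12) normalized reshuffles caused
    f13 = left_seq_gap                              # (13) sequence gap to containers on the left
    f14 = below_seq_gap                             # (14) sequence gap to the container below
    f15 = f13 + f14                                 # (15) sum of the two sequence gaps
    f16 = 1.0 if same_bay_as_previous else 0.0      # (16) same yard bay as the previous container
    return [f8, f9, f10, f11, f12, f13, f14, f15, f16]
```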
DEEP REINFORCEMENT LEARNING ALGORITHM FOR STOWAGE PLANNING PROBLEM
Figure 3 shows the framework of reinforcement learning, or Q-Learning, for the stowage plan. In the initial state of learning, the intelligent agent is like a naïve planner: every action the planner takes yields a reward that updates Q(s, a), and the agent decides the next action for the next state depending on the updated Q(s, a); this is the iteration of reinforcement learning. In this way, the agent learns from iterations of attempts and rewards.
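To make this iteration concrete, the sketch below shows a plain tabular Q-Learning loop of the kind the framework describes: act, observe a reward and the next state, update Q(s, a), and continue. The environment interface (reset, step, available_actions), the learning rate, and the epsilon-greedy choice are assumptions for illustration; the paper replaces the table with the value function approximation discussed below.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning loop: the agent acts, receives a reward, and updates Q(s, a)."""
    Q = defaultdict(float)  # Q[(state, action)] -> estimated value
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            actions = env.available_actions(state)
            # epsilon-greedy: explore occasionally, otherwise act like the current planner
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-Learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            next_best = 0.0 if done else max(
                Q[(next_state, a)] for a in env.available_actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * next_best - Q[(state, action)])
            state = next_state
    return Q
```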
Here s' is the next state and a' is the next action. The partial derivative of the loss L_i(w_i) in the w_i direction is given in (18).

$$\nabla_{w_i} L_i(w_i) = \mathbb{E}_{s,a,r,s'}\!\left[\left(r + \gamma \max_{a'} Q(s', a'; w_i) - Q(s, a; w_i)\right)\nabla_{w_i} Q(s, a; w_i)\right] \qquad (18)$$

Stochastic Gradient Descent is used to optimize the loss function, and the weights are updated after every iteration, which is quite similar to the traditional Q-Learning algorithm.
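The sketch below illustrates one such stochastic-gradient step following the form of (18). For readability it uses a simple linear Q(s, a; w) on the match feature vector rather than the paper's 8-layer network, and NumPy in place of a deep-learning framework; both simplifications are assumptions for illustration only.

```python
import numpy as np

def q_value(w: np.ndarray, features: np.ndarray) -> float:
    """Linear approximation Q(s, a; w) = w . phi(s, a); for a linear model,
    grad_w Q(s, a; w) is simply the feature vector phi(s, a)."""
    return float(w @ features)

def sgd_step(w, phi_sa, reward, next_phis, gamma=0.9, lr=0.01, done=False):
    """One stochastic-gradient update in the direction given by Eq. (18):
    w <- w + lr * (r + gamma * max_a' Q(s', a'; w) - Q(s, a; w)) * grad_w Q(s, a; w)."""
    target = reward
    if not done and len(next_phis) > 0:
        target += gamma * max(q_value(w, p) for p in next_phis)
    td_error = target - q_value(w, phi_sa)
    return w + lr * td_error * phi_sa  # grad_w Q = phi in the linear case
```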
In order to approximate the reward for new states that have never appeared before, value function approximation is introduced to improve the generalization ability. Unlike supervised learning, reinforcement learning does not have known labels for training; labels are obtained through iterations. When a state-action pair is updated, the change of weights for this match can affect other matches, which invalidates earlier state-action estimates and in turn leads to longer training time or even failure of training. Thus, an experience replay method is introduced to prevent this.

Experience replay stores the experience at time t as (φ_t, a_t, r_t, φ_{t+1}) in an experience history queue D; D is then sampled stochastically as (φ_j, a_j, r_j, φ_{j+1}) to perform mini-batch updates of the weights. This ensures that every history point is considered when a new data point is incorporated. Experience replay stores all previous states and actions in a sequence so that the objective function (19) is minimized when the Q-function is updated.

$$L_i(w_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\!\left[\left(r + \gamma \max_{b} Q(s', b; w_i) - Q(s, a; w_i)\right)^2\right] \qquad (19)$$
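A minimal sketch of the experience replay mechanism is given below: transitions are pushed into a bounded queue D and mini-batches are drawn uniformly at random, matching the expectation over U(D) in (19). The buffer capacity, batch size, and the train_step callable are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience history queue D storing transitions (phi_t, a_t, r_t, phi_{t+1})."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, phi, action, reward, next_phi, done):
        self.buffer.append((phi, action, reward, next_phi, done))

    def sample(self, batch_size: int):
        """Uniform random mini-batch, i.e. (s, a, r, s') ~ U(D) as in Eq. (19)."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def replay_update(buffer, train_step, batch_size=32):
    """Draw a mini-batch from D and let train_step minimize the squared TD error of Eq. (19)."""
    batch = buffer.sample(batch_size)
    if batch:
        train_step(batch)  # e.g. one SGD step on (r + gamma * max_b Q(s', b; w) - Q(s, a; w))^2
```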
4. Activation function
There are three widely used activation functions: TanH, Sigmoid, and ReLU (Rectified Linear Unit). ReLU has better training performance, especially with respect to gradient attenuation and network sparsity. Thus, ReLU is used as the activation function in this research.
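For reference, the three activation functions mentioned above can be written as follows; the NumPy forms are a minimal sketch, not tied to any particular framework.

```python
import numpy as np

def tanh(x):
    """TanH: saturates at +/-1, so gradients attenuate for large |x|."""
    return np.tanh(x)

def sigmoid(x):
    """Sigmoid: saturates at 0 and 1, with the same gradient-attenuation issue."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """ReLU(x) = max(0, x): no saturation for x > 0 and zeros for x < 0,
    which limits gradient attenuation and yields sparse activations."""
    return np.maximum(0.0, x)
```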
In stowage planning DQN learning, the robustness of the algorithm refers to whether the training algorithm can obtain a good DQN from various stowage planning cases.
In the generalization analysis, the DQN trained with Case 1 is used to plan Case 2 and Case 3. To verify the robustness of the proposed algorithm, Case 2 and Case 3 are then used as training sets to obtain new DQNs. The planning results of the different DQNs are shown below.
Tab. 4. Training parameters and time consumption

Training Set | No. of Containers | Iterations | Training Time
Case 1       | 19                | 150,000    | 2 h 46 min
Case 2       | 19                | 150,000    | 2 h 53 min
Case 3       | 28                | 150,000    | 4 h 21 min
Tab. 5. Planning time (s) of each trained DQN on the test cases

DQN Training Set | Case 1 | Case 2 | Case 3
Case 1           | 0.069  | 0.073  | 0.131
Case 2           | 0.071  | 0.081  | 0.142
Case 3           | 0.068  | 0.073  | 0.137
As shown in Table 5, the different test cases give good results with the different trained DQNs, and the efficiency of the different DQNs is quite similar, which means that the influence of different training and test cases is negligible. Thus, the proposed algorithm has good stability and robustness.

ACKNOWLEDGEMENTS

This research was supported by the National Natural Science Foundation of China (No. 61540045), the Science and Technology Commission of Shanghai Municipality (No. 15YF1404900, No. 14170501500), the Ministry of Education of the PR China (No. 20133121110005), the Shanghai Municipal Education Commission (No. 14ZZ140), and Shanghai Maritime University (No. 2014ycx040).