Sequence optimization using reinforcement learning with function approximation

The problem

Consider operators doing their tasks on a production line. Some tasks are more complex than others, and the line can stop if the same worker executes too many complex tasks consecutively.

We can represent this as a matrix where:

each row is a product to build
each column represents a worker
values of the matrix represent the complexity of the task

The workers cannot change places, but we can reorder the products. Using a set of business rules and variables, we can simulate the stoppage duration that a given product order generates. This simulation is part of the environment.

	W0	W1	W2	Wn
Prod 1	34	49	55	9
Prod 2	59	57	11	18
Prod 3	30	22	5	32
Prod m	5	14	12	11
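
As a minimal sketch, the table above can be held in a NumPy array, and reordering the products then amounts to permuting its rows. The values are those of the example table (with Prod m and Wn taken as the fourth row and column); the variable names and the chosen order are illustrative assumptions.

```python
import numpy as np

# Rows are products, columns are workers, values are task complexities
# (taken from the example table).
complexity = np.array([
    [34, 49, 55,  9],   # Prod 1
    [59, 57, 11, 18],   # Prod 2
    [30, 22,  5, 32],   # Prod 3
    [ 5, 14, 12, 11],   # Prod m
])

# A candidate production order: position i holds the index of the product built i-th.
# This particular order is arbitrary and only serves as an illustration.
order = np.array([3, 0, 2, 1])

# Reordering the products permutes the rows; the worker columns stay fixed.
reordered = complexity[order]
print(reordered)
```

Because only rows move, each worker still sees the same set of tasks, just in a different order.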

<aside> 🎯 The goal of the agent is to order the rows of the sequence so as to minimize the stoppage duration, by applying actions to the environment.

</aside>

The environment (The digital twin of the problem)

The environment is represented by:

The state
The reward
The actions and the dynamics (the transition, i.e. the change of state caused by an action)

The state

The state for this task is composed of:

Matrix of tasks per worker
Matrix of the generated delay per worker (the current one, to preserve the Markov assumption). The delay generated by the sequence is known thanks to the business rules. Once this delay exceeds the maximum allowed delay, it generates a stoppage.

The two matrices are stacked on the channel dimension to create the state.
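
A minimal sketch of how such a state could be assembled with NumPy. The real delay comes from business rules that are not given here, so `toy_delay` below is a purely hypothetical placeholder (a worker accumulates delay when a task exceeds an assumed capacity and recovers otherwise). `CAPACITY`, `MAX_DELAY`, and the channel-first layout are likewise assumptions.

```python
import numpy as np

CAPACITY = 30   # assumed: complexity a worker absorbs per product without delay
MAX_DELAY = 40  # assumed: maximum allowed delay before the line stops


def toy_delay(tasks: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the business rules: running delay per worker."""
    delay = np.zeros(tasks.shape, dtype=float)
    carried = np.zeros(tasks.shape[1])
    for row, complexity in enumerate(tasks):
        # Delay grows when a task exceeds capacity, shrinks otherwise, never below 0.
        carried = np.maximum(carried + complexity - CAPACITY, 0.0)
        delay[row] = carried
    return delay


def build_state(tasks: np.ndarray) -> np.ndarray:
    """Stack the task matrix and the delay matrix along a leading channel axis."""
    return np.stack([tasks.astype(float), toy_delay(tasks)], axis=0)


tasks = np.array([
    [34, 49, 55,  9],
    [59, 57, 11, 18],
    [30, 22,  5, 32],
    [ 5, 14, 12, 11],
])
state = build_state(tasks)
print(state.shape)               # axes: (channels, products, workers)
stoppage = state[1] > MAX_DELAY  # cells where the assumed limit is exceeded
```

Putting the channel axis first matches the layout most convolutional layers in PyTorch expect; a channel-last framework would use `axis=-1` instead.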

The reward