The problem

Consider operators doing their tasks on a production line, some tasks are more complex than others and can lead to line stoppage if too many complex tasks are executed by the same worker consecutively.

We can represent this as a matrix where:

The workers cannot change places, but we can still reorder the products. Thanks to some business rules and variables we can easily simulate the stoppage duration generated by a certain order of sequence, all of this simulation is part of the environment.

W0 W1 W2 Wn
Prod 1 34 49 55 9
Prod 2 59 57 11 18
Prod 3 30 22 5 32
Prod m 5 14 12 11

<aside> 🎯 The goal ****of the agent will be to order the rows of the sequence in a way that minimizes the duration of the stoppage by applying actions on the environment.

</aside>

The environment (The digital twin of the problem)

The environment is represented by:

The state

The state for this task is composed of:

The two matrices are stacked on the channel dimension to create the state.

The reward