.. _time-management:

Time Management and Control Flow
==================================

*EtaCtrl* acts as the central coordinator that drives the interaction between the agent, the environment, and any optional components such as external models. It controls when the environment performs a step, when the agent receives observations, and when actions are applied. Understanding this control flow is important when integrating custom environments or interpreting logged data.

Control Loop
-------------

The sequence diagram below shows the full data flow for a ``play()`` call (the same structure applies during ``learn()``). The participants are:

- **EtaCtrl**: the outer controller that manages the episode loop.
- **Agent**: the control algorithm that maps observations to actions.
- **Env**: the environment under control.
- **ExternalModel**: an optional model called from inside the environment step (e.g. an FMU simulator or Pyomo model).
- **Scenario Manager**: provides pre-loaded time-series data to the environment.

.. figure:: figures/eta_ctrl_data_flow.svg
   :alt: Sequence diagram of the EtaCtrl control loop

   Sequence diagram of the EtaCtrl control loop during ``play()``.

The episode starts with ``reset()``. The environment loads scenario data at index ``t=0``, performs any internal initialisation, and returns the initial observations to *EtaCtrl*. No time has elapsed at this point: the reset corresponds to the state *before* the first sampling period begins.

The loop then runs until ``n_steps`` reaches ``n_episode_steps``. In each iteration:

1. *EtaCtrl* calls ``agent.predict(observations)`` to obtain actions for the current timestep.
2. *EtaCtrl* calls ``env.step(actions)``.
3. Inside the step, the environment first loads scenario data at index ``t = n_steps``, then calls the environment-specific code in ``BaseEnv._step``.
4. The environment increments ``n_steps``, loads scenario data at the next timestep, and returns the new observations and the reward.

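
The four steps above can be sketched with a toy stand-in for the environment. This is only an illustration of the indexing under stated assumptions: ``ToyEnv``, ``toy_agent`` and ``run_episode`` are hypothetical names, the three-element return value of ``step`` is a simplification, and none of this is the actual *EtaCtrl* implementation.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class ToyEnv:
        """Hypothetical environment that only tracks the scenario index."""

        scenario: list[float]
        n_episode_steps: int
        n_steps: int = 0

        def reset(self) -> dict:
            # Reset loads scenario data at t = 0; no sampling period is consumed.
            self.n_steps = 0
            return {"t": 0, "scenario": self.scenario[0]}

        def step(self, action: float) -> tuple[dict, float, bool]:
            # Environment-specific code would run here with data at t = n_steps.
            self.n_steps += 1
            # The returned observation carries scenario data of the next timestep.
            obs = {"t": self.n_steps, "scenario": self.scenario[self.n_steps]}
            done = self.n_steps >= self.n_episode_steps
            return obs, 0.0, done


    def toy_agent(observation: dict) -> float:
        """Hypothetical agent that always returns the same action."""
        return 0.0


    def run_episode(env: ToyEnv, predict) -> list[int]:
        """Run one episode and collect the time index of every observation."""
        observation = env.reset()
        indices = [observation["t"]]
        done = False
        while not done:
            action = predict(observation)
            observation, _reward, done = env.step(action)
            indices.append(observation["t"])
        return indices


    indices = run_episode(ToyEnv(scenario=[10.0, 11.0, 12.0, 13.0], n_episode_steps=3), toy_agent)
    print(indices)  # [0, 1, 2, 3]

With ``n_episode_steps = 3`` the sketch reports observations at indices 0 through 3, so the scenario list needs four entries.
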

Therefore, the observations returned after the *k*-th step (counting from *k* = 0) carry scenario data at index ``t = k + 1``, and actions computed from those observations also target timestep ``t = k + 1``. The last step sets ``n_steps = n_episode_steps``, so the last scenario data index used is ``t = n_episode_steps``.

Scenario Data and Time Indexing
---------------------------------

The following diagram illustrates the time indexing concretely using a tank heating example. The temperature trace shows how the state evolves; the indexed markers at the top and bottom show at which time index observations are reported and actions take effect.

.. plot:: pyplots/kea_data_flow.py

The key points to take away from the diagram:

- **Reset does not advance time.** The reset returns the observation at ``t=0`` without consuming a sampling period.
- **Each step advances time by exactly one sampling period.** The step from ``t=k`` to ``t=k+1`` is where physical time passes.
- **The last observation index is** ``t = n_episode_steps``. Scenario data must therefore cover at least ``n_episode_steps + 1`` data points (indices 0 through ``n_episode_steps``).

Timing Assumptions
-------------------

The *EtaCtrl* framework inherits the synchronous, single-agent execution model of *gymnasium* and *stable_baselines3*: time advances in fixed steps, and the next step cannot begin before the current one returns. This implies the following assumption:

.. note::

   The computation time of the agent must be significantly shorter than the sampling time of the environment. If this is not the case, the framework's notion of discrete, synchronous timesteps breaks down and results will no longer correspond to real-time behaviour.

In practice, this is satisfied whenever reinforcement learning inference or rule-based control runs in software (microseconds to milliseconds) and the physical process is sampled at intervals of seconds or more.

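
One way to check this assumption for a given setup is to time the agent's prediction call and compare it with the sampling time. The following is a minimal sketch, not an *EtaCtrl* feature: the helper ``agent_is_fast_enough``, its parameters, and the default threshold of 1 % of the sampling time are all hypothetical choices made for illustration.

.. code-block:: python

    import time
    from typing import Any, Callable


    def agent_is_fast_enough(
        predict: Callable[[Any], Any],
        observation: Any,
        sampling_time: float,
        max_fraction: float = 0.01,
        repeats: int = 20,
    ) -> bool:
        """Return True if the slowest measured call stays below the allowed share.

        ``sampling_time`` is given in seconds; ``max_fraction`` is the share of
        one sampling period that a single prediction may take (an assumption).
        """
        durations = []
        for _ in range(repeats):
            start = time.perf_counter()
            predict(observation)
            durations.append(time.perf_counter() - start)
        # Use the worst case rather than the mean to be conservative.
        return max(durations) <= max_fraction * sampling_time


    # Example: a trivial rule-based controller checked against a 60 s sampling time.
    fast = agent_is_fast_enough(lambda obs: 0.0, observation=None, sampling_time=60.0)

What counts as "significantly shorter" depends on the application, so the threshold should be chosen per use case.
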

Actuation Delay
^^^^^^^^^^^^^^^^

If the application requires modelling a delay between when an action is decided and when it takes effect (either to represent a physical actuator latency or a deliberate hold-over logic), this can be implemented as an environment wrapper without modifying the core environment:

.. code-block:: python

    from collections import deque

    import gymnasium as gym
    import numpy as np


    class ActuationDelayWrapper(gym.Wrapper):
        """Delays actions by *delay* timesteps."""

        def __init__(self, env: gym.Env, delay: int = 1) -> None:
            super().__init__(env)
            if delay < 0:
                raise ValueError("delay must be non-negative")
            self._delay = delay
            # Keep only the last delay + 1 actions so the buffer stays bounded.
            self._action_buffer: deque = deque(maxlen=delay + 1)

        def reset(self, **kwargs):
            # Discard actions buffered during the previous episode.
            self._action_buffer.clear()
            return self.env.reset(**kwargs)

        def step(self, action):
            self._action_buffer.append(action)
            if len(self._action_buffer) <= self._delay:
                # Not enough history yet: apply an all-zero placeholder action.
                delayed_action = np.zeros_like(action)
            else:
                # Oldest buffered action, i.e. the one decided *delay* steps ago.
                delayed_action = self._action_buffer[0]
            return self.env.step(delayed_action)

During the first *delay* steps of an episode the wrapper applies an all-zero action, so zero must be a valid action for the wrapped environment. This approach still requires the agent to be fast enough relative to the sampling time: the wrapper only shifts *which* action is applied; it does not compensate for slow agents.
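
To see which action reaches the environment at each step, the delay logic can be traced in isolation, without *gymnasium* or an environment. The sketch below is a standalone reimplementation for illustration only; ``trace_delay`` and its ``default`` parameter are hypothetical names, and scalar actions are assumed for readability.

.. code-block:: python

    from collections import deque


    def trace_delay(actions: list[float], delay: int, default: float = 0.0) -> list[float]:
        """Return the action that would be applied at each step for a given delay."""
        # The buffer holds at most delay + 1 actions; index 0 is the oldest one.
        buffer: deque = deque(maxlen=delay + 1)
        applied = []
        for action in actions:
            buffer.append(action)
            if len(buffer) <= delay:
                # Warm-up phase: no action is old enough yet.
                applied.append(default)
            else:
                applied.append(buffer[0])
        return applied


    applied = trace_delay([1.0, 2.0, 3.0, 4.0], delay=2)
    print(applied)  # [0.0, 0.0, 1.0, 2.0]

With ``delay=2`` the first two steps use the placeholder, and from the third step on each applied action is the one decided two steps earlier.
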