Implementation

The EnGINE framework source code is available at GitHub - Implementation.

In the EnGINE publication EnGINE - JNSM 2022, we describe the requirements for the framework, the design decisions, the optimization approaches, and an evaluation of its functionality.

Brief Overview

The EnGINE framework is built using Ansible, an open-source orchestration software. Ansible uses a declarative configuration language based on YAML and Jinja2 templates. Three main components are relevant for Ansible’s functionality (a minimal playbook sketch follows the list):

  • Playbooks: YAML files expressing configurations; a playbook maps a group of hosts to a set of roles
  • Tasks: specific commands to be executed on the managed nodes
  • Roles: a way to group tasks and related content for reuse
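
For illustration, the following is a minimal, hypothetical playbook sketch (the group and role names are placeholders and not taken from the EnGINE repository), mapping the hosts of one group to a set of roles:

---
# Hypothetical playbook sketch: applies a set of roles to the hosts of one group.
# The group and role names are placeholders, not actual EnGINE roles.
- hosts: experiment_nodes
  become: true
  roles:
    - example-network-setup   # e.g., configure interfaces and qdiscs
    - example-run-services    # e.g., start the measurement applications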
Fig 1: Experiment and node management

The orchestration software runs on the management node, which executes the individual playbooks, connects over Secure Shell (SSH) to the experiment nodes, and executes the individual playbook tasks there. A typical management flow is shown in Fig 1. The management node remotely executes commands on the experiment nodes, which then run the code themselves and communicate with the other nodes. The experiment nodes then transfer the collected artifacts to the management host, which finally processes the results.
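
In Ansible terms, the management node is the control node and the experiment nodes are the managed nodes reachable over SSH. A hypothetical inventory sketch for such a setup (hostnames, group name, and addresses are placeholders, not EnGINE's actual inventory) could look as follows:

---
# Hypothetical inventory sketch: experiment nodes managed over SSH by the management node.
# Hostnames, the group name, and addresses are placeholders.
experiment_nodes:
  hosts:
    node-1:
      ansible_host: 10.0.1.1   # assumed address on the management network
    node-2:
      ansible_host: 10.0.1.2
    node-3:
      ansible_host: 10.0.1.3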

Experiment execution

EnGINE executes experiments within campaigns. As indicated in Fig 2, each experiment campaign consists of four phases:

  1. Install - Allocate and boot the necessary experiment nodes with a pre-defined OS image
  2. Setup - Install additional prerequisites on the experiment nodes if not already included with the OS image
  3. Scenario - Conduct the individual experiments of the experimental campaign one after another
  4. Process - Process the collected artifacts and prepare initial results and plots
Fig 2: Experimental campaign flow, including details on scenario execution

The scenario phase is further divided into individual experiment runs, with each experiment being able to use its specific set of nodes, network configuration, TSN traffic policing, applications, etc. Each experiment consists of three major parts:

  1. Setup - Network (interface and TSN configurations), Stacks (applications, result collection), and Actions (events) are prepared in this order
  2. Run - The actual experiment and measurements are performed
  3. Post-processing - Artifacts are initially processed and collected on the management node. Then, the experiment nodes’ configuration is flushed in preparation for future experiments.

Configuration

An experimental campaign is defined as a scenario that may contain one or more experiments. To define those experiments, including the network topology, executed services, and collected artifacts, we use a configuration format consisting of five .yml configuration files:

  • 00-nodes.yml – list the nodes that are going to be used for the experimental campaign.
  • 01-network.yml – set the network configurations for the scenario.
  • 02-stacks.yml – define the applications that are run on each node, including their start and stop times.
  • 03-actions.yml – currently unused, with its functionality partially implemented via 02-stacks.yml. Lists actions that will be executed at a specific time during the experiment (e.g., introduce a node failure 30 seconds into the experiment).
  • 04-experiments.yml – list the actual experiments that should be executed during the experimental campaign. An experiment refers to one defined configuration each from the Network, Stacks, and Actions files, i.e., every experiment entry combines exactly one instance from each topic. One or more experiments can be defined per campaign.

In the following, we show examples for each of these configuration files. The resulting exemplary scenario and experimental campaign will consist of four nodes with various traffic and qdisc configurations between them.

Exemplary Hardware Deployment Topology

Fig 3 shows an exemplary hardware deployment consisting of 13 nodes. The deployment is representative of the one used in the EnGINE - JNSM 2022 publication.

Fig 3: Exemplary hardware deployment for EnGINE

00-nodes.yml

Define and map the nodes that will be used within the scenario and experimental campaign. Node names have to correspond to nodes in the given testbed. This file prepares the node definitions in the format node-X, which are then used in 01-network.yml and 02-stacks.yml.

---
# Set to "no" if you don't want to use the low-latency kernel
use_low_latency_kernel: "yes" 

# Set the number of isolated cores. Cores starting with core 0 will be isolated. If set to 0, no cores will be isolated. The maximum depends on your node configuration
num_isolated_cores: 3 

# Load the nodes for a pre-defined topology and its node mapping. Empty string to use own definition (see below).
topology: "" 

# Define your own required hosts and their mapping
nodes: 
  - zgw9
  - zgw8
  - zgw7
  - vcc1

# The mapping of the above nodes to node-X names
node_mapping:
  node-1: zgw9
  node-2: zgw8
  node-3: zgw7
  node-4: vcc1

01-network.yml

Set the network topology, the network paths that can be used (also referred to as flows), and the TSN configuration on each interface. The “network” variable has different network instances (net-1, net-2, etc.) as sub-elements. Each network definition is split into two parts: flows and tsn. A network topology is entirely defined through the flows. Only nodes and interfaces that occur in a flow are part of the topology that can be used for an experiment! It is recommended to first add node-to-node flows to build the basic topology and, on top of that, add “long-distance” or more complex flows that are later used by services/applications. The tsn part maps TSN configurations to nodes and interfaces, with the TSN qdisc configurations themselves being specified under the “tsnconfigs” variable.

Each tsnconfig (e.g. tsn-1) contains a configuration specifying the TSN qdisc functionality associated with a hardware queue. Each queue entry creates a queue with the defined behavior. The prio parameter defines what priorities should be processed by this queue. There are several considerations for the configurations:

  • The number of queue items is fixed, depending on the number of hardware queues (e.g., the Intel i210 has 4 hardware queues)
  • Available priorities for the prio list go from 0 to 15
  • All priorities need to be assigned to the queues; use ‘*’ in one queue for the remaining priorities
  • Always define at least one queue. Example of minimum queue configuration:
      {mode: be, prio: ['*']}
    

There are three main queue modes currently supported, with additional ones potentially being introduced in the future:

  1. Best-Effort - use the default FIFO (first-in-first-out) queuing
     { mode: be, prio: [0,1,'*'] }
    
  2. Credit-Based-Shaping - use the TSN credit-based shaper to queue packets (IEEE 802.1Qav). The idle/send/high/low parameters must be set according to the qdisc specification and expected traffic patterns. Offload uses the Qav feature of supported network cards (e.g., this can only be used in queues 1 + 2 for the Intel i210).
     { mode: cbs, prio: [2,3], idle: '0', send: '0', high: '0', low: '0', (offload: true) }
    
  3. Earliest TxTime First - queue the packets based on their launch timestamp. Offload uses the launch-time feature of supported network cards (e.g., for the Intel i210 this can only be used in queues 1 + 2). The default mode is strict, where packets are sent exactly at their launch timestamp; with “deadline: yes” this is changed to deadline_mode, where packets are sent at the latest at their launch timestamp.
     { mode: etf, prio: [4], (delta: 150000), (offload: true), (deadline: yes) }
    

Additionally, a scheduling algorithm needs to be used to divide the traffic across the queues. By default, MQPRIO is used, mapping traffic of a specified priority to the corresponding queue. Alternatively, TAPRIO, the time-aware priority scheduler introduced with the IEEE 802.1Qbv standard, can be used. It is configured in addition to the queues under the taprio parameter, where the desired windows and their corresponding hardware queues can be specified.

There are several considerations for the configurations:

  • Queue 0 in a taprio sched entry can be used to introduce a guard band
  • ETF offload can only be set for queue 1 or 2, a limitation of the Intel i210 network card
  • ETF queues together with TAPRIO must be used with txtime enabled; otherwise, only special (timestamped) traffic works

The following code snippet introduces a configuration example in which four different networks using two distinct TSN configurations are defined.

---
network:
  net-1:
    tsn:
      tsn-1: ["node-1:2"]
    flows:
      1: ':node-1:3,2:node-3:'
      2: ':node-1:2,2:node-2:1,4:node-3:'
    check: true
    num_net_cores: 2 # Specify the number of cores assigned to NIC IRQs. Cores beginning with core 0 will be allocated to the IRQs. E.g., a value of 2 means that IRQs are allocated to cores 0 and 1
    nic_irq_rt: true # Specify whether the NIC IRQs will be set to real-time priority in the Linux kernel
  
  net-2:
    tsn:
      tsn-1: ["node-1:2", "node-2:1"]
    flows:
      1: ':node-1:3,2:node-3:'
      2: ':node-1:2,2:node-2:1,4:node-3:'
      3: ':node-4:2,4:node-2:'
      4: ':node-4:1,1:node-3:'
    check: true
    num_net_cores: 2 
    nic_irq_rt: true 

  net-3:
    tsn:
      tsn-2: ["node-1"]
    flows:
      1: ':node-1:3,2:node-3:'
      2: ':node-1:2,2:node-2:1,4:node-3:'
    check: true
    num_net_cores: 2 
    nic_irq_rt: true 
  
  net-4:
    tsn:
      tsn-2: ["node-1", "node-2"]
    flows:
      1: ':node-1:3,2:node-3:'
      2: ':node-1:2,2:node-2:1,4:node-3:'
    check: true
    num_net_cores: 2 
    nic_irq_rt: true 


# Holds all TSN configurations
tsnconfigs:
  # TAPRIO TSN configuration with three child qdiscs - two with etf in offload (one also in deadline mode) and one best-effort queue
  # Taprio with 1ms schedule split across the queues with 300us, 300us, and 400us respectively
  tsn-1:
    name: 'Basic taprio - etf + etf deadline'   
    taprio:
      txtime: true
      delay: 400000
      sched:
        - { queue: [1], duration: 300 }
        - { queue: [2], duration: 300 }
        - { queue: [3], duration: 400 }
    queues:
      1: { mode: etf, prio: [3], delta: 300000, offload: yes }
      2: { mode: etf, prio: [2], delta: 300000, offload: yes, deadline: yes }
      3: { mode: be, prio: [1,'*'] }
  
  # MQPRIO TSN configuration with three child qdiscs - two cbs correctly configured for 100mbit/s each (see /scripts/configGenerators/prep_cbs_config.py) and one best effort
  tsn-2:
    name: 'Basic MQPRIO - two cbs + best-effort'
    taprio: {}
    queues:
      1: { mode: cbs, prio: [3], idle: '100000', send: '-900000', high: '155', low: '-1125' }
      2: { mode: cbs, prio: [2], idle: '100000', send: '-900000', high: '297', low: '-1125' }
      3: { mode: be, prio: [1,'*'] }

02-stacks.yml

Stacks define the services that are supposed to be run on select nodes during an experiment. Services are usually a synonym for applications. All available services and their parameters can be seen in the roles/services folder in the GitHub - Implementation repository.

Services are started during preparation and stopped directly after the experiment run time. When stopped, they send a signal to the experiment run script, indicating:

  • 0 if the service finished successfully
  • 1 when an error occurred.

If a failure signal (1) is received by the experiment runner, the experiment is aborted. When running an experiment in signal mode, the experiment finishes only when all registered services have sent a success signal to the runner. To register a service, the parameter signal: yes must be defined in its service entry.

All services are started with the run_service_action wrapper script that takes care of the signaling and sending registered signals to the experiment runner.

All services have five parameters specifying their start/stop timing within the experiment:

  • level - At which application instantiation level the service should be started
  • wait - By how many seconds the service start should be delayed, counted either from the point when the service run control script was called, or from the experiment start if sync_start is set to yes
  • run_time - For how many seconds the service should run after its start (counting from when the application was actually started)
  • sync_start - Specify whether to synchronize the application start with experiment start
  • sync_stop - Specify whether to synchronize the application stop with experiment finish

NOTE: For the iperf service, do not confuse time and run_time, which are both valid parameters. time specifies the duration of the iperf measurement using the capabilities of the application itself, while run_time is used by the service run control script.

Using the use_core parameter, the services can further be pinned to CPU cores using CPU affinity.
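
To illustrate how these parameters fit into a service entry, the following is a hypothetical iperf client definition with placeholder values (block style is used here for readability; it is equivalent to the flow style { ... } used in the examples below):

# Hypothetical iperf client entry illustrating the timing parameters; values are placeholders.
- name: iperf
  role: client
  flow: 1
  port: 1001
  level: 1         # started at application instantiation level 1
  wait: 2          # delay the start by 2 seconds
  time: 4          # iperf's own test duration (see the note above)
  run_time: 6      # the run control script stops the service 6 seconds after its start
  sync_start: yes  # synchronize the start with the experiment start
  sync_stop: no    # do not synchronize the stop with the experiment finish
  use_core: 2      # pin the service to CPU core 2
  signal: yes      # register the service's completion signal with the experiment runner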

The following example defines one stack with three Iperf3 clients on node-1, three corresponding Iperf3 servers on node-3 and respective tcpdump monitoring on both nodes.

---
stacks:  
  stack-1:
    name: 'Iperf measurements three streams'
    services:
      node-1:
        - { name: iperf, role: client, flow: 2, port: 1003, prio: 3, limit: 94400000, time: 4, size: 1180, level: 1, sync_start: yes, signal: yes, udp: yes, use_core: 2 }
        - { name: iperf, role: client, flow: 2, port: 1002, prio: 2, limit: 94400000, time: 4, size: 1180, level: 1, sync_start: yes, signal: yes, udp: yes, use_core: 2 }
        - { name: iperf, role: client, flow: 2, port: 1001, prio: 1, limit: 0, time: 6, size: 1180, level: 1, sync_start: yes, signal: yes, udp: yes, use_core: 2 }

        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1001", file: "p1001", level: 0, signal: no }
        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1002", file: "p1002", level: 0, signal: no }
        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1003", file: "p1003", level: 0, signal: no }
      node-3:
        - { name: iperf, role: server, flow: 2, port: 1001, level: 0, signal: yes }
        - { name: iperf, role: server, flow: 2, port: 1002, level: 0, signal: yes }
        - { name: iperf, role: server, flow: 2, port: 1003, level: 0, signal: yes }
        
        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1001", file: "p1001", level: 0, signal: no }
        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1002", file: "p1002", level: 0, signal: no }
        - { name: tcpdump, flow: [2], size: 64, filter: "udp dst port 1003", file: "p1003", level: 0, signal: no }
    protocols: {}

03-actions.yml

The playbook is a provision for future functionality where an event needs to be introduced during the runtime of an experiment. Currently, limited functionality for introducing services/applications in this way is implemented in 02-stacks.yml. Future extensions could include, e.g., fault injection, flow reconfiguration, qdisc reconfiguration, etc.

The following example shows how the action definition is currently implemented.

---
actions: {}

04-experiments.yml

The experiments definition playbook defines the actual experiments that will be run within an experimental campaign. It brings together the network and stack configurations. After all experiments are executed, the scenario is finished.

An experiment can be run in two different modes:

  1. Time mode: A timer is started and the experiment runs for N seconds
  2. Signal mode: Wait for the services to finish successfully (e.g., send a fixed number of packets, send all data from a dataset, etc.); when they do, they send a success signal. When all registered signals have been received, the experiment is stopped.

In both modes, the experiment is immediately stopped after receiving an error signal. You can comment out experiments that you don’t want to run.

---
experiments:
  ### Topology Triangle with TSN - two hops with TAPRIO/ETF configured only on the source
  - { network: 'net-1', stack: 'stack-1', action: 'action-1', time: 10, name: '1a_demo_run_etf_iperf_tsn-source' }
  
  ### Topology Triangle with TSN - two hops with TAPRIO/ETF configured the source and on the hop nodes
  - { network: 'net-2', stack: 'stack-1', action: 'action-1', time: 10, name: '1b_demo_run_etf_iperf_tsn-source+hop' }
  
  ### Topology Triangle with TSN - two hops with CBS configured only on the source
  - { network: 'net-3', stack: 'stack-1', action: 'action-1', time: 10, name: '2a_demo_run_cbs_iperf_tsn-source' }
  
  ### Topology Triangle with TSN - two hops with CBS configured the source and on the hop nodes
  - { network: 'net-4', stack: 'stack-1', action: 'action-1', time: 10, name: '2b_demo_run_cbs_iperf_tsn-source+hop' }

Measurements and Metrics

The measurements during experiments are performed by collecting two PCAP traces, one at the source and one at the sink of a network flow, as introduced in 02-stacks.yml. To improve accuracy, the captures are done using the hardware timestamping feature of the Network Interface Card (NIC) where possible. Since the source’s and sink’s clocks are accurately synchronized using PTP, the two PCAPs can be correlated. The packets are matched using the timestamp and sequence number contained within the payload of packets generated by the Iperf3 and Send_UDP applications.

We mostly consider two evaluation metrics for the experiments introduced above: end-to-end delay and jitter.

Delay calculation

The end-to-end delay $d_{e2e}$ is calculated as the difference between the time at which the packet was received at the sink, $t_{R}$, and the time at which it was sent at the source, $t_{S}$, as shown in the following equation:

$d_{e2e}(X) = t_{R}(X) - t_{S}(X) - t_{o}$

Due to the way packet timestamps are recorded in Linux, in certain cases an additional time offset $t_{o}$ needs to be considered. Amongst others, this occurs when a system UTC timestamp and a NIC hardware TAI timestamp are compared. At the time of writing, the difference $t_{o}$ between those two time formats amounts to 37 s. If no offset applies, $t_{o} = 0$.
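
As an illustrative example with made-up values: if the source records a system UTC timestamp $t_{S}(X) = 1650000000.000100\,\mathrm{s}$ and the sink records a NIC hardware TAI timestamp $t_{R}(X) = 1650000037.000250\,\mathrm{s}$, then with $t_{o} = 37\,\mathrm{s}$:

$d_{e2e}(X) = 1650000037.000250\,\mathrm{s} - 1650000000.000100\,\mathrm{s} - 37\,\mathrm{s} = 150\,\mathrm{\mu s}$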

Jitter calculation

The end-to-end delay jitter, that is, the difference in delay between two subsequent packets, is calculated based on the $d_{e2e}$ introduced in the previous section. The following equation yields the jitter for packet $X$:

$j_{e2e}(X) = d_{e2e}(X) - d_{e2e}(X-1)$

The jitter is defined as the difference between packet $X$’s delay $d_{e2e}(X)$ and the delay of the previous packet $d_{e2e}(X-1)$. The jitter for the first evaluated packet, $j_{e2e}(0)$, is ignored, as there is no preceding packet to compare against.
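
Continuing the made-up numbers from the delay example above: with $d_{e2e}(X-1) = 142\,\mathrm{\mu s}$ and $d_{e2e}(X) = 150\,\mathrm{\mu s}$,

$j_{e2e}(X) = 150\,\mathrm{\mu s} - 142\,\mathrm{\mu s} = 8\,\mathrm{\mu s}$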