Configure TURN server autoscaling

Last updated on December 6, 2024

Overview

The peer-to-peer (P2P) offering in AccelByte Gaming Services (AGS) Play includes the use of Traversal Using Relays around NAT (TURN) servers. TURN servers play an essential role in keeping players connected to each other, especially when Network Address Translation (NAT) or firewall restrictions are involved. This article gives an overview of TURN servers and their role within AGS Play, and walks through the steps to configure them.

Role of TURN servers

P2P communication allows players' devices to connect directly with each other, reducing the load on central servers and minimizing latency; however, direct connections are often obstructed by NAT and firewall configurations, which can prevent peers from establishing a direct link. TURN servers are used to circumvent this.

P2P Communication

TURN servers act as intermediaries that relay traffic between peers when direct connections fail. By doing so, they ensure that data can flow smoothly between players, regardless of their network environments. This relaying capability is critical in scenarios where other techniques, like Session Traversal Utilities for NAT (STUN), are insufficient to establish a direct connection.

Benefits of using TURN servers with AGS

  1. Enhanced Connectivity: TURN servers help ensure P2P players can consistently connect with each other regardless of network barriers.
  2. Low Latency: By facilitating direct P2P connections, TURN servers help minimize player latency.
  3. Scalability: TURN servers help distribute the network load, making the system more scalable and capable of handling a large number of simultaneous connections.
  4. Reliability: With TURN servers, games can maintain stable connections even in restrictive network environments.

Manage TURN servers in the AGS Admin Portal

Game admins and developers can manage TURN server configurations in the AGS Admin Portal. The TURN server management menu is available in the Publisher namespace under AGS Settings > TURN Server Configurations.

By default, there are no TURN servers deployed in the single-tenant environment of AGS Private Cloud. You can request a server deployment by clicking Submit a Ticket. You will be notified once a server has been deployed for your environment.

Once your server has been deployed, you will see it listed under TURN Server Regions on the TURN Server Configurations page. To configure it, click the edit icon to its right and fill out the following fields in the pop-up that appears:

  • Min. TURN Servers: The minimum number of TURN servers that will be deployed to a specific region.
  • Max. TURN Servers: The maximum number of TURN servers that will be deployed to a specific region.
  • Threshold Autoscale: The autoscaling policy that controls the number of TURN servers, either Normal or Aggressive:
    • Normal Scaling
      • Scale-in CPU threshold: 20%
      • Scale-in time CPU threshold: 60 Seconds
      • Scale-out CPU threshold: 80%
      • Scale-out time CPU threshold: 60 Seconds
    • Aggressive Scaling
      • Scale-in CPU threshold: 40%
      • Scale-in time CPU threshold: 60 Seconds
      • Scale-out CPU threshold: 60%
      • Scale-out time CPU threshold: 30 Seconds
  • Enable Bandwidth Threshold: When enabled, the TURN Manager scales the TURN servers in or out based on CPU utilization or bandwidth usage.
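
The fields above can be pictured as a small configuration record. The following is a minimal Python sketch, not an AGS API: the class names `ScalingPolicy` and `RegionConfig`, the `validate` check, and the region name in the usage note are all hypothetical. Only the preset threshold values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """CPU thresholds for one Threshold Autoscale preset."""
    scale_in_cpu_pct: int    # lower CPU limit, in percent
    scale_in_seconds: int    # time CPU must stay low before scale-in
    scale_out_cpu_pct: int   # upper CPU limit, in percent
    scale_out_seconds: int   # time CPU must stay high before scale-out


# Preset values as listed for the Threshold Autoscale field.
POLICIES = {
    "Normal": ScalingPolicy(20, 60, 80, 60),
    "Aggressive": ScalingPolicy(40, 60, 60, 30),
}


@dataclass(frozen=True)
class RegionConfig:
    """Hypothetical container for one region's settings (illustration only)."""
    region: str
    min_servers: int
    max_servers: int
    policy_name: str
    bandwidth_threshold_enabled: bool = False

    def validate(self) -> None:
        # Assumed sanity checks; the Admin Portal may enforce different rules.
        if self.policy_name not in POLICIES:
            raise ValueError(f"unknown policy: {self.policy_name!r}")
        if self.min_servers < 0:
            raise ValueError("min_servers must not be negative")
        if self.max_servers < self.min_servers:
            raise ValueError("max_servers must be >= min_servers")

    @property
    def policy(self) -> ScalingPolicy:
        return POLICIES[self.policy_name]
```

For example, `RegionConfig("us-west-2", 1, 4, "Aggressive").policy.scale_out_seconds` evaluates to 30, which reflects that the Aggressive preset reacts to high CPU usage in half the time of the Normal preset.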

Scaling Parameter Details

  • Scale-out CPU threshold: The upper limit of CPU usage, as a percentage (%). For example, if it is set to 80% and the average CPU usage of the TURN servers in the region reaches 85%, the TURN Manager will deploy a new instance once the scale-out time threshold is also met.
  • Scale-out time CPU threshold: How long, in seconds, the average CPU usage must stay at or above the upper limit before scale-out starts. For example, if it is set to 60 seconds and the average CPU usage stays at or above the upper limit for 60 seconds, the TURN Manager scales out the number of TURN servers.
  • Scale-in CPU threshold: The lower limit of CPU usage, as a percentage (%). For example, if it is set to 20% and the average CPU usage of the TURN servers in the region drops to 15%, the TURN Manager will mark a TURN server for removal (scale-in).
  • Scale-in time CPU threshold: How long, in seconds, the average CPU usage must stay at or below the lower limit before scale-in starts. For example, if it is set to 60 seconds and the average CPU usage stays at or below the lower limit for 60 seconds, the TURN Manager scales in the number of TURN servers.
  • Scale-in Bandwidth Limit (MB): The lower limit of the average network transmit (Tx) and receive (Rx) traffic, in megabytes (MB). For example, if it is set to 1 MB and the average network Tx and Rx over 30 seconds (the default value) falls to 1 MB, the TURN Manager scales in the number of TURN servers.
  • Scale-out Bandwidth Limit (MB): The upper limit of the average network transmit (Tx) and receive (Rx) traffic, in megabytes (MB). For example, if it is set to 10 MB and the average network Tx and Rx over 30 seconds (the default value) reaches 10 MB, the TURN Manager scales out the number of TURN servers.
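
The interplay between a threshold and its time limit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TURN Manager's implementation: the function names `held_for` and `decide`, the sample format, the use of `>=` and `<=` comparisons, and the rule that scale-out takes precedence over scale-in are all assumptions. The default values mirror the Normal preset and the bandwidth examples above.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, regional average CPU %)


def held_for(samples: List[Sample], window_s: float,
             breached: Callable[[float], bool]) -> bool:
    """True if `breached` holds for every sample going back at least
    `window_s` seconds from the newest sample (samples sorted by time)."""
    if not samples:
        return False
    start = samples[-1][0] - window_s
    for ts, value in reversed(samples):
        if not breached(value):
            return False      # the condition was interrupted inside the window
        if ts <= start:
            return True       # the condition held across the whole window
    return False              # not enough history to cover the window yet


def decide(cpu_samples: List[Sample], *,
           scale_out_cpu: float = 80.0, scale_out_s: float = 60.0,
           scale_in_cpu: float = 20.0, scale_in_s: float = 60.0,
           bw_avg_mb: Optional[float] = None,
           bw_out_mb: float = 10.0, bw_in_mb: float = 1.0) -> str:
    """Return "scale-out", "scale-in", or "hold".

    `bw_avg_mb` is the Tx/Rx average over the bandwidth window; pass None
    when Enable Bandwidth Threshold is off.
    """
    out = held_for(cpu_samples, scale_out_s, lambda v: v >= scale_out_cpu)
    inn = held_for(cpu_samples, scale_in_s, lambda v: v <= scale_in_cpu)
    if bw_avg_mb is not None:
        # CPU OR bandwidth: either signal alone is enough.
        out = out or bw_avg_mb >= bw_out_mb
        inn = inn or bw_avg_mb <= bw_in_mb
    if out:
        return "scale-out"    # assumption: scale-out wins over scale-in
    return "scale-in" if inn else "hold"
```

With samples every 10 seconds, 85% CPU held from t=0 to t=60 yields "scale-out", while the same run preceded by a single 50% reading at t=0 yields "hold" because the 60-second window is not yet fully covered.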

When players experience an issue with a specific TURN server instance, the admin can deactivate the instance and the backend will spawn a new TURN server to replace it.

info

The default TURN server instance uses 512 MB of CPU and 1 GB of memory. We tested one TURN server running on a default instance with 1,500 concurrent users (CCU), and the results were:

  • The TURN server reached 60% CPU usage
  • Memory increased from 237 MB to 307 MB
  • Network usage averaged 2.76 MB/s tx and rx
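
These figures can be turned into a rough capacity estimate. The Python sketch below assumes that CPU usage grows linearly with CCU and that idle CPU usage is negligible; neither assumption is confirmed by the test above, so treat the output as a starting point for your own load tests rather than a guarantee. The function names are hypothetical.

```python
import math

# Measured in the test above: 1,500 CCU on one default instance at 60% CPU.
TEST_CCU = 1500
TEST_CPU_PCT = 60.0


def ccu_at_cpu(target_cpu_pct: float) -> int:
    """Estimated CCU one default instance handles at the given CPU usage,
    assuming CPU usage scales linearly with CCU (an assumption)."""
    return math.floor(TEST_CCU * target_cpu_pct / TEST_CPU_PCT)


def servers_needed(expected_ccu: int, scale_out_cpu_pct: float) -> int:
    """Estimated instances required to keep the average CPU usage at or
    below the scale-out threshold for the expected CCU."""
    return math.ceil(expected_ccu / ccu_at_cpu(scale_out_cpu_pct))


print(ccu_at_cpu(80.0))            # 2000 (Normal scale-out threshold)
print(servers_needed(5000, 80.0))  # 3
print(servers_needed(5000, 60.0))  # 4 (Aggressive scale-out threshold)
```

An estimate like this can help when choosing Max. TURN Servers for a region: under these assumptions, 5,000 expected CCU would need a maximum of at least 3 servers with Normal scaling and 4 with Aggressive scaling.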

TURN server automatic scale-in logic

The TURN Manager scales in TURN server instances when the average CPU usage across all active TURN servers reaches the lower limit (Scale-in CPU threshold). It waits for the time threshold (Scale-in time CPU threshold) before scaling in, and it cancels the scale-in process if the average CPU usage rises above the lower limit before the time threshold is reached. If the average CPU usage is still at or below the lower limit when the time threshold is reached, the affected TURN servers are marked as inactive and are no longer returned in the get endpoint list. However, if at least one player is still connected to an inactive TURN server, the system waits indefinitely until the connection count reported by that TURN server process drops to zero before removing the TURN server (and its virtual machine).
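
The scale-in lifecycle described above can be sketched as a small state machine. This Python example is an illustration only: the `State` names, the `ScaleInTracker` class, and its `tick` method are hypothetical, and the sketch assumes that a server already marked inactive is not reactivated if CPU usage rises again, which this article does not specify.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    ACTIVE = "active"
    PENDING = "scale-in pending"    # CPU is low, time threshold still running
    DRAINING = "inactive, waiting for connections to close"
    REMOVED = "removed"


class ScaleInTracker:
    """Tracks one TURN server that is a candidate for scale-in."""

    def __init__(self, scale_in_cpu: float = 20.0, scale_in_s: float = 60.0):
        self.scale_in_cpu = scale_in_cpu
        self.scale_in_s = scale_in_s
        self.state = State.ACTIVE
        self._low_since: Optional[float] = None

    def tick(self, now: float, avg_cpu: float, connections: int) -> State:
        """Feed the regional average CPU % and this server's connection count."""
        if self.state in (State.ACTIVE, State.PENDING):
            if avg_cpu > self.scale_in_cpu:
                # CPU rose above the lower limit: cancel the scale-in.
                self.state = State.ACTIVE
                self._low_since = None
            else:
                if self._low_since is None:
                    self._low_since = now
                    self.state = State.PENDING
                if now - self._low_since >= self.scale_in_s:
                    # Time threshold reached: stop handing out this server.
                    self.state = State.DRAINING
        if self.state is State.DRAINING and connections == 0:
            # No players left on the server: it can be removed.
            self.state = State.REMOVED
        return self.state
```

Calling `tick` with low CPU at t=0, high CPU at t=30, and low CPU again from t=40 shows the cancel-and-restart behavior: the server only becomes inactive at t=100, and it is only removed once `connections` reaches 0.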