Configure TURN server autoscaling
Overview
The peer-to-peer (P2P) offerings in AccelByte Gaming Services (AGS) Play include the use of Traversal Using Relays around NAT (TURN) servers. TURN servers play an essential role in facilitating communication between players, especially when Network Address Translation (NAT) or firewall restrictions are factors. This article gives an overview of TURN servers and their role within AGS Play, and explains how to configure them.
Role of TURN servers
P2P communication allows players' devices to connect directly with each other, reducing the load on central servers and minimizing latency; however, direct connections are often obstructed by NAT and firewall configurations, which can prevent peers from establishing a direct link. TURN servers are used to circumvent this.
P2P Communication
TURN servers act as intermediaries that relay traffic between peers when direct connections fail. By doing so, they ensure that data can flow smoothly between players, regardless of their network environments. This relaying capability is critical in scenarios where other techniques, like Session Traversal Utilities for NAT (STUN), are insufficient to establish a direct connection.
Benefits of using TURN servers with AGS
- Enhanced Connectivity: TURN servers help ensure P2P players can consistently connect with each other regardless of network barriers.
- Low Latency: By facilitating direct P2P connections, TURN servers help minimize player latency.
- Scalability: TURN servers help distribute the network load, making the system more scalable and capable of handling a large number of simultaneous connections.
- Reliability: With TURN servers, games can maintain stable connections even in restrictive network environments.
Manage TURN servers in the AGS Admin Portal
Game admins and developers can manage TURN server configurations in the AGS Admin Portal. The TURN server management menu is available in the Publisher namespace under AGS Settings > TURN Server Configurations.
TURN servers are available for both AGS Shared and Private Cloud environments. By default, a TURN server will be deployed in a specific region. If you require deployment in another region, you can request it by clicking Submit a Ticket. You will be notified once the server has been deployed for your environment.
In the Private Cloud environment, under your Publisher namespace, you can view TURN servers listed in the Turn Server Regions section on the TURN Server Configurations page. To configure a TURN server, click the edit icon next to the region and fill out the following fields in the pop-up window:
- Min. TURN Servers: The minimum number of TURN servers that will be deployed in a specific region.
- Max. TURN Servers: The maximum number of TURN servers that will be deployed in a specific region.
- Threshold Autoscale: Defines the autoscaling policy for TURN servers, either Normal or Aggressive:
  - Normal Scaling
    - Scale-in CPU threshold: 20%
    - Scale-in time CPU threshold: 60 seconds
    - Scale-out CPU threshold: 80%
    - Scale-out time CPU threshold: 60 seconds
  - Aggressive Scaling
    - Scale-in CPU threshold: 40%
    - Scale-in time CPU threshold: 60 seconds
    - Scale-out CPU threshold: 60%
    - Scale-out time CPU threshold: 30 seconds
- Enable Bandwidth Threshold: When enabled, the TURN Manager will scale TURN servers in or out based on either CPU utilization or bandwidth usage, whichever threshold is reached first.
In the Shared Cloud environment, TURN server configurations are managed by AccelByte and are not configurable from your Studio Namespace.
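As a rough illustration, the Private Cloud fields and the two presets above can be modeled as plain data. This is a minimal sketch only: the class names, field names, and the validation rules (a non-negative minimum that does not exceed the maximum) are assumptions made for this example, not part of the AGS API or Admin Portal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuScalingPolicy:
    """CPU thresholds for one autoscaling preset (hypothetical structure)."""
    scale_in_cpu_percent: int
    scale_in_seconds: int
    scale_out_cpu_percent: int
    scale_out_seconds: int


# Values copied from the Normal and Aggressive presets in this article.
PRESETS = {
    "Normal": CpuScalingPolicy(20, 60, 80, 60),
    "Aggressive": CpuScalingPolicy(40, 60, 60, 30),
}


@dataclass(frozen=True)
class RegionConfig:
    """Hypothetical container for the fields in the region edit pop-up."""
    min_servers: int
    max_servers: int
    threshold_autoscale: str  # "Normal" or "Aggressive"
    bandwidth_threshold_enabled: bool = False

    def validate(self) -> None:
        # Assumed sanity checks; the portal may enforce different rules.
        if self.min_servers < 0:
            raise ValueError("Min. TURN Servers must not be negative")
        if self.max_servers < self.min_servers:
            raise ValueError("Max. TURN Servers must be >= Min. TURN Servers")
        if self.threshold_autoscale not in PRESETS:
            raise ValueError("Threshold Autoscale must be Normal or Aggressive")

    @property
    def policy(self) -> CpuScalingPolicy:
        return PRESETS[self.threshold_autoscale]
```

For example, `RegionConfig(1, 4, "Aggressive").policy.scale_out_seconds` gives the 30-second scale-out window of the Aggressive preset.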
Scaling Parameter Details
- Scale-out CPU threshold: The upper limit of CPU usage, as a percentage (%). For example, if it is set to 80% and the average CPU usage of the TURN servers in the region reaches 85%, the TURN Manager will need to deploy a new instance.
- Scale-out time CPU threshold: How long, in seconds, the CPU usage must stay at the upper limit before scale-out starts. For example, if it is set to 60 seconds and the average CPU usage stays at the upper limit for 60 seconds, the TURN Manager scales out the TURN servers.
- Scale-in CPU threshold: The lower limit of CPU usage, as a percentage (%). For example, if it is set to 20% and the average CPU usage of the TURN servers in the region drops to 15%, the TURN Manager will mark a TURN server to be removed (scale-in).
- Scale-in time CPU threshold: How long, in seconds, the CPU usage must stay at the lower limit before scale-in starts. For example, if it is set to 60 seconds and the average CPU usage stays at the lower limit for 60 seconds, the TURN Manager scales in the TURN servers.
- Scale-in Bandwidth Limit (MB): The lower limit of the average network transmit (Tx) and receive (Rx) traffic, in megabytes (MB). For example, if it is set to 1 MB and the average Tx and Rx over 30 seconds (the default value) is 1 MB, the TURN Manager scales in the TURN servers.
- Scale-out Bandwidth Limit (MB): The upper limit of the average network transmit (Tx) and receive (Rx) traffic, in megabytes (MB). For example, if it is set to 10 MB and the average Tx and Rx over 30 seconds (the default value) reaches 10 MB, the TURN Manager scales out the TURN servers.
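The way a limit and its time threshold work together can be sketched as a check on how long the most recent samples have stayed past the limit. The code below is an illustration under stated assumptions, not the TURN Manager's implementation: all names are hypothetical, samples are assumed to be region-wide averages with timestamps in seconds, a value exactly at the limit is assumed to count as a breach, and scale-out is assumed to win if CPU and bandwidth disagree.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, averaged metric value)


@dataclass(frozen=True)
class Thresholds:
    """Limits and time thresholds for one metric (hypothetical structure)."""
    scale_in_limit: float
    scale_in_seconds: float
    scale_out_limit: float
    scale_out_seconds: float


def breach_duration(samples: List[Sample], limit: float, above: bool) -> float:
    """Seconds the newest samples have continuously been past the limit."""
    if not samples:
        return 0.0
    latest = samples[-1][0]
    start = None
    for timestamp, value in reversed(samples):
        breaching = value >= limit if above else value <= limit
        if not breaching:
            break
        start = timestamp
    return 0.0 if start is None else latest - start


def metric_decision(samples: List[Sample], t: Thresholds) -> str:
    """Return "scale_out", "scale_in", or "hold" for a single metric."""
    if breach_duration(samples, t.scale_out_limit, True) >= t.scale_out_seconds:
        return "scale_out"
    if breach_duration(samples, t.scale_in_limit, False) >= t.scale_in_seconds:
        return "scale_in"
    return "hold"


def decide(cpu_samples: List[Sample], cpu: Thresholds,
           bw_samples: Optional[List[Sample]] = None,
           bw: Optional[Thresholds] = None) -> str:
    """Combine CPU and (optionally) bandwidth; either metric can trigger."""
    decisions = [metric_decision(cpu_samples, cpu)]
    if bw_samples is not None and bw is not None:
        decisions.append(metric_decision(bw_samples, bw))
    # Assumption: scale-out takes priority when the two metrics conflict.
    for action in ("scale_out", "scale_in"):
        if action in decisions:
            return action
    return "hold"
```

With the Normal preset, `Thresholds(20, 60, 80, 60)`, CPU samples that stay at 85% for a full 60 seconds produce `"scale_out"`, while a 30-second spike produces `"hold"`.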
When players experience an issue with a specific TURN server instance, the admin can deactivate the instance and the backend will spawn a new TURN server to replace it.
The default TURN server instance uses 512 MB of CPU and 1 GB of memory. We tested 1,500 concurrent users (CCU) accessing one TURN server on a default instance, and the results were:
- The TURN server reached 60% CPU usage
- Memory increased from 237 MB to 307 MB
- Network usage averaged 2.76 MB/s Tx and Rx
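These figures can be turned into rough per-user estimates for capacity planning. The sketch below assumes that CPU, memory, and network usage grow linearly with the number of users, which the test above does not establish, so treat the output as a first approximation only. The function and variable names are hypothetical.

```python
# Figures taken from the load test described above.
TEST_CCU = 1500
TEST_CPU_PERCENT = 60.0
TEST_MEMORY_START_MB = 237.0
TEST_MEMORY_END_MB = 307.0
TEST_NETWORK_MB_PER_S = 2.76


def memory_per_user_mb() -> float:
    """Memory growth during the test divided by the number of users."""
    return (TEST_MEMORY_END_MB - TEST_MEMORY_START_MB) / TEST_CCU


def network_per_user_mb_per_s() -> float:
    """Average network throughput divided by the number of users."""
    return TEST_NETWORK_MB_PER_S / TEST_CCU


def estimated_ccu_at(cpu_percent: float) -> float:
    """Users one default instance might carry at a given CPU level.

    Assumes CPU usage scales linearly with CCU (an unverified assumption).
    """
    return TEST_CCU * cpu_percent / TEST_CPU_PERCENT


if __name__ == "__main__":
    print(f"Memory per user:  {memory_per_user_mb():.4f} MB")
    print(f"Network per user: {network_per_user_mb_per_s():.5f} MB/s")
    print(f"Estimated CCU at 80% CPU: {estimated_ccu_at(80.0):.0f}")
```

Under the linear assumption, this works out to roughly 0.047 MB of memory and 0.0018 MB/s of traffic per user, and about 2,000 CCU per instance at the Normal preset's 80% scale-out threshold.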
TURN server automatic scale-in logic
The TURN Manager scales in TURN server instances when the average CPU usage across all active TURN servers reaches the lower limit (Scale-in CPU threshold). It waits for the time threshold (Scale-in time CPU threshold) before scaling in. If the average CPU usage rises above the lower limit before the time threshold is reached, the TURN Manager cancels the scale-in process. If the time threshold is reached and the average CPU usage is still under the lower limit, the selected TURN servers are marked as inactive and are no longer returned in the get endpoint list. However, if at least one player is still connected to an inactive TURN server, the system waits indefinitely until the connection count reported by the TURN server process drops to zero before removing the TURN server (and its virtual machine).
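The scale-in sequence described above can be sketched as a small state tracker for one TURN server. This is an illustrative model, not the TURN Manager's code: the class, method, and state names are hypothetical, and it assumes that once a server is marked inactive, a later rise in CPU usage does not reactivate it (the behavior in that case is not stated here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScaleInTracker:
    """Tracks the scale-in progress of one TURN server (hypothetical)."""
    scale_in_cpu_percent: float
    scale_in_seconds: float
    below_since: Optional[float] = None  # when CPU first reached the lower limit
    draining: bool = False               # marked inactive, waiting for players

    def step(self, now: float, avg_cpu: float, connections: int) -> str:
        """Advance the tracker and return the resulting state.

        avg_cpu is the average across all active TURN servers;
        connections is the player connection count on this server.
        """
        if not self.draining:
            if avg_cpu > self.scale_in_cpu_percent:
                # CPU rose above the lower limit: cancel any pending scale-in.
                self.below_since = None
                return "active"
            if self.below_since is None:
                self.below_since = now
            if now - self.below_since < self.scale_in_seconds:
                return "waiting"
            # Time threshold reached: mark the server inactive.
            self.draining = True
        # Inactive servers are removed only once no player is connected.
        return "remove" if connections == 0 else "draining"
```

Calling `step` periodically with fresh measurements yields `"waiting"` while the time threshold is running, `"active"` if the scale-in is cancelled, `"draining"` while players are still connected to an inactive server, and `"remove"` once the connection count is zero.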