Use best practices for game integration

Last updated on June 12, 2024

Overview

Following best practices for game integration can help prevent issues from occurring, minimize the impact of issues that do occur, and provide players a better user experience.

This article covers some best practices to use for game integration.

Connectivity issues

From your game's main menu, several calls are made to initialize the game (e.g., login, fetching user profiles, cloud save information, a store, and user entitlements, and connecting to a lobby). Any of these calls can fail and leave parts of the game unusable, and the player needs to be told when that happens.

DON'T: Combine all errors into one handler

If a single error handler deals with every possible issue, you can't tell the player what specifically went wrong and can only display a generic error message. This leaves players without any troubleshooting options and drives them to submit support tickets.

DO: Limit player activity, implement failure policy

Your game should handle errors from the backend and gracefully degrade the player experience by limiting what players can do based on the error. For example:

If a login failure occurs, show an appropriate error message and a countdown UI for the next reconnection attempt (for example, after one minute and then after five minutes) while performing a backoff retry loop.
If a failure to fetch cloud saves occurs, use the default offline values for the game while attempting to get the data with a retry loop in the background.
If a failure to connect to a lobby occurs, disable the multiplayer areas of the game and display a message explaining that multiplayer is temporarily unavailable. Retry in the background while the player explores other areas, such as the store or their loadouts.

Implement a "failure policy" that specifies, for each task (e.g., login), what the client should do if that task fails (e.g., log and continue, halt the login flow, or display an error dialog). This becomes even more powerful if you control the failure policy from the backend without changing game logic (e.g., by fetching the latest failure policy as JSON from Cloud Save).
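A minimal sketch of a backend-controlled failure policy, assuming the policy is stored as a JSON object that maps task names to actions (for example, in a Cloud Save game record). The task names, action names, and the FailurePolicy class are hypothetical, and fetching the JSON from the backend is left out. Built-in defaults apply whenever the remote policy is missing or unreadable.

```python
import json
from enum import Enum
from typing import Dict, Optional


class FailureAction(Enum):
    LOG_AND_CONTINUE = "log_and_continue"
    HALT_FLOW = "halt_flow"
    SHOW_ERROR_DIALOG = "show_error_dialog"
    DISABLE_FEATURE = "disable_feature"


# Hypothetical defaults shipped with the client. They apply when the remote
# policy is missing, unreadable, or doesn't mention a task.
DEFAULT_POLICY: Dict[str, FailureAction] = {
    "login": FailureAction.SHOW_ERROR_DIALOG,
    "fetch_cloud_save": FailureAction.LOG_AND_CONTINUE,
    "connect_lobby": FailureAction.DISABLE_FEATURE,
}


class FailurePolicy:
    def __init__(self, remote_json: Optional[str] = None):
        self._actions = dict(DEFAULT_POLICY)
        remote = {}
        if remote_json:
            try:
                remote = json.loads(remote_json)
            except ValueError:  # includes json.JSONDecodeError
                remote = {}
        if not isinstance(remote, dict):
            remote = {}
        for task, action in remote.items():
            try:
                self._actions[task] = FailureAction(action)
            except (ValueError, TypeError):
                pass  # ignore unknown actions instead of crashing

    def on_failure(self, task: str) -> FailureAction:
        # Tasks without an entry get a visible error rather than a silent failure.
        return self._actions.get(task, FailureAction.SHOW_ERROR_DIALOG)
```

For example, if a Cloud Save outage turns out to break progression, publishing {"fetch_cloud_save": "halt_flow"} as the remote policy changes the client's behavior without shipping a patch.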

Accidental DDoS attacks

Online games are sometimes targets of distributed denial-of-service (DDoS) attacks, which can prevent players from using your game's online features. Sometimes the DDoS comes accidentally from your own game logic, through innocent bugs or unoptimized calling patterns. This section covers best practices for preventing accidental or real DDoS attacks and for handling them when they occur.

DON'T: Put DDoS-exploitable logic in your game

To help prevent a DDoS attack, avoid logic that calls the backend automatically. For example, consider having every client fetch Cloud Save game records at midnight UTC to get data for the upcoming week. Because every running client sends the same request at the same moment, this logic can cause an accidental DDoS attack if it isn't handled properly.

DO: Use manual, local, and on-demand logic

Instead of fetching automatically on a timer, check Cloud Save on demand, such as when reloading a level or when a player returns to the main menu. Also cache data locally and batch async calls when appropriate. These patterns spread requests out over time and are less exploitable for DDoS attacks.
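A minimal sketch of on-demand fetching with a local cache, assuming a hypothetical fetch_record callable that wraps your backend call. The game calls get() at natural points (level reload, return to the main menu) instead of on a shared timer, and repeated calls within max_age_s are served from the cache. The class name and the 300-second default are illustrative, not AccelByte values.

```python
import time
from typing import Any, Callable, Dict, Tuple


class OnDemandCache:
    """Fetches records only when the game asks for them, reusing recent results."""

    def __init__(self, fetch_record: Callable[[str], Any], max_age_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_record
        self._max_age = max_age_s
        self._clock = clock
        # key -> (time fetched, value)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, force: bool = False) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and not force and now - entry[0] < self._max_age:
            return entry[1]  # fresh enough: no backend call
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value
```

Because each client fetches when its own player reaches a menu or loads a level, requests are naturally spread out rather than synchronized across every client.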

DO: Communicate early and often

In the event of a DDoS attack, keep your players informed as early and as often as possible to reduce their confusion and frustration.

DO: Prevent access to affected areas

Contain the damage to the areas under attack by gating access to those areas in your game.

DO: Perform background retries

Set up a retry loop that runs continuously in the background to try to regain access to online features. This gets you back up and running as soon as possible.

See the next section for best practices when using retries in your game.

Background retries

You can implement background retries in your game for players to regain access to online features as quickly as possible. This section will go over best practices when using background retries.

DON'T: Use retries for every error

Not every error warrants a retry. For example, when new players try to fetch their Cloud Save player record, they get a Player Record Not Found error (18022). This error is expected, because new players don't have records yet, so it does not require a retry.
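A minimal sketch of deciding whether a failed call should be retried. The BackendError type and its fields are hypothetical stand-ins for whatever your SDK returns; only error code 18022 (Player Record Not Found) comes from this article. Treating 408, 429, and common 5xx statuses as transient, and other 4xx statuses as permanent, is a general HTTP convention assumed here, not an AccelByte rule.

```python
from dataclasses import dataclass
from typing import Optional

PLAYER_RECORD_NOT_FOUND = 18022  # expected for new players, not a failure

# HTTP statuses that usually indicate a temporary condition (assumption).
TRANSIENT_HTTP_STATUSES = {408, 429, 500, 502, 503, 504}


@dataclass
class BackendError:
    http_status: Optional[int]  # None for network-level failures (timeouts, DNS)
    error_code: Optional[int] = None


def should_retry(error: BackendError) -> bool:
    if error.error_code == PLAYER_RECORD_NOT_FOUND:
        return False  # handle by creating default data, not by retrying
    if error.http_status is None:
        return True  # connection problems are often transient
    return error.http_status in TRANSIENT_HTTP_STATUSES
```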

DO: Use backoff retry loops

Your game should implement a backoff retry loop to not overwhelm the backend with aggressive retries.

A backoff retry loop increases the delay before the next retry exponentially after each failed attempt, then adds a small amount of random variance (jitter) so that many clients don't unintentionally retry at the same moments. For more information on backoff retry loops, see the "Retry with backoff pattern" article from Amazon.
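A minimal sketch of a backoff retry loop with jitter. The call and is_retryable arguments are hypothetical callables you supply, and the base delay, cap, and attempt limit are example values to tune. This variant picks a random delay between zero and the exponential bound ("full jitter"), which is one of several common jitter strategies.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    is_retryable: Callable[[Exception], bool] = lambda e: True,
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as error:
            if attempt == max_attempts - 1 or not is_retryable(error):
                raise  # out of attempts, or an error that retrying won't fix
            # Exponential bound: 1 s, 2 s, 4 s, ... capped at max_delay_s.
            bound = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Full jitter spreads clients over [0, bound] so they don't retry in lockstep.
            sleep(rng.uniform(0, bound))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In a game client, this would run on a background thread or async task so the UI stays responsive while retries continue.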

Coming soon

Future AGS SDK releases will have backoff retry logic embedded into some of the APIs. Details will be available soon.

Player communication

Give your players as much information as possible to reduce frustration and provide a better experience.

DON'T: Generalize communication

A single broad message that covers several steps of a process might make players think nothing is happening, causing them to lose patience.

For example, matchmaking involves multiple steps. If you show a single message such as "Matchmaking in progress, please wait", the player won't know which stage matchmaking is in. If finding a match takes a while, the player may think matchmaking is stuck or has failed when it is simply looking for more players. This might cause the player to cancel matchmaking and possibly submit a support ticket.

DO: Tell the player what's happening

Matchmaking has different states, such as searching, match found, joining, waiting for server, etc. Communicate to the player what's going on using UI elements so they know if they have to do something or just need to wait.

This applies to error states, too. Let the player know relevant information where possible so they can either attempt to fix the problem (e.g., poor connection) or know that they can't (e.g., internal server error).

Providing this information will also help them communicate with your support team. Use this player communication philosophy in your game wherever it's applicable.
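A minimal sketch of mapping each matchmaking state, and each class of error, to its own message. The states follow the examples in this section; the MatchmakingState enum, the message strings, and the reference-code placeholder are hypothetical and would come from your own UI and localization system.

```python
from enum import Enum, auto


class MatchmakingState(Enum):
    SEARCHING = auto()
    MATCH_FOUND = auto()
    JOINING = auto()
    WAITING_FOR_SERVER = auto()
    FAILED_CONNECTION = auto()    # the player may be able to fix this
    FAILED_SERVER_ERROR = auto()  # the player can't fix this


STATUS_TEXT = {
    MatchmakingState.SEARCHING: "Searching for players ({elapsed}s)...",
    MatchmakingState.MATCH_FOUND: "Match found! Preparing your session...",
    MatchmakingState.JOINING: "Joining match...",
    MatchmakingState.WAITING_FOR_SERVER: "Waiting for a game server...",
    MatchmakingState.FAILED_CONNECTION: "Connection lost. Check your network and try again.",
    MatchmakingState.FAILED_SERVER_ERROR: (
        "Our servers hit a problem. Please try again in a few minutes. (Ref: {ref})"
    ),
}


def status_message(state: MatchmakingState, elapsed: int = 0, ref: str = "") -> str:
    # str.format ignores unused keyword arguments, so all templates share one call.
    return STATUS_TEXT[state].format(elapsed=elapsed, ref=ref)
```

Showing a reference code on server errors gives players something concrete to pass on to your support team.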

Security, access control, and failsafes

Implement feature access controls and follow security best practices to minimize the impact when something goes wrong.

DO: Use a feature kill switch with Cloud Save game records

Use Cloud Save game records to create a feature kill switch that can quickly disable access to specific parts and features of your game. This lets you contain the damage from any issues that arise.

Make sure that you have a fallback plan if the game records in Cloud Save are not accessible from the backend. The priority will be as follows (from highest to lowest):

Cloud save
Launch parameters (debug mode only)
Local configurations
Default values
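A minimal sketch of resolving a feature flag using the priority order above. Each source is modeled as a plain dictionary, or None when it is unavailable (for example, when Cloud Save can't be reached). Loading those dictionaries from Cloud Save, launch parameters, or a local config file is left out, and the function name and the "unknown features stay off" rule are assumptions.

```python
from typing import Mapping, Optional


def is_feature_enabled(
    feature: str,
    cloud_save: Optional[Mapping[str, bool]],
    launch_params: Mapping[str, bool],
    local_config: Mapping[str, bool],
    defaults: Mapping[str, bool],
    debug_build: bool = False,
) -> bool:
    # Highest to lowest priority. Launch parameters only count in debug builds,
    # so players can't re-enable a disabled feature from the command line.
    sources = [cloud_save, launch_params if debug_build else None, local_config, defaults]
    for source in sources:
        if source is not None and feature in source:
            return bool(source[feature])
    return False  # assumption: features with no value anywhere stay off
```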

Coming soon

We are preparing a guide with a reference implementation of a feature kill switch using Cloud Save.

DO: Double check all your roles and permissions

Make sure every user has only the minimum permissions they need. For example, if your game is a competitive multiplayer game or uses dedicated servers, players should not have permission to update stats. Only your dedicated servers should have that permission.

AccelByte service use

This section covers general best practices for administrators and developers using services offered by AccelByte.

DO: Understand your call dependencies

Understand the backend calls being made from your game client, game server, and custom services, and the dependency chain between them where one exists.

Examples:

All calls to the AccelByte backend will most likely be blocked until you are properly logged in or have a valid AccelByte token, whether that's a user credential login or a login with a client ID and secret.
Your game probably does not need to make fetching a player's inventory or player records depend on fetching store information.
Store and pricing information, on the other hand, does need to be fetched before making a checkout or purchase call.
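A minimal sketch of expressing these dependencies with Python's asyncio: login finishes first because every other call needs a token, independent fetches then run concurrently, and checkout waits only for the store data it depends on. The fake_call coroutine and the call names are hypothetical stand-ins for SDK calls, and the purchase flow is simplified to run right after the store loads.

```python
import asyncio
from typing import List


async def fake_call(name: str, log: List[str], delay: float = 0.01) -> str:
    """Stand-in for a backend request that records when it starts and ends."""
    log.append(f"start:{name}")
    await asyncio.sleep(delay)
    log.append(f"end:{name}")
    return name


async def purchase_flow(log: List[str]) -> None:
    # Checkout genuinely depends on store and pricing data, so chain these two.
    await fake_call("fetch_store", log)
    await fake_call("checkout", log)


async def initialize(log: List[str]) -> None:
    # Every other call needs a valid token, so login must finish first.
    await fake_call("login", log)
    # Inventory and player records don't depend on the store (or on each
    # other), so they run concurrently instead of in a sequential chain.
    await asyncio.gather(
        fake_call("fetch_inventory", log),
        fake_call("fetch_player_records", log),
        purchase_flow(log),
    )
```

Running asyncio.run(initialize(log)) records login first, then the three independent requests starting together, and checkout starting only after the store fetch ends.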

Be careful about assumptions involving race conditions, especially with callbacks from the backend that can arrive out of order.

Examples:

In development, your async calls to AccelByte APIs may complete very quickly and in a seemingly deterministic order, but when you run them in a different environment or under a different load profile, you may find that:
- the callbacks now fire in a different order, and
- someone had accidentally made a stateful assumption about the order in which the callbacks resolve.
Callbacks sometimes fire immediately (e.g., before the requesting function returns, because the result is cached), while the calling code assumes the callback or delegate will trigger "later".
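A minimal sketch of both pitfalls and one way to guard against them. CachedProfileClient is a hypothetical client that invokes the callback synchronously on a cache hit and otherwise completes requests later, in whatever order complete() is called. ProfilePanel (also hypothetical) sets its loading state before issuing the request, so an immediate callback is handled correctly, and tags each request with a sequence number so it can discard stale responses that arrive out of order.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CachedProfileClient:
    """Hypothetical client: answers from cache synchronously, otherwise later."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        # Requests awaiting a simulated backend reply, as (user_id, callback).
        self.pending: List[Tuple[str, Callable[[str], None]]] = []

    def get_profile(self, user_id: str, callback: Callable[[str], None]) -> None:
        if user_id in self.cache:
            callback(self.cache[user_id])  # fires before get_profile returns
        else:
            self.pending.append((user_id, callback))

    def complete(self, index: int, profile: str) -> None:
        """Simulate the backend answering pending request `index`, in any order."""
        user_id, callback = self.pending.pop(index)
        self.cache[user_id] = profile
        callback(profile)


class ProfilePanel:
    def __init__(self, client: CachedProfileClient) -> None:
        self.client = client
        self.loading = False
        self.shown: Optional[str] = None
        self._latest_request = 0

    def show(self, user_id: str) -> None:
        self._latest_request += 1
        request_id = self._latest_request
        # Update state BEFORE issuing the call. On a cache hit the callback runs
        # immediately; setting loading = True afterwards would leave the panel
        # stuck in its loading state.
        self.loading = True
        self.client.get_profile(user_id, lambda p: self._on_profile(request_id, p))

    def _on_profile(self, request_id: int, profile: str) -> None:
        # Drop responses to superseded requests; they can arrive out of order.
        if request_id != self._latest_request:
            return
        self.loading = False
        self.shown = profile
```

With two requests in flight, the reply to the older one may arrive last; the sequence check keeps it from overwriting the newer result.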

DO: Analyze and optimize your AccelByte service usage

As with any other system, perform a health check on your online service access patterns under different scenarios and optimize them accordingly.

Reach out to AccelByte to coordinate a Well-Architected Review of your game, in which we profile your game to detect unoptimized usage (e.g., unnecessary sequential call chains, large payloads, unhandled error responses, etc.).

We will look into providing this profiling mechanism as a self-service tool in Q2 2024.

Examples:

If an API supports pagination, use pagination properly. Using a large value for max=<number> strains the backend and could cause your game to fail certification latency tests or give players with poor internet connections a bad experience.
- The SDK will show a warning, or may return an error, if some APIs are used improperly (for example, pagination).
Compare against locally cached data before issuing a request to modify that data in the backend. For example, if you are updating a Cloud Save record, first check that the data has actually changed; this saves a request and avoids unnecessary strain on services. The same applies to updating presence information.
- We suggest implementing a dirty flag on your objects to track whether there are changes to flush to the backend.
Where it makes sense, batch multiple sequential API calls over a configurable time interval. For example, batch frequent in-game actions that affect a player's statistics.
- If you think we are missing a Bulk API, please contact AccelByte and we will be happy to discuss!
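A minimal sketch of the dirty-flag and batching ideas, assuming hypothetical save and send_bulk callables that wrap your backend calls (for example, a Cloud Save record update and a bulk stat update). SyncedRecord only saves when the value differs from the last value known to be in the backend; StatBatcher accumulates stat deltas and sends at most one bulk request per configurable interval, and none when nothing changed. Both class names are illustrative.

```python
import copy
import time
from typing import Callable, Dict


class SyncedRecord:
    """Tracks the last value known to be in the backend and saves only real changes."""

    def __init__(self, save: Callable[[dict], None], initial: dict):
        self._save = save
        self._synced = copy.deepcopy(initial)  # last value confirmed in the backend
        self.value = copy.deepcopy(initial)    # value the game edits locally

    @property
    def dirty(self) -> bool:
        return self.value != self._synced

    def flush(self) -> bool:
        if not self.dirty:
            return False  # unchanged: skip the request
        self._save(copy.deepcopy(self.value))
        self._synced = copy.deepcopy(self.value)
        return True


class StatBatcher:
    """Accumulates stat deltas locally and sends them in one bulk call per interval."""

    def __init__(self, send_bulk: Callable[[Dict[str, float]], None],
                 flush_interval_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send_bulk = send_bulk
        self._interval = flush_interval_s
        self._clock = clock
        self._pending: Dict[str, float] = {}
        self._last_flush = clock()

    def add(self, stat_code: str, delta: float) -> None:
        self._pending[stat_code] = self._pending.get(stat_code, 0) + delta

    def tick(self) -> bool:
        """Call from the game loop; flushes only once the interval has elapsed."""
        if self._clock() - self._last_flush < self._interval:
            return False
        return self.flush()

    def flush(self) -> bool:
        self._last_flush = self._clock()
        batch = {code: d for code, d in self._pending.items() if d != 0}
        self._pending = {}
        if not batch:
            return False  # nothing changed: skip the request
        self._send_bulk(batch)
        return True
```

A production version would also keep unsent data if the backend call fails, rather than discarding it as this sketch does.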

Console certification

Dealing with certification can be challenging, and we want to help ease the pain.

DO: Review AccelByte certification guidelines

Coming soon

We are preparing a list of steps to help ensure a successful certification flow, such as:

Managing different namespaces, environments, and configurations.
Things you need to know to pass Console Certification for PlayStation and Xbox.
Rolling out a new game patch in a cross-platform game.

Be on the lookout for the guideline in Q4 2023!

Launch and post-launch

This section offers general advice for having the best experience during and after your game's launch.

DO: Hope for the best, plan for the worst

While we all hope for a successful launch, plan ahead in case something goes wrong. Use your imagination to make educated guesses about what could happen and write them down. When you have listed your worst fears, ask yourself the following questions:

How are you going to detect if something goes wrong?
Are there mitigations in place, such as the ones offered in this article?
How do you communicate with the players that the issue is identified and you are working on it?
How fast can the mitigation be put in place and who is doing it?
What's the plan after the mitigation is in place? Are your employees aware that live issues take priority over future work? How are you going to track work on live issues?

For example:

20% of the servers are crashing, and the core dump process is hammering the core collector service, causing it to crash.
The most popular platform rolled back the client version because the latest version has a vulnerability.
One of your employees was sleep-deprived in week 2 of launch and accidentally deleted the wrong Amazon Web Services (AWS) account.

Use the advice offered in this article to help prevent, plan for, and deal with these types of situations.

DO: Load test, identify breaking points, and set mitigation

There are many ways to put pressure on a system: load tests, soak tests, spike tests, and peak tests. In general, you should test the limits of two metrics: maximum capacity and maximum rate. Monitor these key operational metrics and add an alert when they get close to their maximum values.

For example:

Your game has 10 players per dedicated server. You have an agreement with AWS for 100 virtual machines (VMs), and each VM can run 50 dedicated servers. That defines your maximum capacity: 10 players x 50 dedicated servers x 100 VMs = 50,000 players. If VM and dedicated server allocation averages 100 dedicated servers per minute, your maximum rate is 10 players x 100 dedicated servers = 1,000 players entering the game per minute. Going from 0 to 50,000 players therefore takes 50 minutes. What does the player see during those 50 minutes?
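The same arithmetic as a small helper, so it can be rerun as your numbers change, plus a simple threshold check to drive the alerts mentioned above. The function names, parameters, and the 80% alert threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class CapacityPlan:
    max_players: int
    players_per_minute: float
    minutes_to_fill: float


def plan_capacity(players_per_server: int, servers_per_vm: int, vm_count: int,
                  servers_allocated_per_minute: float) -> CapacityPlan:
    # Maximum capacity: every server on every VM is full.
    max_players = players_per_server * servers_per_vm * vm_count
    # Maximum rate: how many players new server allocations can admit per minute.
    players_per_minute = players_per_server * servers_allocated_per_minute
    return CapacityPlan(max_players, players_per_minute, max_players / players_per_minute)


def near_limit(current: float, maximum: float, threshold: float = 0.8) -> bool:
    """True when a metric reaches `threshold` of its maximum; a hook for alerting."""
    return current >= threshold * maximum
```

With the example numbers, plan_capacity(10, 50, 100, 100) gives 50,000 players, 1,000 players per minute, and 50 minutes to fill.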

Bonus: Matchmaking design tips

Here are some bonus design tips unrelated to online service integration that might help improve your online game.

If you are creating a lot of distinct queues based on your matchmaking rules and your game's design, consider implementing a Quick Match to help get players into fuller matches.

Understand your game and the options you have to control what is important for your game. Do you want players getting into matches quickly (low wait time), or do you let players wait longer in the queue for higher match quality?

"A good matchmaker is not to make the best matches; it is to minimize the number of bad matches. Players will not post on Reddit because they had a good match, but they will post if they are in a terrible match." - Laurent Bourcier, Moonshot studio.

Matchmaking can be very specific to game design, but in general, we want more players to play together. Playing with other people is a better experience than playing with AI.

There are two good approaches to matching more players together:

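Use a flexing rule that first tries to get a full group of players together, and then flexes down to a smaller number of players over time.
Get players into a match quickly, and use backfill to let more players join.

The following example ruleset combines both approaches: its alliance_flexing_rule gradually lowers the minimum team size from 4 players to 1, and auto_backfill is enabled.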

{
    "alliance": {
        "min_number": 1,
        "max_number": 1,
        "player_min_number": 4,
        "player_max_number": 4
    },
...
    "alliance_flexing_rule": [
        {
            "duration": 10,
            "min_number": 1,
            "max_number": 1,
            "player_min_number": 3,
            "player_max_number": 4
        },
        {
            "duration": 20,
            "min_number": 1,
            "max_number": 1,
            "player_min_number": 2,
            "player_max_number": 4
        },
        {
            "duration": 30,
            "min_number": 1,
            "max_number": 1,
            "player_min_number": 1,
            "player_max_number": 4
        }
    ],
    "auto_backfill": true
}
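To illustrate how the flexing rule above relaxes the team-size requirement over time, here is a sketch that picks the active player range for a given wait time. It assumes duration is the number of seconds a ticket has waited in the queue and that the rule with the largest duration not exceeding that wait applies; check the AccelByte matchmaking documentation for the exact semantics. The function name is hypothetical, and the elided fields of the ruleset are left out.

```python
from typing import Any, Dict, Tuple

RULESET: Dict[str, Any] = {
    "alliance": {"min_number": 1, "max_number": 1,
                 "player_min_number": 4, "player_max_number": 4},
    "alliance_flexing_rule": [
        {"duration": 10, "min_number": 1, "max_number": 1,
         "player_min_number": 3, "player_max_number": 4},
        {"duration": 20, "min_number": 1, "max_number": 1,
         "player_min_number": 2, "player_max_number": 4},
        {"duration": 30, "min_number": 1, "max_number": 1,
         "player_min_number": 1, "player_max_number": 4},
    ],
    "auto_backfill": True,
}


def active_player_range(ruleset: Dict[str, Any], waited_s: float) -> Tuple[int, int]:
    """Return (player_min_number, player_max_number) in effect after waiting `waited_s`."""
    active = ruleset["alliance"]
    # Apply each flexing step whose duration has already elapsed, in order.
    for rule in sorted(ruleset.get("alliance_flexing_rule", []), key=lambda r: r["duration"]):
        if waited_s >= rule["duration"]:
            active = rule
    return active["player_min_number"], active["player_max_number"]
```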

Last, but not least, leverage your solo players to round out parties and fill non-full matches with backfill and less-restrictive matchmaking requirements (such as a "Quick Match" option).

Special thanks to our friends at Final Strike Games and Dreamhaven for helping us write this article.
