Backtesting a Team Allocation Algorithm Across Six Seasons of Game Data

Anthony G. Tellez

Earlier this month I ran a formal backtest on a planning algorithm I had been using for Union Raid, a cooperative mode in an anime rail shooter. The backtest covered six seasons of data and produced two numbers worth reporting honestly. Allocation accuracy: 95.0%. Damage prediction accuracy: 52.7%. Both tell you something true about what planning algorithms can and cannot do.

The Problem Structure

Union Raid is a constrained optimization problem. Each season, a guild of 32 players faces five elemental bosses across three regular laps, followed by a fourth lap featuring a single infinite-HP boss designed to absorb all remaining attacks. Each player gets exactly three attack slots, for 96 total attacks. Bosses have elemental weaknesses, and matching-element teams deal amplified damage. Lap 1 must clear each boss before Lap 2 begins, and so on through Lap 3. The planning problem is: distribute 96 attacks across five weaknesses and three laps such that coverage meets a threshold at each lap and hard player constraints are satisfied.

Hard constraints are non-negotiable: 96 total teams, exactly 3 per player. The soft constraints are the interesting part: per-boss coverage targets, priority scoring based on HP ratios, and safety margins for high-difficulty bosses.
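
For concreteness, here is a minimal sketch of how the problem state can be represented in code. The constant names, the weakness labels, and the feasibility check are illustrative rather than the planner's actual data model; only the counts come from the rules above.

# Hypothetical problem state; names and labels are illustrative.
PLAYERS = 32
SLOTS_PER_PLAYER = 3
TOTAL_TEAMS = PLAYERS * SLOTS_PER_PLAYER  # 96, the hard constraint

WEAKNESSES = ["Fire", "Water", "Wind", "Iron", "Electric"]  # five per season

def is_feasible(allocation, player_slots):
    # Hard constraints only: 96 teams total, exactly 3 per player.
    # allocation: {weakness: team count}; player_slots: {player: [teams]}.
    return (sum(allocation.values()) == TOTAL_TEAMS
            and all(len(t) == SLOTS_PER_PLAYER for t in player_slots.values()))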

The problem has genuine historical signal. Boss weaknesses repeat with variation, rosters are relatively stable, and damage patterns carry forward.

V1 and V2: Building the Signal

Version 1 was simple: take the prior season's allocation by weakness, map it to the new season's boss configuration, and target 98% Lap 1 coverage. No HP scaling, no priority scoring.
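
In outline, V1 fits in a few lines. The function and argument names here are hypothetical; the inputs are just last season's per-weakness allocation and this season's weakness list.

# V1 sketch (hypothetical names): carry last season's per-weakness
# allocation to the new boss configuration, unchanged.
def v1_allocate(prior_allocation, new_weaknesses, pool=96):
    default = pool // len(new_weaknesses)  # even share for an unseen weakness
    plan = {w: prior_allocation.get(w, default) for w in new_weaknesses}
    # No HP scaling, no priority scoring; checking the plan against the
    # 98% Lap 1 coverage target happens downstream.
    return plan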

Against Season 30, this produced 89.4% allocation accuracy: about 10 of 96 teams went to the wrong weakness. The errors were systematic. The algorithm over-allocated Fire and under-allocated Iron because Season 30 Lap 1 HP was 79% lower than Season 29 Lap 2 HP, and V1 had no mechanism to detect that.

V2 added two things. First, an HP scaling factor: for each weakness, divide the target season's Lap 1 HP by the prior season's Lap 2 HP. A ratio above 1.0 means the boss is harder and needs more teams. Second, a boss priority score based on relative HP and historical team availability.

# Dicts keyed by weakness; team_count holds last season's teams per weakness.
hp_scaling, priority = {}, {}
avg_hp = sum(target_hp_lap1.values()) / len(target_hp_lap1)

for weakness in target_hp_lap1:
    # HP scaling: target Lap 1 vs. source Lap 2 (above 1.0 = harder)
    hp_scaling[weakness] = target_hp_lap1[weakness] / source_hp_lap2[weakness]
    # Priority: high relative HP + low historical team share = high priority
    priority[weakness] = (target_hp_lap1[weakness] / avg_hp) / (team_count[weakness] / 96)

Running V2 against Seasons 31 through 35, allocation accuracy held at 95.0% across all five seasons. The HP scaling was doing real work. Season 35's Iron and Wind bosses arrived at 1.51x the prior season's HP; the algorithm flagged both as high-priority and allocated additional teams. That call was correct.

The Backtest Methodology

The backtest covered Seasons 30 through 35: six seasons, 576 total allocation decisions. Each simulation used only data available before the target season. Two accuracy metrics were tracked separately.
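
Structurally this is a standard walk-forward loop; the seasons dict and helper names below are placeholders for the actual pipeline.

# Walk-forward harness (placeholder names): the planner only ever sees
# seasons strictly earlier than the one it is asked to predict.
scores = []
for target in range(30, 36):  # Seasons 30 through 35
    history = {s: d for s, d in seasons.items() if s < target}
    plan = run_algorithm(history, target)
    scores.append(score_season(plan, seasons[target]))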

Allocation accuracy: did the right number of teams go to each weakness? Measured as 100 minus mean absolute error normalized to the 96-team pool. A 95% score means roughly 5 teams misallocated on average.

Damage prediction accuracy: did predicted damage match actual? Measured as 100 minus mean absolute percentage error across weaknesses.
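
Stated as code, the two metrics are a few lines each. This version assumes predictions and actuals arrive as dicts keyed by weakness, which is my framing rather than the pipeline's.

# Allocation accuracy: 100 minus mean absolute team error, normalized
# to the 96-team pool.
def allocation_accuracy(predicted, actual, pool=96):
    mae = sum(abs(predicted[w] - actual[w]) for w in actual) / len(actual)
    return 100 - (mae / pool) * 100

# Damage prediction accuracy: 100 minus mean absolute percentage error.
def damage_accuracy(predicted, actual):
    mape = sum(abs(predicted[w] - actual[w]) / actual[w] for w in actual) / len(actual)
    return 100 - mape * 100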

Tracking them separately matters because they measure different things. Allocation is a structural prediction. Damage magnitude is a behavioral one.

Why Damage Prediction Is Hard

Season 33 is the clearest example. The algorithm predicted Iron-weakness teams would deal approximately 76.9 billion damage. The guild actually dealt 293.6 billion. That is a +281% error.

The cause was a meta shift. Between Season 32 and Season 33, players discovered team compositions optimized for Iron bosses that did not exist in the training data. Season 34 shows the follow-on problem: biased by Season 33's Iron numbers, the algorithm over-allocated toward Iron. The guild diversified instead, and Iron accuracy flipped to -60.9%.

This is the structural limitation. Damage prediction requires forecasting whether players will discover and adopt new strategies. There is no historical feature that reliably predicts when a meta shift will happen or how far it will move. Allocation is a different category: which weakness needs teams is a function of boss HP and historical player preferences, both observable. Damage magnitude is a function of player innovation, which is not.

V3: Refinements from the Backtest

The backtest drove a third version, finalized in December 2025. V3 replaces the fixed 1.15x damage growth assumption with historical variance analysis over the prior three seasons, using mean plus half the standard deviation as a conservative estimate. It adds dynamic safety margins: bosses with priority above 5.0 target 103% Lap 1 coverage instead of 98%, which directly addresses the Season 35 Iron failure where the algorithm produced only 92.3% coverage despite a priority score of 8.03. And it uses a three-season weighted average for allocation rather than a single prior season, reducing the overcorrection pattern visible between S33 and S34.
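
The three changes reduce to small functions. The mean-plus-half-sigma estimate, the 5.0 priority threshold, and the 98%/103% targets come from the description above; the function names and the blend weights are illustrative.

import statistics

# Conservative damage growth: mean + 0.5 * stdev over the prior three
# seasons, replacing the fixed 1.15x assumption.
def conservative_growth(growth_rates):
    return statistics.mean(growth_rates) + 0.5 * statistics.stdev(growth_rates)

# Dynamic safety margin: high-priority bosses get a larger Lap 1 target.
def lap1_coverage_target(priority):
    return 1.03 if priority > 5.0 else 0.98

# Three-season weighted allocation (weights illustrative, newest season first).
def blended_allocation(last_three_plans, weights=(0.5, 0.3, 0.2)):
    return {w: sum(wt * plan[w] for wt, plan in zip(weights, last_three_plans))
            for w in last_three_plans[0]}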

Projected V3 allocation accuracy is 95 to 97%. Damage prediction improves in stable-meta seasons but remains bounded by player unpredictability in meta-shift seasons.

What This Means for Planning

In practice, roughly 91 of 96 team assignments are correct. The errors are typically one or two teams per weakness and correctable in a manual review pass. Damage numbers are order-of-magnitude estimates for identifying coverage margin, not forecasts of final totals. Coverage compliance was the strongest result: 5 of 6 seasons hit the 98% Lap 1 target on every boss. The one failure came down to a missing safety margin, one that V3 now applies automatically.

The structure here is identical to validating a quantitative trading strategy: a systematic rule checked against historical data, with the same failure modes of overfitting and regime change. Season 33's meta shift is a regime change, and regime changes are precisely what historical features cannot anticipate. The correct response is to scope the algorithm's validated domain clearly. Allocation is that domain. Damage prediction is not.

Running the backtest was worth the effort not because the results were surprising, but because they replaced intuition with numbers. 95.0% allocation accuracy is a defensible claim. 52.7% damage accuracy is the honest one that defines the limits.