The 342x Bug: What Happens When You Sum a Pre-Aggregated Field

Anthony G. Tellez · 7 min read
Python · pandas · Data Engineering · Debugging · ETL · Data Quality · Aggregation · Bug Analysis

While validating the output of a game analytics ETL pipeline for Union Raid data, part of a guild tool I was building for an anime rail shooter, I found a number that looked too round to be real. The top electric-weakness team had a historical usage count of 5,963,914. That is almost six million for a guild of 32 players over seven seasons of a mobile game. Something was wrong.

Discovering the Problem

The pipeline processes raw battle records from Union Raid seasons, enriches them with boss weakness mappings, runs clustering and meta team scoring, then outputs a Parquet file that feeds a recommendation system. One of the outputs was a JSON file listing the top 3 most-used teams per elemental weakness across all historical seasons.

The electric #1 team was familiar: a well-known meta composition that had dominated the mode for most of the seasons I had data for. It should have been the most-used team. But six million uses in seven seasons from 32 players is arithmetically impossible. Even if every player used that team for every available attack slot in every season, the upper bound would be in the tens of thousands, not millions.
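That impossibility is easy to sanity-check with back-of-envelope arithmetic. The per-season attack count below is an assumption for illustration, not the game's actual number:

```python
# Back-of-envelope upper bound on total team uses. The attack count per
# player per season is an assumed round number, not the game's real limit.
players = 32
attacks_per_player_per_season = 90  # assumption for illustration
seasons = 7

upper_bound = players * attacks_per_player_per_season * seasons
print(upper_bound)  # 20160 -- tens of thousands, nowhere near six million
```

Even with generous assumptions, the ceiling sits three orders of magnitude below the reported figure.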

I pulled up the Parquet file in pandas and ran a quick sanity check:

import pandas as pd

df = pd.read_parquet("ur_history_enhanced_with_markers.parquet")

# Reproduce the inflated figure: summed "usage" for the most-used team signature
print(df.groupby('team_signature')['usage_count'].sum().max())
# Output: 5963914

That confirmed the inflation lived in the Parquet file itself, not just in the output JSON. The next step was understanding what usage_count actually represented in the schema.

Tracing the Root Cause

The usage_count field was computed during the enrichment step in create_enhanced_data.py. The relevant calculation was:

# Calculate usage count per team signature, per season
team_usage = df.groupby(['season', 'team_signature']).size().reset_index(name='usage_count')
df = df.merge(team_usage, on=['season', 'team_signature'], how='left')

This groups all battle records by season and normalized team composition (the signature is sorted alphabetically, so it is position-independent), counts the number of rows per group, and then merges that count back onto every row. The result is that every row for a given (season, team_signature) pair carries the same usage_count value: the number of times that team composition appeared in that season.

The per-row value is correct. If team A appeared 343 times in a season, every one of those 343 rows has usage_count = 343. That is accurate information, embedded as metadata on each row.
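As a toy illustration (made-up rows, the pipeline's column names), the merge-back stamps the group size onto every member row; pandas' transform('size') produces the same column in a single step:

```python
import pandas as pd

# Toy battle records: team A appears 3 times in season 1, team B once (made-up data)
df = pd.DataFrame({
    "season": [1, 1, 1, 1],
    "team_signature": ["A", "A", "A", "B"],
})

# The merge-back pattern from the enrichment step
team_usage = (
    df.groupby(["season", "team_signature"]).size().reset_index(name="usage_count")
)
df = df.merge(team_usage, on=["season", "team_signature"], how="left")

print(df["usage_count"].tolist())  # [3, 3, 3, 1]

# transform('size') stamps the group size onto each row in one step
assert df["usage_count"].equals(
    df.groupby(["season", "team_signature"])["season"].transform("size")
)
```

Either way, the column that lands on each row is group-level metadata, which is exactly what sets up the bug downstream.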

The bug appeared downstream, in the aggregation step that computed total usage for display. The original code inside filter_and_aggregate was doing this:

# WRONG: summing a pre-aggregated field
for (weakness, norm_comp, cluster_id), group in self.df.groupby(
    ['boss_weakness', 'team_comp_normalized', 'cluster_id']
):
    total_usage = group['usage_count'].sum()

If a group has 343 rows and each row has usage_count = 343, then group['usage_count'].sum() returns 343 * 343 = 117,649. The sum multiplies the per-row value by the group size, inflating the result by a factor of exactly the number of rows in the group.
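The multiplication is easy to reproduce on a three-row toy frame:

```python
import pandas as pd

# Three rows for one team, each carrying the group-level count 3 (toy data)
group = pd.DataFrame({"usage_count": [3, 3, 3]})

naive = group["usage_count"].sum()      # 3 rows x 3 = 9, inflated by group size
correct = group["usage_count"].iloc[0]  # the metadata value itself: 3

print(naive, correct)  # 9 3
```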

Across the entire dataset, the electric #1 team had 17,402 genuine uses. After inflation through this pattern, the computed total was 5,963,914. That is a 342x inflation factor.

Why This Bug Is Subtle

The insidious thing about this class of bug is that the data looks completely clean at every level of inspection you would normally apply.

There are no NaN values in usage_count. There are no negative numbers. The per-row value is not wrong; it accurately describes something real about the data. If you spot-check any individual row, the number makes sense. If you look at the schema, the field is named clearly. Nothing about a naive inspection raises an alarm.

The number 5,963,914 also sounds plausible in a different context. If I were running this analysis on a public game with millions of players and years of data, that would be a reasonable total. The inflation produced a result that looks real at a glance, especially if you do not have strong priors about the expected order of magnitude.

The only signal that something was wrong was domain knowledge about scale. The correct total across all seasons turned out to be 17,402; the inflated total is 342 times that. For a guild of 32 players over seven seasons, six million uses of a single team composition is simply not possible. Without that intuition about the expected order of magnitude, the inflated number looks plausible and the bug ships undetected.

The Fix

The correct approach is to deduplicate first, taking one row per (season, boss_weakness, team_comp_normalized) group before summing. Within each of those groups, every row shares the same usage_count value, so .first() gives the correct group-level count. Then summing across seasons gives the true total:

# CORRECT: deduplicate per season before summing
for (weakness, norm_comp, cluster_id), group in self.df.groupby(
    ['boss_weakness', 'team_comp_normalized', 'cluster_id']
):
    # usage_count is pre-aggregated metadata: same value for all rows
    # with the same (season, boss_weakness, team_comp_normalized).
    # Take .first() per season to get one row per group, then sum across seasons.
    total_usage = group.groupby('season')['usage_count'].first().sum()

With this fix, the electric #1 team correctly shows 17,402 total uses across all seasons instead of 5,963,914: the 342x inflation, gone in a one-line change.
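The fix is just as easy to verify on toy data spanning two seasons (made-up numbers, the same shape as the pipeline's groups):

```python
import pandas as pd

# One team, used 3 times in season 1 and 2 times in season 2; every row
# carries its season's count as pre-aggregated metadata (toy data)
group = pd.DataFrame({
    "season":      [1, 1, 1, 2, 2],
    "usage_count": [3, 3, 3, 2, 2],
})

wrong = group["usage_count"].sum()                            # 3*3 + 2*2 = 13
right = group.groupby("season")["usage_count"].first().sum()  # 3 + 2 = 5

print(wrong, right)  # 13 5
```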

The same pattern appears in generate_top3_by_total_usage.py, the script that generates the top-3 teams JSON file for the recommendation UI. The fix there follows the same structure: deduplicate within seasons first, then aggregate across seasons:

# Step 1: deduplicate usage_count per (season, boss_weakness, team_comp_normalized)
season_usage = df.groupby(
    ['season', 'boss_weakness', 'team_comp_normalized', 'team_signature']
).agg({
    'usage_count': 'first',  # All rows in this group share the same value
    'damage': 'mean'
}).reset_index()

# Step 2: sum across seasons
team_aggregation = season_usage.groupby(
    ['boss_weakness', 'team_comp_normalized']
).agg({
    'usage_count': 'sum',  # Now safe: one row per season per team
    'damage': 'mean',
    'team_signature': lambda x: x.value_counts().index[0]
}).reset_index()
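A sketch of an equivalent formulation, assuming the same column names: drop_duplicates on the group key keeps one row per (season, boss_weakness, team_comp_normalized), which is safe precisely because usage_count is constant within that key:

```python
import pandas as pd

# Toy detail table with pre-aggregated usage_count per (season, team) (made-up data)
df = pd.DataFrame({
    "season":               [1, 1, 2],
    "boss_weakness":        ["electric"] * 3,
    "team_comp_normalized": ["a|b|c"] * 3,
    "usage_count":          [3, 3, 2],
})

# One row per group key; usage_count is identical within the key, so
# keeping the first duplicate loses nothing
season_usage = df.drop_duplicates(
    subset=["season", "boss_weakness", "team_comp_normalized"]
)
total = season_usage.groupby(
    ["boss_weakness", "team_comp_normalized"]
)["usage_count"].sum()

print(total.loc[("electric", "a|b|c")])  # 5
```

Both forms deduplicate before aggregating; which reads better depends on how many other columns need to come along for the ride.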

The General Pattern to Watch For

This is a specific instance of a broader class of data engineering error: aggregating a field that is already an aggregate.

The pattern appears whenever a group-level statistic is stored at the row level as metadata. The per-row value is not a measurement of that row. It is a description of the group the row belongs to. Summing it across rows double-counts in proportion to group size.

Common scenarios where this appears:

  • A count or size field computed by a groupby and then merged back onto the original rows
  • A total or subtotal field that represents a pre-rolled-up summary, embedded in a detail table
  • A rank or score field computed over a partition and joined back to the partition members
  • Any field whose value is identical for all rows in a group defined by some key combination

The test is simple: if all rows with the same group key have the same value in a field, that field is group-level metadata, not row-level data. Summing it without deduplicating will multiply by group size.

The right question to ask before summing any field is: does this value represent a measurement of this specific row, or does it describe the group this row belongs to? If it describes the group, deduplicate to one row per group before aggregating.
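That question can be turned into code. A field is group-level metadata exactly when it takes a single value within every group, which nunique can check; the helper below is a sketch with made-up data:

```python
import pandas as pd

def is_group_metadata(df: pd.DataFrame, keys: list, field: str) -> bool:
    """True if `field` has at most one distinct value within every group defined by `keys`."""
    return bool(df.groupby(keys)[field].nunique().le(1).all())

df = pd.DataFrame({
    "team_signature": ["A", "A", "B"],
    "usage_count":    [3, 3, 1],    # constant per team: group metadata
    "damage":         [10, 12, 7],  # varies per row: a real measurement
})

print(is_group_metadata(df, ["team_signature"], "usage_count"))  # True
print(is_group_metadata(df, ["team_signature"], "damage"))       # False
```

Running a check like this over every numeric column before writing an aggregation would have flagged usage_count as unsafe to sum.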

After fixing the pipeline, I added a validation check that explicitly catches this class of inflation:

def validate_usage_count_inflation(self) -> bool:
    sample_teams = self.df.groupby(['boss_weakness', 'team_comp_normalized']).size().nlargest(5)

    for (weakness, team), _ in sample_teams.items():
        team_df = self.df[
            (self.df['boss_weakness'] == weakness) &
            (self.df['team_comp_normalized'] == team)
        ]

        wrong_total = team_df['usage_count'].sum()
        correct_total = team_df.groupby('season')['usage_count'].first().sum()
        inflation_factor = wrong_total / correct_total if correct_total > 0 else 1

        if inflation_factor > 1.5:
            logger.error(
                f"Inflation detected: naive sum={wrong_total:,}, "
                f"correct sum={correct_total:,} ({inflation_factor:.1f}x)"
            )
            return False

    return True

The validation now runs as part of every pipeline execution. Six million becomes 17,402. The leaderboard is correct. The domain impossibility was the first clue, the per-row value being pre-aggregated metadata was the root cause, and deduplicating before summing was the fix.