Building a Game Analytics Pipeline: ETL, TF-IDF, and K-Means on Team Composition Data
I manage team strategy for a guild in an anime rail shooter, and after tracking results manually through a few seasons, I had enough raw data to ask a real optimization question: given a roster of players with wildly different invested units, who should attack which boss, and with what team? This post describes the ETL and ML pipeline I built to turn raw JSON battle exports into an enriched Parquet file that a Flask web app could query for team recommendations.
The Data and the Problem
The raw data is a JSON export per season, one record per attack attempt. Each record contains the season number, the player name, the boss they attacked (mapped by lap and stage to a boss name and elemental weakness), the damage dealt, and the team composition as an array of unit names. The unit name problem is the first real data quality issue: the same unit appears under multiple spellings across seasons. "Snow White" might appear as "Snow White", "SnowWhite", or "Snow white" depending on who entered the data. The pipeline handles this with a normalization dictionary built from a canonical unit registry extracted from prior seasons.
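A minimal sketch of that normalization step, with illustrative alias entries (the real pipeline builds the dictionary from the canonical unit registry, not a hand-written map):

```python
import re

# Illustrative alias map; the real pipeline derives this from a canonical
# unit registry extracted from prior seasons.
UNIT_ALIASES = {
    "snowwhite": "Snow White",
    "snow white": "Snow White",
    "redhood": "Red Hood",
    "red hood": "Red Hood",
}

def normalize_unit(name: str) -> str:
    """Collapse case and spacing variants to one canonical spelling."""
    key = re.sub(r"\s+", " ", name.strip()).lower()
    return UNIT_ALIASES.get(key, name.strip())

def normalize_team(team: list[str]) -> list[str]:
    # Sort so the same set of units always yields the same composition key.
    return sorted(normalize_unit(u) for u in team)
```

Sorting matters as much as the spelling fix: without it, the same five units entered in a different order would look like a different team downstream.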
Step 1 is auto-discovery and ingestion. The pipeline scans the input directory for files matching UR*.json, loads them as JSONL records, and filters each file to records matching the expected season number. This catches common data entry mistakes where records from a prior season end up in the wrong file. From there, each record is joined against a boss configuration file that maps (lap, stage) tuples to boss names and elemental weaknesses:
def load_battle_data(self, season: int) -> pd.DataFrame:
    battle_file = self.battledata_dir / f"NIKKE_UR{season}_HM.json"
    df = pd.read_json(battle_file, lines=True)
    # Fail loudly if the export schema has drifted.
    required_columns = ['season', 'union', 'mode', 'lap', 'stage', 'player', 'damage', 'is_kill', 'team_comp']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns in battle data: {missing_columns}")
    # Drop records that leaked in from another season's export.
    df = df[df['season'] == season]
    return df
The boss join turns (lap=1, stage=3) into boss_weakness='fire', which is the key dimension for all downstream analysis. After merging across all available seasons, the clean data gets saved to Parquet as the intermediate artifact.
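The join itself is a plain pandas merge on the (lap, stage) key. A sketch, with an illustrative boss table (the real pipeline loads this from the boss configuration file, and the boss names here are placeholders):

```python
import pandas as pd

# Illustrative boss configuration; loaded from a config file in the pipeline.
boss_config = pd.DataFrame([
    {"lap": 1, "stage": 3, "boss_name": "Boss A", "boss_weakness": "fire"},
    {"lap": 1, "stage": 4, "boss_name": "Boss B", "boss_weakness": "iron"},
])

battles = pd.DataFrame([
    {"season": 7, "player": "alice", "lap": 1, "stage": 3, "damage": 1_200_000},
])

# Left join so records with an unknown (lap, stage) survive with NaN weakness
# instead of silently disappearing from the dataset.
enriched = battles.merge(boss_config, on=["lap", "stage"], how="left")
```

The `how="left"` choice is deliberate: a missing boss mapping should surface as a NaN to investigate, not as a dropped record.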
Treating Teams as Documents
The goal is to cluster team compositions, find which clusters represent genuine meta strategies, and score each cluster by its historical performance. The problem with team compositions is representation: a team is an unordered set of 5 unit names, and standard machine learning features want numbers.
TF-IDF turned out to be the right tool here, applied in a context it was not designed for. TF-IDF (Term Frequency-Inverse Document Frequency) normally scores how characteristic a word is in a given document relative to a corpus. High TF-IDF means the word appears often in this document but rarely across the corpus as a whole, so it is probably a meaningful signal about this document's content.
The reframe: each team composition is a "document" and each unit name is a "term." A unit that appears in almost every team (Liter, Crown, typical support units) gets low TF-IDF weight because it is the equivalent of a stop word: it does not differentiate between strategies. A unit that appears only in specific high-performing teams against a particular boss gets high TF-IDF weight, because its presence is actually informative about the strategy being used. The vectorizer encodes this automatically:
tfidf_vectorizer = TfidfVectorizer(
    max_features=1000,
    stop_words=None,
    ngram_range=(1, 2)
)
team_compositions = df['team_comp_normalized'].fillna('').astype(str)
tfidf_matrix = tfidf_vectorizer.fit_transform(team_compositions)
The team_comp_normalized column converts the sorted, pipe-delimited unit list "Crown|Liter|Modernia|Red Hood|Snow White" into the space-separated string "Crown Liter Modernia Red Hood Snow White" so the vectorizer can tokenize it. One wrinkle: the default tokenizer splits multi-word names like "Red Hood" into two tokens, so ngram_range=(1, 2) is doing double duty. Bigrams reassemble multi-word unit names, and because the lists are sorted, units that co-occur also produce stable adjacent bigrams that encode synergies: "Modernia Snow White" appearing together means something different strategically than either unit alone.
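The pipe-to-space conversion is a one-liner, assuming team_comp already holds the sorted, pipe-delimited string:

```python
import pandas as pd

df = pd.DataFrame({"team_comp": ["Crown|Liter|Modernia|Red Hood|Snow White"]})

# Replace the pipe delimiter with spaces so TfidfVectorizer's default
# word tokenizer sees unit names (or name fragments) as tokens.
df["team_comp_normalized"] = df["team_comp"].str.replace("|", " ", regex=False)
```

The `regex=False` flag matters here: `|` is a regex alternation metacharacter, and treating it as a pattern would replace every position in the string.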
Dimensionality Reduction and Clustering
The TF-IDF output is high-dimensional, easily a few hundred features across a corpus of 132 units. K-means works poorly in high-dimensional spaces because distances become less meaningful as dimensions increase. The pipeline reduces dimensionality in two steps before clustering.
First, TruncatedSVD reduces the sparse TF-IDF matrix to 50 components. This is essentially LSA (Latent Semantic Analysis) from information retrieval: the components capture latent topics, which in this context correspond to strategic patterns like "fire-weakness DPS core" or "support-heavy sustained damage." Then UMAP reduces those 50 components to 10 dimensions for clustering, and separately to 2 dimensions for visualization:
umap_reducer = umap.UMAP(
    n_components=10,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
team_embeddings = umap_reducer.fit_transform(team_embeddings_svd)
UMAP preserves local structure better than PCA for this kind of task: teams that are genuinely similar end up close together in the reduced space even when the similarity is not obvious from the raw unit lists. Two teams running slightly different support configurations around the same DPS core should land in the same cluster. PCA would tend to spread them apart, because it optimizes for global variance rather than local neighborhood structure.
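The SVD half of the reduction, which the UMAP snippet above consumes, is a few lines of scikit-learn. A sketch with toy compositions and a small component count (the real pipeline uses n_components=50 on the full corpus):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real team_comp_normalized column.
teams = [
    "Crown Liter Modernia Red Hood Snow White",
    "Crown Liter Alice Red Hood Snow White",
    "Crown Liter Modernia Scarlet Maxwell",
]

tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(teams)

# LSA step: project sparse TF-IDF rows onto dense latent components.
# n_components must stay below min(n_samples, n_features).
svd = TruncatedSVD(n_components=2, random_state=42)
embeddings_svd = svd.fit_transform(tfidf)
```

TruncatedSVD is used rather than PCA at this stage because it operates directly on the sparse TF-IDF matrix without densifying it.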
K-means then runs on the 10-dimensional UMAP output with 15 clusters. Fifteen was chosen by inspecting the elbow curve and by pragmatic reasoning about the space: five elemental weaknesses times roughly three distinct strategic archetypes per element gives a rough upper bound in the right range.
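The clustering step and the elbow inspection together look roughly like this (a sketch; random embeddings stand in for the 10-dimensional UMAP output):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(300, 10))  # stand-in for the UMAP output

# Inspect inertia across candidate k values to locate the elbow;
# the real pipeline settled on k=15.
inertias = {}
for k in (5, 10, 15, 20):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(embeddings)
    inertias[k] = km.inertia_

labels = KMeans(n_clusters=15, n_init=10, random_state=42).fit_predict(embeddings)
```

Inertia always decreases as k grows, so the elbow is a judgment call about diminishing returns, which is why the five-elements-times-three-archetypes sanity check is a useful second opinion.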
Meta Team Scoring
Once every team composition has a cluster assignment, the pipeline calculates a meta score to rank teams within and across clusters:
filtered_team_stats['meta_score'] = (
    filtered_team_stats['avg_damage'] * 0.4 +
    filtered_team_stats['usage_count'] * 0.3 +
    filtered_team_stats['damage_efficiency'] * 0.3
)
The weights here are a judgment call. Average damage (40%) is the primary signal because it directly measures what matters. Usage count (30%) captures community consensus: teams that experienced players ran repeatedly are probably reliable. Damage efficiency (30%), defined as average damage divided by average combat power, rewards teams that punch above their gear weight. A team that deals competitive damage without requiring maxed-out units is more valuable to a guild with a range of investment levels than one that demands top-tier builds.
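A sketch of how the per-team aggregates feeding that formula could be computed. The column names follow the text; `team_key` and `combat_power` are assumed names for the composition key and the per-attempt CP field:

```python
import pandas as pd

attempts = pd.DataFrame({
    "team_key":     ["A", "A", "B"],
    "damage":       [100.0, 120.0, 90.0],
    "combat_power": [50.0, 50.0, 30.0],
})

# One row per distinct composition, aggregated over all attempts.
team_stats = attempts.groupby("team_key").agg(
    avg_damage=("damage", "mean"),
    usage_count=("damage", "size"),
    avg_cp=("combat_power", "mean"),
).reset_index()

# Efficiency rewards damage relative to invested power.
team_stats["damage_efficiency"] = team_stats["avg_damage"] / team_stats["avg_cp"]

team_stats["meta_score"] = (
    team_stats["avg_damage"] * 0.4 +
    team_stats["usage_count"] * 0.3 +
    team_stats["damage_efficiency"] * 0.3
)
```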
The pipeline filters out teams with a whale ratio above 30% before scoring. A whale team is one where the damage is explained more by extreme stat investment than by the strategy itself; a Random Forest regression model trained on unit CP values flags these by comparing actual damage against predicted damage for that CP level. This filter prevents the recommendation system from suggesting teams that are only good because of pay-to-win investment, which is not useful advice for most players.
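A hedged sketch of that flagging step: a regressor learns the damage-versus-CP relationship, and attempts whose damage barely exceeds the CP-based prediction are marked as investment-driven. The feature layout, residual definition, and threshold here are assumptions, not the pipeline's actual values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: damage driven almost entirely by CP plus noise.
rng = np.random.default_rng(0)
cp = rng.uniform(10, 100, size=(200, 1))
damage = 1000 * cp[:, 0] + rng.normal(0, 500, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(cp, damage)

predicted = model.predict(cp)
# Small residual => damage is explained by raw CP, not by strategy.
residual_ratio = (damage - predicted) / predicted
is_whale_win = residual_ratio < 0.05  # hypothetical threshold
```

Per the text, the per-team whale ratio would then be the fraction of a team's attempts flagged this way, with teams above 30% excluded before scoring.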
Why Parquet
The output is a single Parquet file rather than a CSV or database table. Parquet is the right choice here for two reasons.
First, schema enforcement. The Flask web app reads this file at startup and expects specific column types: cluster_id is an integer, boss_weakness is a string, is_meta_team is a boolean. Parquet stores type metadata alongside the data, so reading the file in Python automatically produces a DataFrame with the correct types. A CSV file would require explicit type casting on every read, and a bug in that parsing code would produce silently wrong values rather than an explicit type error.
Second, columnar compression. The UR history data has many repeated string values: union names, boss weakness names, player names. Columnar storage in Parquet compresses these repetitions much more effectively than row-oriented formats because the compression works on entire columns at once rather than individual rows. The clean Parquet file for multiple seasons of data compresses to about 836KB, which is small enough to commit to the repository and fast enough to read into memory in milliseconds at app startup.
The enriched Parquet is the contract between the ETL layer and everything built on top of it; getting that contract right is most of the work.