OCR as a UX Feature: Eliminating Manual Data Entry with Google Cloud Vision
Typing a 10-digit number sounds trivial. It is not, in practice. People fat-finger digits, read numbers wrong, and sometimes just give up and skip recording the result. After a few weeks of garbage data from a mock battle recorder I had built for an anime rail shooter, I added Google Cloud Vision so players could upload a screenshot instead.
The context: guild members needed to record their damage output across union raid practice sessions so we could optimize team compositions. After each mock battle, players had to type a 10-digit damage number from the results screen into a web form. About one in five manual entries required a correction.
The Problem with Manual Entry
The battle result screen displays a "TOTAL DAMAGE" figure that looks something like 1,234,567,890. Players were copying that number by hand into the recorder. The failure modes were consistent: transposed digits, dropped digits, and wrong numbers entirely when players accidentally reported burst damage instead of total damage.
The form itself was fine. The problem was asking humans to accurately transcribe a 10-digit number from a mobile screenshot. That is a fundamentally bad UX pattern regardless of how good your form validation is. OCR turns the transcription problem into a file upload problem: the user drags a screenshot onto the form and the computer reads the number.
The key constraint was that game screenshots are noisy. The result screen contains unit names, level indicators, buff descriptions, elapsed time, and a dozen other numbers that are not the damage value. A naive "extract all numbers" approach would return the wrong result most of the time. I needed a detection strategy that could reliably isolate the total damage figure from everything else on screen.
The Vision API Call
I used text_detection rather than document_text_detection. The distinction matters: document detection is optimized for dense, structured text like PDFs and forms. Text detection works better for UI overlays, game interfaces, and images where text is sparse and positional rather than flowing. In testing on screenshots from the game, text_detection produced cleaner results with less noise around icon labels and partial strings.
The Flask endpoint receives a multipart file upload and passes the image bytes directly to the Vision API:
from google.cloud import vision
from flask import request, jsonify
import re

# app and login_required come from the application setup (not shown)
@app.route('/api/ocr/process', methods=['POST'])
@login_required
def process_ocr():
    if 'screenshot' not in request.files:
        return jsonify({'success': False, 'error': 'No file uploaded'}), 400
    image_data = request.files['screenshot'].read()

    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=image_data)
    response = client.text_detection(image=image)

    # Check for API-level errors before touching the annotations
    if response.error.message:
        return jsonify({'success': False, 'error': 'OCR service unavailable'}), 500

    texts = response.text_annotations
    full_text = texts[0].description if texts else ""
The first element of text_annotations is the full detected text block, concatenated in reading order. Subsequent elements are individual word-level annotations with bounding box coordinates. For this use case I only needed the full text block.
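The shape of that response can be illustrated with a simplified stand-in (plain namedtuples here, not the real protobuf types, which also carry a bounding_poly with vertex coordinates):

```python
from collections import namedtuple

# Simplified stand-in for a Vision text annotation; field name matches the API.
Annotation = namedtuple('Annotation', ['description'])

texts = [
    Annotation('TOTAL DAMAGE\n1,234,567,890'),  # element 0: the full text block
    Annotation('TOTAL'),                         # elements 1..n: word-level pieces
    Annotation('DAMAGE'),
    Annotation('1,234,567,890'),
]

full_text = texts[0].description if texts else ""
words = [t.description for t in texts[1:]]
```

The recorder only ever reads element 0; the word-level entries would matter if the damage value had to be located by position on screen rather than by its label.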
Extracting the Damage Value
The full text from a screenshot looks something like:
Battle Records
Total Battle Time: 03:00
I Tove
II Modernia
III Scarlet
TOTAL DAMAGE
1,234,567,890
No effect in use
LV.522
Four regex patterns run in priority order against this text:
    damage_candidates = []

    # Priority 1: number immediately following "TOTAL DAMAGE"
    total_damage_pattern = r'TOTAL\s+DAMAGE[:\s]*(\d{1,3}(?:,\d{3})*|\d{8,})'
    total_damage_match = re.search(total_damage_pattern, full_text, re.IGNORECASE)
    if total_damage_match:
        damage_num = total_damage_match.group(1).replace(',', '')
        damage_candidates.append((int(damage_num), len(damage_num), 'total_damage'))

    # Priority 2: number following a bare "Damage:" label
    damage_label_pattern = r'Damage[:\s]*(\d{1,3}(?:,\d{3})*|\d{8,})'
    damage_label_match = re.search(damage_label_pattern, full_text, re.IGNORECASE)
    if damage_label_match:
        damage_num = damage_label_match.group(1).replace(',', '')
        damage_candidates.append((int(damage_num), len(damage_num), 'damage_label'))

    # Priority 3: raw 8+ digit numbers (fallback)
    large_numbers = re.findall(r'\b\d{8,}\b', full_text)
    for num in large_numbers:
        damage_candidates.append((int(num), len(num), 'large_number'))

    # Priority 4: comma-formatted numbers (fallback)
    comma_numbers = re.findall(r'\b\d{1,3}(?:,\d{3})+\b', full_text)
    for num in comma_numbers:
        clean_num = int(num.replace(',', ''))
        damage_candidates.append((clean_num, len(str(clean_num)), 'comma_formatted'))

    # Sort by pattern priority first, then prefer longer numbers
    priority_order = {'total_damage': 0, 'damage_label': 1, 'large_number': 2, 'comma_formatted': 3}
    damage_candidates.sort(key=lambda x: (priority_order.get(x[2], 5), -x[1]))

    # Nothing matched at all: tell the frontend so it falls back to manual entry
    if not damage_candidates:
        return jsonify({'success': False, 'error': 'No damage value detected'}), 422

    best_damage = damage_candidates[0][0]
    best_type = damage_candidates[0][2]
    return jsonify({
        'success': True,
        'damage': str(best_damage),
        'confidence': 'high' if best_type in ('total_damage', 'damage_label') else 'medium'
    })
The priority ordering is what makes this work reliably. Pattern 1 catches the most common layout where Vision reads "TOTAL DAMAGE" and the number as adjacent text. Patterns 3 and 4 are fallbacks for screenshots where OCR splits the label and value across separate text blocks. The fallback is less reliable but better than failing entirely.
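To see the ordering in action, the same logic can be pulled into a standalone function and run against the sample text from earlier. This is a sketch, and extract_damage is my name for it, not something from the recorder's codebase; the "Damage:" label pattern is omitted for brevity:

```python
import re

def extract_damage(full_text):
    """Return (value, source_pattern) for the best damage candidate, or None."""
    candidates = []
    # Priority 1: label and value adjacent in the OCR text
    m = re.search(r'TOTAL\s+DAMAGE[:\s]*(\d{1,3}(?:,\d{3})*|\d{8,})',
                  full_text, re.IGNORECASE)
    if m:
        n = m.group(1).replace(',', '')
        candidates.append((int(n), len(n), 'total_damage'))
    # Fallbacks: any 8+ digit run, then any comma-grouped number
    for n in re.findall(r'\b\d{8,}\b', full_text):
        candidates.append((int(n), len(n), 'large_number'))
    for n in re.findall(r'\b\d{1,3}(?:,\d{3})+\b', full_text):
        v = int(n.replace(',', ''))
        candidates.append((v, len(str(v)), 'comma_formatted'))
    priority = {'total_damage': 0, 'large_number': 2, 'comma_formatted': 3}
    candidates.sort(key=lambda x: (priority.get(x[2], 5), -x[1]))
    return (candidates[0][0], candidates[0][2]) if candidates else None

# Label and value adjacent: the high-confidence pattern wins.
print(extract_damage("TOTAL DAMAGE\n1,234,567,890"))  # (1234567890, 'total_damage')

# OCR split the label into a separate block: the fallback still recovers the value.
print(extract_damage("LV.522\n1,234,567,890\n03:00"))  # (1234567890, 'comma_formatted')
```

The second call is exactly the degraded case described above: the label never matches, but the comma-formatted fallback still finds the right number, just with lower confidence.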
Credential Setup
Authentication uses a service account JSON file via the GOOGLE_APPLICATION_CREDENTIALS environment variable. The Vision client picks this up automatically at initialization:
client = vision.ImageAnnotatorClient()
No explicit credential loading is needed in application code as long as the environment variable points to a valid service account JSON. In development I kept the file as credentials.json in the project directory. In production the variable is set in the environment.
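For completeness, the same wiring can be sketched in Python rather than a shell profile. The path below is a placeholder of my choosing, not the project's actual layout; in a real deployment the variable should come from the environment, never from source code:

```python
import os

# Point the Google client libraries at the service account key.
# Placeholder path -- never hard-code a real key location, and keep
# the JSON file itself out of version control.
os.environ.setdefault('GOOGLE_APPLICATION_CREDENTIALS',
                      '/etc/secrets/credentials.json')

# Any ImageAnnotatorClient() constructed after this point picks up the
# credentials automatically via Application Default Credentials.
```

setdefault means a value already exported in the real environment wins over the placeholder.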
This is where the approach gets operationally messy. Service account credentials are long-lived and need rotation. The JSON file contains a private key that must be kept out of version control. A secrets manager (GCP Secret Manager, AWS Secrets Manager) is the right long-term solution, but for an internal guild tool handling non-sensitive game data, an environment variable was acceptable.
What Actually Works and What Does Not
Accuracy on clean screenshots is around 95%. When a player takes a screenshot at a good angle on a modern phone screen, the OCR reads the damage value correctly almost every time. "TOTAL DAMAGE" appears consistently on the result screen, and the number immediately follows it in the text layout.
Accuracy drops significantly on low-quality screenshots. Compressed images from guild Discord shares, screenshots taken in low-light or with glare, and images that have been resized or cropped before upload all cause problems. The Vision API will often still detect text, but the damage number comes back garbled or the "TOTAL DAMAGE" label is missing, pushing the algorithm into its less reliable fallback patterns. When fallbacks trigger, I return confidence: medium in the response rather than high, which the frontend displays to the user.
The fallback path in the UI is important. If OCR fails or returns low confidence, the form field stays editable and the player enters the number manually. The feature saves time for the common case without breaking the uncommon one.
Cost
Google's Vision API pricing as of this writing is $1.50 per 1,000 images for text detection. For a guild tool tracking weekly union raid practice with 20 to 30 active members, that works out to a few cents per week. Cost is not a factor at this scale. If this were a public tool with thousands of daily users, the math changes, but the pricing is competitive with other cloud OCR services.
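The back-of-envelope math, with the member count from above and a screenshot volume that is purely my assumption:

```python
price_per_image = 1.50 / 1000        # $1.50 per 1,000 text detection requests
members = 25                          # mid-range of 20-30 active members
screenshots_per_member_per_week = 4   # assumed practice volume

weekly_cost = members * screenshots_per_member_per_week * price_per_image
print(f"${weekly_cost:.2f} per week")
```

Even at 100 screenshots a week this lands around fifteen cents, so the conclusion holds: at guild scale the API bill is noise.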
The Actual UX Outcome
The change in user behavior was immediate. Before OCR: players typed damage numbers by hand, some skipped recording entirely, and error rates on submitted data were high. After OCR: players drag the screenshot from their phone's gallery onto the form. The damage field populates automatically. They confirm and submit.
Manual entry errors dropped to near zero for screenshot submissions. The players who previously skipped recording because entry was tedious now record consistently. The data quality improvement was more valuable than I expected because it made the downstream team composition analysis trustworthy rather than approximate.
Any time you are asking a user to manually transcribe information that already exists as pixels on a screen, OCR is the right tool. The argument for adding it is not about technical elegance. It is that you are paying a UX cost by leaving it out, and the technical cost of adding it is lower than you think.
Related Articles
Building a Game Analytics Pipeline: ETL, TF-IDF, and K-Means on Team Composition Data
How I applied document similarity techniques to mobile game team compositions, using TF-IDF vectorization and UMAP clustering to identify meta strategies across multiple seasons of Union Raid data.
Backtesting a Team Allocation Algorithm Across Six Seasons of Game Data
Validating a quantitative team allocation strategy for a mobile game cooperative mode against six seasons of historical data, and what the numbers reveal about what algorithms can and cannot predict.
The 342x Bug: What Happens When You Sum a Pre-Aggregated Field
A specific data engineering pitfall where summing a pre-aggregated metadata field inflates totals by the group size, hiding in plain sight because every individual value is correct.