Machine Learning for Malware Domain Detection: Lessons from a Patent

Anthony G. Tellez · 9 min read

In October 2020 I filed patent application 17/072,921 with Splunk Inc., titled "Providing Machine Learning Models for Classifying Domain Names for Malware Detection." This post tells the story behind that work: what problem we were actually trying to solve, how we engineered the system, what broke along the way, and where the approach has real limits.

The Problem with Blocking Domains

DNS is one of the most reliable attack surfaces in enterprise environments. Almost every piece of malware, command-and-control infrastructure, and data exfiltration channel touches DNS at some point. Threat actors know this, which is why domain generation algorithms (DGAs), fast flux DNS, and domain fronting are standard parts of adversary toolkits.

The traditional defenses, blocklists and regex signatures, are fundamentally reactive. A blocklist only knows about domains someone has already seen and reported. By the time a DGA-generated domain makes it onto a threat intel feed, the malware using it has often already rotated to a new domain. Regex patterns work for known bad patterns but fall apart against more sophisticated DGAs that produce lexically plausible strings, or against newly registered lookalike domains that do not match any prior pattern.

The gap I wanted to close: identifying malicious domains at query time, before they appear on any blocklist, using features observable from the domain name and its DNS behavior alone.

Feature Engineering: What a Domain Tells You

The first major insight was that malicious domains carry a distinct statistical signature, even when they are designed to look benign. We built features across three categories.

Domain Lexical Features

The raw string of the domain carries a surprising amount of signal:

  • Length and label structure. Legitimate domains registered for human use tend to be short and memorable. DGA domains and infrastructure domains trend longer, often with deep subdomain hierarchies.
  • Shannon entropy. The character-level entropy of a domain name is a strong DGA signal. Human-chosen names have low entropy; DGA output tends toward high entropy. The catch is that some legitimate technical domains also have high entropy, including UUIDs in cloud infrastructure and CDN hostnames, which creates a false positive problem that took real effort to address.
  • Character distribution. We looked at consonant-to-vowel ratios, digit presence, hyphen frequency, and the proportion of rare trigrams. Human-readable names in English follow predictable phonotactic patterns. DGA output typically does not.
  • N-gram frequency scoring. We trained character-level n-gram language models on a large corpus of known-legitimate domains and scored new domains against that model. A domain that looks nothing like human language gets a low score. This proved more robust than raw entropy because it implicitly captures the structure of real words without requiring exact pattern matches.
  • TLD and second-level domain characteristics. Newly abused TLDs, country-code TLDs used with suspicious second-level names, and domains registered in bulk with sequential naming patterns all carry signal.
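As a rough sketch, the entropy and character-distribution features above can be computed directly from the domain string. The feature names and the simplistic label splitting here are illustrative, not the patent's actual pipeline:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def lexical_features(domain: str) -> dict:
    # Score the leftmost label only; real TLD/subdomain handling is richer.
    label = domain.lower().split(".")[0]
    letters = [ch for ch in label if ch.isalpha()]
    vowels = sum(ch in "aeiou" for ch in letters)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / len(letters) if letters else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in label) / max(len(label), 1),
        "hyphen_count": label.count("-"),
    }
```

A human-chosen name like `google` scores low entropy and a healthy vowel ratio; a DGA-style string like `xj9q2kf7vbn3` scores the opposite on both axes.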

WHOIS and Registration Features

Registration metadata adds another dimension:

  • Domain age at query time. Malicious infrastructure is often short-lived. A domain queried within hours or days of registration is worth more scrutiny than one registered five years ago.
  • Registrar concentration. Some registrars are disproportionately represented in malicious infrastructure due to pricing, privacy defaults, and abuse policy enforcement. This is a weak signal on its own but useful in combination with others.
  • Registration velocity. Bulk domain registrations, where many domains are registered at the same time by the same entity, are a pattern associated with DGA seeding and phishing kit deployment.
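A minimal sketch of two of these features, assuming WHOIS creation timestamps are available at all. The record shape, bucket width, and field names are illustrative:

```python
from collections import Counter
from datetime import datetime, timezone

def age_days(created: datetime, queried: datetime) -> float:
    """Domain age at query time, in days; hours-old domains warrant scrutiny."""
    return (queried - created).total_seconds() / 86400

def bulk_registration_counts(records, window_s: int = 600):
    """Rough registration-velocity proxy: how many domains each registrant
    created inside the same short time bucket. `records` is a list of
    (registrant, created_at) tuples -- an illustrative schema."""
    buckets = Counter()
    for registrant, created in records:
        buckets[(registrant, int(created.timestamp()) // window_s)] += 1
    return buckets
```

In practice these feed the classifier as supplementary features, as noted below, since the underlying WHOIS data is increasingly unavailable.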

The problem with WHOIS features is availability. Privacy-protecting registration services and GDPR compliance have progressively eroded access to WHOIS data, which means models that rely heavily on registration metadata become less reliable over time. We treated these as supplementary features rather than load-bearing ones.

DNS Behavioral Features

Observed DNS behavior in the network adds features that are invisible from the domain string alone:

  • TTL values. Fast flux DNS uses very short TTLs to rotate IP addresses rapidly. A domain resolving to a new IP every 30 seconds is not behaving like a normal CDN.
  • Query velocity. How often is this domain queried, and by how many distinct internal hosts? A domain queried by a single machine late at night, never queried before, looks different from a domain with consistent query volume across the organization.
  • Answer diversity. The number of distinct IP addresses a domain resolves to over a short window is a signal for both fast flux and CDN infrastructure. Distinguishing between the two requires combining this with other features.
  • Geographic distribution of resolved IPs. Legitimate businesses often resolve to geographically dispersed infrastructure. Malicious actors running cheap compromised hosts often resolve to a small number of concentrated IP ranges.
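As a sketch, these behavioral aggregates can be computed from parsed DNS log records over a time window. The field names below are illustrative, not Splunk's actual schema:

```python
from collections import defaultdict

def behavioral_features(records):
    """Per-domain aggregates over a window of parsed DNS answer records."""
    agg = defaultdict(lambda: {"clients": set(), "ips": set(), "ttls": []})
    for r in records:
        a = agg[r["domain"]]
        a["clients"].add(r["client"])
        a["ips"].add(r["answer_ip"])
        a["ttls"].append(r["ttl"])
    return {
        domain: {
            "query_count": len(a["ttls"]),
            "distinct_clients": len(a["clients"]),  # query spread across hosts
            "answer_diversity": len(a["ips"]),      # fast flux or CDN candidate
            "min_ttl": min(a["ttls"]),              # very low => possible fast flux
        }
        for domain, a in agg.items()
    }
```

As the list above notes, answer diversity alone cannot separate fast flux from a CDN; it has to be read together with TTLs and the other features.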

Model Architecture Choices

Given the feature set, we had to decide how to combine them into a classifier. We evaluated several approaches.

A single gradient-boosted tree over the full feature vector worked reasonably well as a baseline. The problem was that different feature groups have different reliability profiles and different latency characteristics. WHOIS features might be unavailable or stale. DNS behavioral features require historical data that is not always present for newly seen domains. A single model that sees all features at once either ignores missing values or degrades unpredictably when they are absent.

We moved to an ensemble approach: separate models for lexical features, behavioral features, and registration features, with a meta-learner combining their outputs. This architecture was more maintainable and more interpretable. When a domain was flagged, you could say it was flagged primarily by the lexical model because of unusual character distribution, rather than just "the model said so." Explainability mattered in enterprise environments where a false positive could block a production service and require a postmortem.
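A toy version of the combination step, assuming each sub-model emits a probability, or None when its feature group is unavailable. In the real system the meta-learner is trained, not hand-weighted; the weights here are purely illustrative:

```python
import math

def ensemble_score(scores: dict, weights: dict) -> float:
    """Combine sub-model probabilities in log-odds space, dropping missing
    models and renormalizing their weights so the ensemble degrades
    gracefully (rather than guessing) when, e.g., WHOIS data is absent."""
    avail = {k: p for k, p in scores.items() if p is not None}
    if not avail:
        raise ValueError("no sub-model produced a score")
    total_w = sum(weights[k] for k in avail)
    eps = 1e-6

    def logit(p: float) -> float:
        p = min(max(p, eps), 1 - eps)  # clamp away from 0/1
        return math.log(p / (1 - p))

    combined = sum((weights[k] / total_w) * logit(p) for k, p in avail.items())
    return 1 / (1 + math.exp(-combined))
```

A nice side effect of this structure is the explainability mentioned above: the per-model scores survive into the output, so a flag can be attributed to the lexical model rather than to an opaque blob.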

Class imbalance was the other major architectural challenge. In a real enterprise DNS log, the overwhelming majority of queries are to legitimate domains. Malicious domains might be a fraction of a percent of total query volume. Naive classifiers trained on this distribution learn to predict "legitimate" for everything and achieve 99.9% accuracy while being completely useless.

We addressed this through a combination of techniques: stratified sampling during training, class-weighted loss functions, and calibrated probability outputs rather than hard thresholds. The calibrated probabilities were especially important. Rather than outputting a binary decision, the model output a confidence score that could be thresholded differently depending on the risk tolerance of each deployment. A SOC doing active threat hunting might set a lower threshold. An automated blocking rule needed a higher one.
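A sketch of two of those techniques: a class-weighted binary log loss that up-weights the rare malicious class, and per-deployment thresholds applied to the calibrated score. All numbers here are illustrative:

```python
import math

def weighted_log_loss(y_true, y_prob, pos_weight: float) -> float:
    """Binary cross-entropy where positives (malicious) count pos_weight
    times as much, so 'predict legitimate for everything' stops being
    a cheap way to minimize the loss."""
    eps = 1e-12
    loss = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        w = pos_weight if y == 1 else 1.0
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(y_true)

# Illustrative per-deployment thresholds on the calibrated probability.
THRESHOLDS = {"threat_hunting": 0.5, "auto_block": 0.95}

def decide(calibrated_prob: float, profile: str) -> bool:
    return calibrated_prob >= THRESHOLDS[profile]
```

The same calibrated score thus drives both workflows: the SOC hunts on borderline domains that the automated blocking rule deliberately leaves alone.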

Production Lessons at Scale

Getting a model to work in a notebook is one problem. Getting it to work in production at Splunk scale is a different problem.

Latency requirements. DNS classification needs to happen fast enough to be useful. If the classification job runs on a batch of yesterday's DNS logs, you are doing threat hunting, not prevention. Real-time or near-real-time classification required attention to inference latency. Our feature extraction pipeline was often slower than model inference itself, particularly WHOIS lookups and DNS behavioral aggregations that required joins across large time windows.

Model drift. Attackers adapt. A DGA classification model trained six months ago will be less effective against DGA implementations that have evolved since then. We monitored model performance against incoming threat intel feeds: if known-bad domains started scoring as low-risk, that was a signal the model was drifting. Continuous retraining with fresh labeled data was necessary, not a one-time activity.
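The drift monitor can be sketched as a comparison between the model's baseline score on known-bad domains and its score on freshly confirmed bad domains from incoming threat intel. The tolerance value is illustrative:

```python
def drift_check(fresh_bad_scores, baseline_mean: float, tolerance: float = 0.15):
    """Flag drift when domains confirmed bad by fresh threat intel start
    scoring noticeably lower than known-bad domains did at training time."""
    current_mean = sum(fresh_bad_scores) / len(fresh_bad_scores)
    return {
        "current_mean": current_mean,
        "drifted": baseline_mean - current_mean > tolerance,
    }
```

A `drifted` result is a retraining trigger for the pipeline, not an automatic rollback.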

False positive cost. In consumer security products, a false positive is annoying. In an enterprise environment, a false positive can block a critical business system. A false positive on a domain used by a financial data provider can trigger regulatory notifications. We spent significant effort understanding where our model was wrong, not just where it was right. Systematic false positive analysis shaped both the feature engineering and the confidence thresholding decisions.

What Made It Novel Enough to Patent

The patent claim is not that machine learning can detect malicious domains, because that is already well established in academic literature. The novelty is in the architecture for deploying these models at scale within a security analytics platform: specifically, how we structure the feature extraction pipeline to operate on streaming DNS data, how we combine models trained at different granularities (domain string versus organizational behavior context), and how we integrate model outputs into actionable detection workflows within the broader Splunk ecosystem.

The combination of streaming feature extraction, ensemble scoring, and integration with broader security context in a way that was operationally deployable was the differentiated piece. Closing the gap between research-grade detection and enterprise-deployable detection at scale is harder than it sounds.

Honest Limitations

I would not claim this system solved malicious domain detection. It shifted the arms race a few steps, which is the realistic goal.

Sophisticated adversaries who know they are being ML-classified can adapt their DGAs to produce more lexically natural output, register domains well in advance of use to age them past detection thresholds, and route traffic through legitimate infrastructure such as compromised sites and trusted cloud services that will never be blocked by a domain classifier alone. Against those adversaries, domain-name-based classification is a first filter, not a final answer.

The WHOIS feature erosion problem is getting worse, not better. Privacy regulations that are good policy for users are genuinely difficult for threat intelligence.

And the false positive problem in enterprise environments is harder than any academic paper on DGA detection acknowledges. Real production deployments involve legacy systems with unusual domain naming, third-party vendors with legitimately strange infrastructure, and internal DNS namespaces that look like DGA output to an external model. Managing those edge cases consumed more engineering time than the model development itself.

The work was worth doing. It was also a good reminder that the hardest part of applied machine learning is rarely the model.