SuriCon 2017 - Hunting BotNets: Suricata Advanced Security Analytics

Anthony G. Tellez
Tags: Suricata, Splunk, Botnet, Machine Learning, Security, IDS, SuriCon, Conference, 2017

The gap in most SOC environments isn't data. Suricata generates rich telemetry — flow records, alerts, protocol metadata — and once that data is in Splunk, you have more than you can read manually. The gap is between having the data and knowing what to do with it at scale. Signature-based detection tells you when something matches a known bad pattern. It doesn't tell you anything about the thousands of connections that don't match any signature but are still wrong.

That's the problem machine learning is actually useful for in a security context, and it's what this talk at SuriCon 2017 was about.

What Botnets Look Like in the Data

Botnet command and control traffic has structural properties that are hard to hide. C2 clients check in periodically, which produces beaconing — connections to the same destination at regular intervals, often with similar payload sizes and connection durations. No human generates that pattern. An analyst looking at one hour of logs might not see it. A model trained to detect periodic behavior in connection frequency will.

The problem is that beaconing is easy to describe and hard to threshold. You can't write a Suricata rule that says "alert if any host contacts the same external IP more than N times per hour on the same port at intervals of roughly M minutes." That rule would fire on legitimate software update checks, NTP sync, and half the monitoring tools in your environment. The pattern needs context — what else is that host doing, what's the volume, does the timing drift or is it mechanical — and that context is what machine learning can encode.
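To make the timing test concrete, here is a minimal sketch in Python rather than SPL: score a host-to-destination connection series by the coefficient of variation of its inter-connection intervals. A C2 check-in timer produces near-zero variation; human-driven traffic drifts. The function name, thresholds, and sample timestamps are illustrative, not from the talk.

```python
from statistics import mean, stdev

def beacon_score(timestamps):
    """Score how mechanical the timing of a connection series is.

    Computes the coefficient of variation (stdev / mean) of the
    inter-arrival intervals: a low value means the gaps between
    connections are nearly constant, which suggests beaconing.
    """
    if len(timestamps) < 3:
        return None  # too few connections to judge periodicity
    ts = sorted(timestamps)
    intervals = [b - a for a, b in zip(ts, ts[1:])]
    mu = mean(intervals)
    if mu == 0:
        return None
    return stdev(intervals) / mu

# Check-ins every ~300 seconds with slight jitter: near-zero score.
beacony = [0, 301, 599, 900, 1202, 1499]
# Ad-hoc browsing: irregular gaps, high score.
ad_hoc = [0, 12, 340, 365, 2100, 2111]

print(beacon_score(beacony) < 0.05)  # mechanical timing
print(beacon_score(ad_hoc) > 0.5)    # human-driven timing
```

The useful part is that the score carries the context a static rule can't: software updaters and NTP also beacon, but they do so to known destinations, so the score becomes one feature among several rather than an alert by itself.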

Detecting Data Exfiltration

Data exfiltration shows up differently than C2. The signature is volume and direction. An internal host pushing sustained outbound traffic to an unfamiliar external destination, especially at unusual hours or to destinations with no historical connection baseline, is the pattern. Individually, any one of those signals is noise. Together, they're a finding.

Splunk's statistical functions handle this well enough for a first pass: timechart to see upload volume trends, streamstats to compute rolling averages per host, eval to calculate the ratio of bytes out to bytes in per connection. When those numbers deviate significantly from a host's baseline, something changed. Whether that change is legitimate or adversarial is the analyst's call — but surfacing the deviation automatically is the ML step.

Port and Traffic Analysis

Scanning shows up as a specific shape in connection data: many destination IPs or ports from a single source, with very short connection durations and high rates of resets or failures. Suricata's flow records capture all of this. The issue is that "high rate of resets" is relative to the environment. What counts as a scan in a small office looks like background noise in an enterprise network perimeter.

Clustering is useful here. Group hosts by their connection behavior features — unique destinations per hour, port spread, success rate, average bytes per flow — and look for hosts that cluster apart from everything else. An isolated cluster of one or two hosts with unusual port spread and low success rates is worth investigating regardless of whether any signature fired.
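A minimal k-means sketch of that idea in pure Python. In practice the same step runs in the Splunk ML Toolkit against real flow features; the host names and feature values below (destinations/hour, port spread, success rate, avg bytes per flow, each pre-scaled to 0–1) are invented for illustration.

```python
import math
import random

def kmeans(points, k=2, iters=50, seed=7):
    """Minimal k-means over per-host feature vectors.

    `points` maps host -> feature vector, with features already
    scaled to comparable ranges. The interesting output is a small
    cluster that separates from the bulk of the population.
    """
    rng = random.Random(seed)
    hosts = list(points)
    centroids = [list(points[h]) for h in rng.sample(hosts, k)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for _ in range(iters):
        # Assign each host to its nearest centroid.
        labels = {h: min(range(k), key=lambda c: dist(points[h], centroids[c]))
                  for h in hosts}
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [points[h] for h in hosts if labels[h] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Three ordinary workstations, plus one scanner-like host:
# many destinations, wide port spread, low success rate, tiny flows.
features = {
    "ws-01":  (0.1, 0.1, 0.95, 0.5),
    "ws-02":  (0.2, 0.1, 0.90, 0.6),
    "ws-03":  (0.1, 0.2, 0.92, 0.4),
    "scan-x": (0.9, 0.9, 0.05, 0.1),
}
labels = kmeans(features)
print(labels["scan-x"] != labels["ws-01"])  # the scanner clusters apart
```

Because the clustering is relative to the population, the same code adapts to the small-office and enterprise cases without retuning thresholds: what matters is distance from everything else, not any absolute rate.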

Making ML Work in a Production SOC

The operational friction with machine learning in a SOC is real. Models need to be trained on data that reflects your environment, which means you can't just download a pre-trained model and deploy it. False positives erode analyst trust faster than missed detections do. And every model needs a feedback loop — a way to tell it which findings were real and which weren't — or it drifts out of calibration over time.

The Splunk Machine Learning Toolkit reduces some of this friction. The assistants walk through clustering, anomaly detection, and classification without requiring SPL expertise for the ML steps themselves. Training a K-means model against Suricata flow data, applying that model to new data, and building an alert when a host falls into an anomalous cluster can be done entirely within the Splunk UI. That doesn't solve the calibration problem, but it removes the barrier between a security analyst who understands the problem and a deployed detection.
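The apply-and-alert half of that flow can be sketched outside Splunk as well: take centroids fitted on historical data, assign a new host to its nearest one, and alert when it lands in a cluster that held almost none of the training population. The centroids, cluster sizes, and the rarity threshold here are all invented for illustration.

```python
import math

def assign_and_alert(centroids, cluster_sizes, features, rare_frac=0.05):
    """Apply a fitted clustering model to one new host.

    `centroids` are the fitted cluster centers and `cluster_sizes`
    how many training hosts fell into each. A host landing in a
    cluster holding under `rare_frac` of the population is flagged
    as anomalous.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = min(range(len(centroids)),
                  key=lambda c: dist(features, centroids[c]))
    total = sum(cluster_sizes)
    return nearest, cluster_sizes[nearest] / total < rare_frac

# Fitted offline: one big "normal" cluster, one tiny outlier cluster.
centroids = [(0.15, 0.12, 0.92), (0.88, 0.85, 0.07)]
sizes = [480, 3]

print(assign_and_alert(centroids, sizes, (0.8, 0.9, 0.1)))  # (1, True)
```

This is the piece that turns a one-off hunt into a standing detection: the fit step runs on a schedule, and the apply step runs against each new window of flow data.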

The workflow I covered in this talk was specifically about botnet hunting: start with Suricata flow data, engineer features around connection periodicity and volume ratios, cluster to find behavioral outliers, and use those clusters as the input to a classification model that predicts whether a host is engaging in C2 activity. The results won't catch everything. But they will surface activity that no signature in the Emerging Threats ruleset would have flagged.

Presented at SuriCon 2017