How a GitHub Archive of BASHLITE Ended Up in Academic Research
In 2015, while working as a security data scientist at Splunk, I archived the source code for the BASHLITE botnet (also known as Gafgyt, LizardStresser, and Torlus) in a public GitHub repository. The intent was to make the source available for research, detection engineering, and the development of indicators of compromise (IOCs). BASHLITE was one of the first major IoT botnets, predating Mirai by over a year, and having the source available was critical for understanding how these families worked at a code level — not just observing their traffic, but reading the logic that generated it.
A decade later, that repository has been cited in peer-reviewed academic research as a primary source for BASHLITE malware analysis and detection methodology.
Why Source Code Matters for Detection
Most malware analysis focuses on behavioral observation: traffic patterns, command and control protocols, payload signatures. Source code analysis is a different discipline entirely. When you can read the actual code, you can identify detection opportunities that behavioral analysis alone would miss.
Mirai is the canonical example. Its scanner.c module contained a hardcoded table of 62 default credentials used to brute-force IoT devices — combinations like root:xc3511 and admin:7ujMko0admin targeting specific DVRs, routers, and IP cameras. That credential table wasn't just a list of passwords. It was a fingerprint. Any scanner hitting a honeypot with that exact sequence of credential pairs, in that order, was almost certainly a Mirai variant. You could build high-confidence detection rules from the source that would have been much harder to derive from traffic captures alone.
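To make the fingerprinting idea concrete, here is a minimal sketch of an ordered credential-sequence check. The two pairs shown do appear in Mirai's scanner.c, but the truncated table, the `matches_fingerprint` helper, and the `min_hits` threshold are all hypothetical illustrations of the technique, not code from Mirai or from any production detection rule:

```python
# Illustrative sketch: fingerprinting a scanner by its credential sequence.
# The two pairs below appear in Mirai's scanner.c; the matching logic and
# the min_hits threshold are hypothetical stand-ins for a honeypot check.

# Ordered credential pairs from the malware source (truncated here).
MIRAI_CRED_SEQUENCE = [
    ("root", "xc3511"),
    ("admin", "7ujMko0admin"),
    # ... the full scanner.c table holds dozens more pairs ...
]

def matches_fingerprint(attempts, fingerprint, min_hits=2):
    """Return True if the attacker's login attempts contain the first
    min_hits fingerprint pairs in the same relative order."""
    it = iter(attempts)
    hits = 0
    for pair in fingerprint[:min_hits]:
        for attempt in it:
            if attempt == pair:
                hits += 1
                break
    return hits >= min_hits

# Honeypot log of (username, password) attempts from one source IP.
observed = [("root", "admin"), ("root", "xc3511"), ("admin", "7ujMko0admin")]
print(matches_fingerprint(observed, MIRAI_CRED_SEQUENCE))  # True
```

Ordering is the point: many scanners try `root:admin`, but only a Mirai-derived scanner walks that specific table in that specific sequence.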
BASHLITE had its own patterns. The gayfgt identifier string, the specific implementation of its C2 protocol, the way it structured its scanner and payload delivery — all of these were artifacts visible in the source that became the basis for fingerprinting and classification in the research that cited the repository.
The value of publishing source code for research is that it lets the defensive community build detection methods grounded in what the malware actually does, not just what it appears to do on the wire.
A Note on Liability and Responsible Disclosure
Publishing malware source code is not without risk, and it shouldn't be done casually. There is real legal exposure. Hosting functional malware — even for research — can create liability depending on jurisdiction, and there's always the question of whether you're providing a resource for defenders or a toolkit for attackers.
The security research community has developed norms around this. Sites like Malware Traffic Analysis make packet captures and samples available, but distribute them encrypted with published passwords specifically to prevent accidental execution and to establish clear intent. Academic malware repositories require institutional affiliation and agreements. Even VirusTotal gates access behind API keys and terms of service.
The BASHLITE archive was source code, not compiled binaries, which makes the liability calculus different — you can't accidentally run a C file — but the intent still matters. The repository was created in the context of active security research at Splunk, where we were building Suricata rules and ML models against botnet traffic. The code was the reference material for that detection work. The repository is now offline, and the code itself was never secret — it had been circulating on paste sites and forums for years before I archived it. What the GitHub repository provided was a stable, citable location that researchers could reference in published work, which is exactly how it was used.
If you're considering publishing malware samples or source code for research purposes, think carefully about:
- Jurisdiction: Laws like the CFAA (US) and the Computer Misuse Act (UK) have broad language. Research intent is a defense, but not an automatic safe harbor.
- Context: Are you affiliated with an institution or company that provides a research justification? Is the material already publicly available elsewhere?
- Format: Source code, encrypted archives, and behavioral indicators carry different risk profiles than ready-to-deploy binaries.
- Access controls: Consider whether gating access (even loosely) better serves the research community than fully open distribution.
The Citations
SourceFinder: Finding Malware Source-Code from Publicly Available Repositories
Authors: Md Omar Faruk Rokon, Risul Islam, Ahmad Darki, Vagelis E. Papalexakis, Michalis Faloutsos
Venue: RAID 2020 (23rd International Symposium on Research in Attacks, Intrusions and Defenses), published by USENIX
Citations: ~69
SourceFinder developed a systematic approach to identifying malware source code across public repositories. The researchers built classifiers to distinguish malicious repositories from benign ones at scale. In their analysis of IoT malware trends on GitHub, they specifically noted:
"Interestingly, the source code of the original BASHLITE botnet is available in a repository created by anthonygtellez in 2015."
The repository was used as a ground-truth data point in understanding the landscape of malware source code on public platforms — how it gets there, who uploads it, and whether it's identifiable through automated analysis. SourceFinder was later incorporated into the lead author's PhD dissertation at UC Riverside.
Improving IoT Botnet Investigation Using an Adaptive Network Layer
Authors: Joao Marcelo Ceron, Klaus Steding-Jessen, Cristine Hoepers, Lisandro Zambenedetti Granville, Cintia Borges Margi
Venue: Sensors (MDPI), Volume 19, Issue 3, 2019
Citations: ~68
Ceron et al. proposed an adaptive network layer for IoT botnet investigation. Their approach required deep analysis of how botnet families structure their C2 communications, and they used the BASHLITE source from my repository as one of two primary references (alongside Mirai) for that analysis:
Tellez A. The Malware Bashlite Source Code. Available online: https://github.com/anthonygtellez/BASHLITE/
This is the kind of citation that demonstrates the detection engineering value directly. The researchers weren't just acknowledging the code existed — they were reading the C2 protocol implementation to build their adaptive detection framework. Without access to the source, that work would have relied entirely on reverse engineering compiled binaries or inferring protocol structure from traffic, both significantly more labor-intensive approaches.
Techniques for Detecting Compromised IoT Devices
Authors: Ivo van der Elzen, Jeroen van Heugten
Venue: University of Amsterdam, System & Network Engineering (OS3) Master Research Project, 2017
Citations: ~48
This research project developed fingerprinting techniques for identifying compromised IoT devices running Mirai and BASHLITE. The authors cited the repository as:
Anthony Tellez. An archive of BASHLITE source code. https://github.com/anthonygtellez/BASHLITE
They went deep into the code. Figure 16 of the paper shows the BASHLITE gayfgt signature string in octal representation, extracted directly from the archived source, used as a fingerprint for identifying infected devices. This is source-code-driven detection at its most direct: a unique string from the binary's construction becomes the basis for a network-level identification rule.
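A minimal sketch of how a string like that becomes a network-level fingerprint: render the identifier as byte-level octal escapes (mirroring how the paper's figure presents it) and match it against captured payloads. The helper functions below are illustrative, not rules or code taken from the paper:

```python
# Illustrative sketch: deriving a byte-level fingerprint from the "gayfgt"
# identifier string in the BASHLITE source. The octal rendering mirrors how
# a C-style escape table would present it; the matching helper is a
# hypothetical stand-in for a content-match rule, not one from the paper.

SIGNATURE = b"gayfgt"

def to_octal_escapes(data: bytes) -> str:
    """Render bytes as C-style octal escapes, e.g. b'g' -> '\\147'."""
    return "".join(f"\\{b:03o}" for b in data)

def payload_contains_signature(payload: bytes) -> bool:
    """Network-level check: does a captured payload carry the identifier?"""
    return SIGNATURE in payload

print(to_octal_escapes(SIGNATURE))                 # \147\141\171\146\147\164
print(payload_contains_signature(b"PING gayfgt"))  # True
```

Because the string is baked into the source, every build of that family carries it, which is what makes a simple content match so reliable.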
Impact
These aren't marginal papers. The SourceFinder and Ceron papers have a combined ~137 citations and appeared at USENIX RAID and in MDPI Sensors, both well-regarded venues in the security research community. The van der Elzen report has been cited nearly 50 times as a practical reference for IoT malware detection. The combined citation count of the papers referencing the repository exceeds 180.
More concretely, the research that cited this archive directly advanced IoT botnet detection: automated identification of malware repositories (SourceFinder), adaptive C2 protocol analysis (Ceron), and device-level compromise fingerprinting (van der Elzen). Each of these represents a different layer of the detection stack, and each used the source code as a primary reference rather than relying solely on behavioral observation.
The BASHLITE repository itself is now offline, but forks survive on GitHub and the citations are permanent in the academic record. The original archive served its purpose: it gave researchers a stable, citable source for one of the foundational IoT botnets at a time when that code was scattered across paste sites and underground forums with no persistent, referenceable location.
Context
This work was part of a broader effort during my time at Splunk, where I focused on applying data science and machine learning to network security. The botnet research fed directly into conference talks at SuriCon (Hunting BotNets, Malware Analysis) and ultimately into a patent for ML-based malware domain detection (US11843622B1), which uses pre-trained models to classify domain names associated with botnets, ransomware, and other threats based on DNS log data.
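For a flavor of what domain-name classification from DNS logs involves, here is a sketch of common lexical features (length, digit ratio, character entropy). This is a generic illustration of the technique class, emphatically not the method claimed in US11843622B1:

```python
# Illustrative sketch of lexical features often used to classify domain
# names from DNS logs. Generic technique illustration only — not the
# method described in patent US11843622B1.
import math
from collections import Counter

def domain_features(domain: str) -> dict:
    """Extract simple lexical features from a domain's leftmost label."""
    label = domain.split(".")[0].lower()
    counts = Counter(label)
    n = len(label)
    # Shannon entropy over character frequencies, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        "length": n,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n if n else 0.0,
        "entropy": round(entropy, 3),
    }

# Algorithmically generated domains tend to be longer and higher-entropy
# than human-chosen ones, which is what gives a classifier signal.
print(domain_features("google.com"))
print(domain_features("x3f9q7zkp1m8w2.com"))
```

Features like these would feed a downstream model; the specifics of the patented approach are in the patent itself.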
The thread connecting all of this is the same: source code and raw data, made available for analysis, enable better detection. The BASHLITE archive was one node in a pipeline that ran from raw malware source through Suricata rule development, conference presentations, ML model training, and eventually a granted patent. Each step built on having access to the underlying material, not just the abstracted indicators.
Related Articles
Machine Learning for Malware Domain Detection: Lessons from a Patent
How we built machine learning models to classify malicious domains at Splunk scale, covering the feature engineering, model architecture tradeoffs, production challenges, and what made the approach novel enough to patent.
SuriCon 2017 - Hunting BotNets: Suricata Advanced Security Analytics
Practical machine learning techniques for botnet detection using Suricata data, covering data exfiltration, traffic analysis, and advanced threat detection workflows.
Analyzing BotNets with Suricata & Machine Learning
Using Splunk's Machine Learning Toolkit and Suricata data to analyze and predict Mirai botnet activity through K-means clustering and Random Forest classification.