Supercharging Security with RAG: SuriCon 2024

Anthony G. Tellez · 9 min read

Tags: RAG, Security, Suricata, AI, LLM, Splunk, SuriCon, Conference, Graphistry, OpenAI, Rule Management, 2024

In November 2024, I co-presented with Leo Meyerovich, CEO of Graphistry, at SuriCon 2024 in Madrid. The talk was called "Supercharging Security with RAG," and it covered work I had been doing at BNY Mellon in partnership with Graphistry: whether AI could materially help with rule management, one of security's most persistently painful problems.

Rule management is broken across security products. Teams maintain thousands of detection rules and constantly fight redundancy, while junior analysts have no efficient path to learn from what already exists. We wanted to know whether RAG could change that. This is what we built, demoed, and learned.

Starting with the obvious approach

We started with prompt engineering, because that is where everyone starts. Give GPT-4o a well-structured prompt about how to write Suricata rules, upload examples, ask it to generate new ones.

Your role is to help cyber security analysts make sense of Suricata rules,
create new rules and optimize existing rules to ensure there is no duplication
of alerting or noisy alerts that would overwhelm analysts...

The LLM could generate syntactically correct rules. That was not the problem. The problem was that each session started from scratch. No memory between queries. We had to re-upload context every time. Without knowledge of our existing rule library, the model regularly recreated rules we already had or missed organization-specific patterns entirely. The output was generic in the worst sense — technically fine, practically useless.

The failure mode here is subtle. A model that produces correct-looking output with no grounding in your actual environment is not helping; it is generating plausible noise that still requires a human expert to validate. The expert was doing as much work as before, just with an extra step.

Moving to RAG

The question we asked ourselves: what if we could teach the system about our rules semantically, not just syntactically? That question is the premise of RAG.

The core idea is straightforward. Embed your existing rules as vectors, store them in a vector index, and when a query comes in, retrieve the most semantically relevant rules and pass them as context to the generation model. The model now knows what you already have before it suggests anything new.
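Stripped of the specific stack, that retrieve-then-generate loop is a few lines of logic. The sketch below uses a toy in-memory index and hand-rolled cosine similarity purely for illustration; in the pipeline described here, the index is the vector store and the assembled prompt goes to GPT-4o:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, index, k=3):
    """Return the k rules whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(query_text, retrieved):
    """Ground the generation model in the retrieved rules."""
    context = "\n".join(f"- {r['rule']}" for r in retrieved)
    return (
        "You maintain our Suricata rule library. Existing rules most "
        f"relevant to this request:\n{context}\n\n"
        f"Request: {query_text}\n"
        "Suggest a new rule only if it is not already covered above."
    )
```

The key property is that retrieval happens before generation, so the model sees what the organization already has instead of starting from a blank context window.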

Building the embeddings pipeline

Here is the stack we used:

  • OpenAI text-embedding-3-small for embedding, trimmed to 384 dimensions
  • LangChain for batch processing and orchestration
  • Lotus to rewrite rule descriptions using GPT-4o before embedding
  • BigQuery for data refinement and storage
  • VectorAI as the vector index
  • Google Colab for experimentation notebooks
  • Graphistry for visualization and graph-based RAG
  • Louie.ai as the analyst-facing chatbot interface

The embedding function we demoed at SuriCon processed rules in batches of 200, which hit the right balance between cost and throughput:

import pandas as pd
from langchain.embeddings import OpenAIEmbeddings

def embed_texts_in_batches(
    df: pd.DataFrame,
    text_column: str,
    output_column: str,
    api_key: str,
    model_name: str = "text-embedding-3-small",
    target_dim: int = 384,
    batch_size: int = 200
) -> pd.DataFrame:
    """
    Compute embeddings for Suricata rules in batches,
    trim to 384 dimensions for efficiency.
    """
    embedding_model = OpenAIEmbeddings(api_key=api_key, model=model_name)

    all_embeddings = []
    num_batches = (len(df) + batch_size - 1) // batch_size

    # Process rules in batches of batch_size (200 in the demo)
    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = start_idx + batch_size
        batch_texts = df[text_column].iloc[start_idx:end_idx].tolist()
        # embed_documents returns one full-length vector per text;
        # keep only the first target_dim dimensions
        batch_embeddings = [
            vec[:target_dim]
            for vec in embedding_model.embed_documents(batch_texts)
        ]
        all_embeddings.extend(batch_embeddings)

    df[output_column] = all_embeddings
    return df

The Lotus step mattered more than we initially expected. We used it to rewrite each Suricata rule with a natural language description before embedding it. A rule like:

alert tcp any any -> any 3389 (msg:"RDP Connection"; ...)

became:

"This rule detects Windows Remote Desktop Protocol (RDP)
 connections on port 3389, which could indicate..."

Embedding the enriched description alongside the raw rule gave the vector model semantic signal that a bare rule keyword field simply does not have. This meaningfully improved retrieval quality on conceptual queries.
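The shape of that enrichment step is simple to sketch. Here `describe_rule` is a stand-in for the Lotus/GPT-4o rewrite call, not Lotus's actual API; any callable that maps a rule to a description fits:

```python
def enrich_for_embedding(raw_rule: str, describe_rule) -> str:
    """
    Combine a natural-language description with the raw Suricata rule
    so the embedding carries semantic signal (what the rule detects),
    not just syntax (how the rule is written).

    describe_rule: callable mapping raw_rule -> description; in our
    pipeline this was the Lotus + GPT-4o rewrite step (hypothetical
    signature shown here for illustration).
    """
    description = describe_rule(raw_rule)
    # Embed description and raw rule together as one text
    return f"{description}\n\nRule: {raw_rule}"
```

Because the description and the rule text are embedded as a single document, a conceptual query like "remote desktop access" lands near the rule even though the rule body only says `port 3389`.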

Querying semantically

Once rules were embedded, we could run k-nearest neighbor search to find semantically similar rules:

import requests

def knn_search(query_text: str, k: int):
    """k-NN search in the vector index via the OpenSearch search API."""
    # Embed the query with the same model (and 384-dim trim) used at index time
    query_vector = vectorize_texts([query_text])[0]

    # OpenSearch k-NN queries are keyed by the name of the knn_vector
    # field in the index mapping ("embedding" here)
    search_query = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k
                }
            }
        }
    }

    response = requests.post(
        f"{OPENSEARCH_HOST}/{INDEX_NAME}/_search",
        json=search_query
    )
    response.raise_for_status()
    return response.json()["hits"]["hits"]

That enabled four things that prompt engineering alone could not do. Analysts could search by intent — "show me rules for lateral movement" — and get back RDP, WMI, and PSExec rules rather than keyword matches. Rules with similarity above 0.95 were flagged as likely redundant. Sparse regions in the embedding space revealed coverage gaps. And the retrieved rules gave the LLM concrete, organization-specific examples to work from when generating new ones.
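The redundancy check reduces to pairwise similarity against a threshold. A self-contained sketch with toy vectors (the real pipeline reads embeddings back from the index rather than holding them in a list):

```python
import itertools
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flag_redundant(rules, threshold=0.95):
    """
    rules: list of (rule_id, embedding) pairs.
    Returns pairs of rule IDs whose embeddings are nearly identical,
    i.e. likely duplicate detections worth a human review.
    """
    flagged = []
    for (id_a, vec_a), (id_b, vec_b) in itertools.combinations(rules, 2):
        if cosine(vec_a, vec_b) > threshold:
            flagged.append((id_a, id_b))
    return flagged
```

The O(n²) pairwise loop is fine for a sketch; at library scale you would run the same check as k-NN queries against the index instead.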

The Graphistry layer

This is where the partnership with Graphistry added something qualitatively different from standard RAG.

Traditional RAG retrieves the top-k most similar documents. That is useful, but security data has relationships that a flat similarity ranking does not capture. Rules fire on related protocols. Threats target related assets. Incidents involve clusters of indicators that share structure even when they do not share keywords.

Graphistry built a graph-based RAG on top of the vector index: rules as nodes, similarity scores as edges, structured as a graph that analysts can query and explore. Rather than getting a ranked list of five similar rules, you get the cluster those rules belong to and can navigate the edges to related clusters.
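The structure is easy to sketch without Graphistry itself: keep an edge wherever similarity clears a threshold, and connected components become the clusters an analyst navigates. The threshold and data shapes below are illustrative, not the production configuration:

```python
from collections import defaultdict

def build_rule_graph(similarities, threshold=0.8):
    """
    similarities: iterable of (rule_a, rule_b, score) triples.
    Returns an adjacency map with an edge wherever the similarity
    score clears the threshold.
    """
    graph = defaultdict(set)
    for a, b, score in similarities:
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clusters(graph):
    """Connected components of the graph = clusters of related rules."""
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:  # iterative DFS over the component
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(graph[n] - component)
        seen |= component
        components.append(component)
    return components
```

In the actual system this graph lives on top of the vector index and is rendered in Graphistry, so the analyst explores it visually rather than as adjacency sets.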

The Louie.ai demo

At SuriCon we demoed this through Louie.ai, Graphistry's chatbot interface for this architecture. The demo flow was intentionally concrete:

Analyst: "Hey Louie, get 100 events to do with SSH traffic in bnym-rag-index-d384"

Louie: [Retrieves events from vector index, shows similar patterns]

Analyst: "Great, now do a umap on column rule"

Louie: [Creates UMAP visualization in Graphistry showing rule clusters]

The UMAP visualization showed rules clustered by semantic similarity. Tight clusters flagged redundant rules. Outliers surfaced rules that did not fit any pattern. Sparse regions of the embedding space showed where coverage was thin. Seeing those clusters convinced skeptical security teams in a way that a spreadsheet comparison would not. That was the point.


What this actually enables

Three specific use cases drove the work:

Alert comprehension for junior analysts. When a junior analyst sees "ALERT: SURICATA App Layer Protocol tcp," the current answer is often to ask someone who knows. With RAG, the system retrieves similar historical alerts and relevant documentation and gives a grounded explanation of what the alert usually means, what related incidents have looked like in the past, and what the recommended response steps are based on prior decisions. The answer is still coming from a model, but the evidence behind it is coming from your own history.

Rule generation with existing coverage in view. Before RAG, an analyst writing a rule for Windows Remote Desktop connections had to manually check whether similar rules already existed. With RAG, the system retrieves existing RDP-related rules before generating anything, references them explicitly by ID in the output, and produces a new rule that complements rather than duplicates existing coverage.
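That workflow can be sketched as a retrieve-before-generate gate. The `retrieve` and `generate` callables, their return shapes, and the 0.95 dedupe threshold are illustrative assumptions, not the production implementation:

```python
def generate_with_coverage_check(query, retrieve, generate, dedupe_threshold=0.95):
    """
    Retrieve existing coverage first; only ask the model for a new rule
    if nothing close already exists, and always hand it what does.

    retrieve: callable returning [(rule_id, rule_text, similarity), ...]
              sorted by similarity, descending (assumed shape).
    generate: the LLM call, prompt string in, rule text out.
    """
    hits = retrieve(query)

    # Near-duplicate already on the books: point at it instead of generating
    if hits and hits[0][2] >= dedupe_threshold:
        rule_id, rule_text, sim = hits[0]
        return f"Covered by existing rule {rule_id} (similarity {sim:.2f}): {rule_text}"

    # Otherwise generate, with existing coverage cited by ID in the prompt
    context = "\n".join(f"[{rule_id}] {text}" for rule_id, text, _ in hits)
    return generate(
        f"Existing related rules:\n{context}\n\n"
        f"Write a Suricata rule for: {query}. "
        "Complement, do not duplicate, the rules above; cite their IDs."
    )
```

The gate is the point: the expensive generation step only runs when retrieval has confirmed there is a genuine gap.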

Continuous enrichment pipeline. The longer-term workflow we described was automated: ingest new threat intel and logs, embed and index them automatically, and surface recommendations via RAG queries as analysts investigate. The manual knowledge management burden shrinks because the index stays current without requiring someone to curate it.

What the scaling actually looked like

One of the more concrete findings from the research was how the compute requirements changed under optimization.

We tested the approach on a dataset of 1 billion vectors covering threat intel, OSINT, and logs. The naive implementation required 12,288 GB of RAM. After three optimizations (shorter vectors at 384 instead of 1,536 dimensions, quantization from float32 to float16, and keyword tuning to remove high-volume, low-signal data) the requirement dropped to 128 GB, a roughly 100x reduction.
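The per-vector arithmetic behind the first two optimizations is easy to verify; how the remaining factor splits off to the keyword-tuning step is my decomposition of the talk's end-to-end numbers, not a measured figure:

```python
# Bytes per stored vector before and after the representation changes
naive_bytes_per_vec = 1536 * 4   # 1,536 dims at float32 (4 bytes) = 6,144 bytes
opt_bytes_per_vec = 384 * 2      # 384 dims at float16 (2 bytes)  =   768 bytes

# Shorter vectors (4x) times quantization (2x) = 8x per vector
per_vector_factor = naive_bytes_per_vec / opt_bytes_per_vec

# End-to-end factor from the talk's RAM figures
overall_factor = 12_288 / 128

# The remainder is attributable to keyword tuning shrinking the dataset
data_pruning_factor = overall_factor / per_vector_factor
```

So of the roughly 100x, about 8x came from how each vector is stored and about 12x from storing fewer vectors in the first place.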

For inference, we tested self-hosted embedding using a Triton inference server, which delivered about 20 sentences per CPU per second. The point was not that this configuration is right for everyone; it is that production RAG does not require massive infrastructure if you engineer the storage and retrieval thoughtfully.

What we learned

Semantic enrichment before embedding was the decision with the highest leverage. The Lotus rewrite step made the vectors reflect what rules detect rather than just how they are written. The k-NN queries became meaningfully more useful after that change.

Context in the prompt mattered more than model size. GPT-4o with retrieved relevant examples outperformed a larger model without them. The retrieval quality is the variable that controls output quality at the generation step; the model is mostly interpolating across what you hand it.

Prompt personas affected output in ways that were easy to underestimate. Framing the system role as "detection engineer" versus leaving it generic produced noticeably different rule quality and explanation style. This is not a trick; it is the model operating with a more constrained prior about what good output looks like.

Graph visualization was operationally important, not just aesthetically useful. The UMAP clusters made rule relationships tangible to people who were not going to read a similarity matrix. That tangibility is what moved the conversation from "interesting research" to "how do we run this in production."

This was research and exploration, not a production deployment. But it proved the concept in the direction that matters: RAG can change how security teams interact with their existing knowledge, and the optimization path to production scale is tractable.

Resources

Co-presented by Anthony G. Tellez, SVP at BNY, and Leo Meyerovich, CEO of Graphistry.


This post reflects research and exploration conducted in my role at BNY Mellon in partnership with Graphistry. All technical details shared are based on publicly available information from the SuriCon 2024 presentation.