Building an Operational Machine Learning Organization from Zero
When I joined BlockFi in April 2021 as the first Head of Machine Learning, there was no ML organization to speak of. There was data, there were smart people, and there were problems worth solving with machine learning. What there was not: a team, a stack, a process, or a roadmap. This post, adapted from a presentation at Databricks Data & AI Summit 2022, covers what that build looked like — the organizational questions, the technology choices, and the two substantive problem domains where ML ended up mattering most: blockchain security and platform stability.
Our Journey with Databricks
Building a Cross-Functional ML Team
The first question was never about models. It was about who was going to build them and what they were going to work on. Recruiting data scientists and ML engineers into a crypto fintech in 2021 was competitive, and building a cross-functional team meant solving for more than headcount. We needed collaboration frameworks that connected technical work to business outcomes, shared goals across teams that historically operated independently, and a way to bridge the gap between the people who understood the data and the people who owned the problems.
Getting that bridge right took longer than any individual hiring decision. A data scientist who cannot communicate with a compliance or operations stakeholder is only marginally more useful than no data scientist. A business team that thinks ML is magic, or a black box they hand problems to, will scope projects badly and lose confidence the first time a model underperforms. Building the org meant building those working relationships as explicitly as we built the technical infrastructure.
Scoping Business Problems for Executive Buy-In
The honest reality of building ML from zero inside an existing company is that you are always working on two parallel tracks: doing actual ML work and justifying the continued investment in doing actual ML work. Executive buy-in is not a one-time gate you pass through. It is an ongoing credibility-building exercise.
What worked was identifying high-impact use cases early, quantifying potential ROI in terms that leadership already cared about, and then demonstrating quick wins before proposing anything with a long time horizon. Credibility compounds. The first model you deploy and can point to — even a simple baseline that shows measurable lift — makes the conversation about the second project substantially easier. The teams that struggle with executive buy-in are usually trying to run the strategic vision conversation before they have proof points. The order matters.
Conveying a Strategic Vision
Once there was credibility from early deliverables, the conversation about longer-term strategy became tractable. A machine learning roadmap for a financial services company in 2021 required technology stack decisions, build versus buy tradeoffs, and a plan for how ML would integrate with the systems the company already ran. Databricks was central to that stack decision — it gave us a unified analytics platform that handled data engineering, model training, and orchestration in one place rather than stitching together a custom infrastructure. That choice simplified every subsequent infrastructure conversation because teams shared a common environment rather than maintaining parallel pipelines.
The strategic vision also had to account for what we were not going to build. Build versus buy decisions in ML are not just about cost; they are about where your team's time goes. Buying a solution for a commodity problem frees engineers to work on the parts of the problem that are actually specific to your business.
Operationalizing ML & Data Science
A model that runs in a notebook is not an operational capability. Getting to production means MLOps infrastructure: model lifecycle management, versioning, reproducibility, monitoring, and the workflows that allow models to improve over time without manual intervention for every update.
At BlockFi, operationalizing ML required thinking carefully about monitoring and observability. Financial services ML has specific requirements around auditability that pure accuracy metrics do not capture. A fraud model that degrades gradually without anyone noticing is a significant business risk. Continuous improvement workflows needed to be paired with the ability to detect when improvement had stopped or reversed. The output of the ML organization was not models — it was maintained models, which is a different and more demanding thing.
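The kind of gradual, silent degradation described above is typically caught by comparing the score distribution a model sees in production against the distribution it was trained on. As a minimal sketch of that idea (not BlockFi's actual monitoring code), the Population Stability Index can be computed in a few lines; a PSI above roughly 0.2 is a common rule of thumb for "investigate this model":

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two score samples.
    Rule of thumb: PSI > 0.2 suggests meaningful distribution drift."""
    lo = min(expected + actual)
    hi = max(expected + actual)
    width = (hi - lo) / bins or 1.0

    def bin_fracs(values):
        counts = [0] * bins
        for v in values:
            i = min(int((v - lo) / width), bins - 1)  # clamp max value into last bin
            counts[i] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = bin_fracs(expected), bin_fracs(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Wiring a check like this into a scheduled job, with an alert when the index crosses a threshold, is the difference between "deployed" and "maintained."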
Building Clear Business Objectives
The unglamorous truth about applied ML is that most of the work is not algorithmic. It is figuring out exactly what you are trying to optimize and whether the thing you are measuring is actually the thing the business cares about. KPI definition, stakeholder management, and measuring business impact matter as much as model selection, often more.
The best ML projects we ran started with a clear business question that we then translated into a measurable target. The worst started from a capability — "we have this model class that works well for X" — and tried to find a business problem to apply it to. The direction of reasoning matters.
Blockchain Analytics for Security
Unique Security Problems in Crypto
Crypto financial services have a security profile that is genuinely different from traditional finance. The 24/7 trading environment means there is no overnight maintenance window, no period where transaction volume drops and your team can breathe. Transactions are irreversible. A fraud loss that you might recover from through chargebacks in a card-based system is simply gone on-chain. OFAC sanctions compliance required us to screen every transaction and every counterparty, with regulatory consequences for missing a sanctioned entity. And cross-chain tracking — following funds that move from Bitcoin to Ethereum to Solana through bridges and mixers — is an analytical problem that most traditional AML tooling was not designed for.
Estimated Costs
The cost framing for blockchain security work fell into three categories. Fraud losses were the most direct: funds gone, with no recovery mechanism. Account takeover affected customer trust and retention in ways that are harder to quantify but equally real — a customer who loses funds through account takeover is almost certainly never coming back. And regulatory fines for OFAC violations or AML compliance failures could dwarf the direct fraud losses. The expected value calculation for investing in detection infrastructure was not complicated.
Graph Theory and Blockchain Analysis at Scale
The technical approach that made blockchain analysis tractable at scale was graph analytics. The blockchain is a graph: addresses as nodes, transactions as edges. The question "is this address connected to a sanctioned entity" is a graph reachability problem, not a database query. Running that across tens of millions of addresses and hundreds of millions of transactions required purpose-built tooling.
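The core reachability question can be made concrete with a bounded breadth-first search. This is an illustrative sketch only — the function and field names are hypothetical, and at production scale this is exactly the traversal that gets pushed onto GPU via cuGraph rather than run in pure Python:

```python
from collections import deque

def exposure_hops(edges, start, flagged, max_hops=4):
    """Breadth-first search over a transaction graph: returns the minimum
    number of hops from `start` to any address in `flagged`, or None if
    no flagged address is reachable within `max_hops`."""
    adj = {}
    for src, dst in edges:  # build adjacency list from (sender, receiver) pairs
        adj.setdefault(src, []).append(dst)
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        addr, hops = frontier.popleft()
        if addr in flagged:
            return hops
        if hops == max_hops:
            continue  # stop expanding past the hop budget
        for nxt in adj.get(addr, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return None
```

The hop bound matters operationally: "within N hops of a sanctioned address" is a tunable risk policy, not a fixed property of the graph.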
Nvidia Rapids — specifically cuGraph — gave us GPU-accelerated graph algorithm execution. Connected components, PageRank, and shortest path algorithms that would run for hours on CPU finished in minutes on GPU. Apache Arrow served as the data interchange layer, eliminating the serialization overhead of moving data between the pipeline components. Graphistry handled visual graph exploration for the compliance analysts who needed to trace fund flows manually, not just read risk scores from a table. And Databricks orchestrated the overall pipeline — ingesting on-chain data from multiple blockchains, running the clustering and enrichment jobs, and writing to both the analytical store and the operational graph database.
The practical outcome was that investigations that previously took hours of manual SQL traversal could be completed in minutes. Speed matters in compliance work: regulatory timelines are real, and funds that continue moving while an analyst is mid-investigation are funds that become progressively harder to trace.
Onboarding Business Teams
One decision we made deliberately was not to keep the blockchain analytics capability inside the ML team. Building self-service access for the compliance and operations teams — through collaborative notebooks, training programs, and documented analytical workflows — meant the ML infrastructure multiplied its value beyond what a small ML team could generate alone. Knowledge sharing frameworks are an infrastructure investment, not overhead. The alternative, where every blockchain analysis question gets routed through the ML team, creates a bottleneck that limits the reach of the capability.
Using ML to Improve Platform Stability
Crypto Never Sleeps: 24x7 Trading
The other major problem domain was platform stability. BlockFi's users traded around the clock, across every time zone, without the natural demand troughs that most consumer-facing platforms can lean on for maintenance and scaling operations. Zero downtime requirements in that context are not aspirational. They are table stakes. An outage does not pause trading; it just means trades that should have happened did not, and customers who attempted them will not forget.
Cost of an Outage
The cost framing here was similar to the security work: lost trading revenue as the most direct impact, customer satisfaction degradation that was harder to quantify but visible in retention data, regulatory implications from service disruptions in a licensed financial services context, and competitive disadvantage in a market where users have alternatives. Framing platform stability as a business problem — not just an engineering problem — was what got the capacity forecasting work resourced appropriately.
Forecasting Techniques
Predicting infrastructure load in crypto is harder than predicting it for most consumer platforms because the load is correlated with market events, not just with user behavior patterns. Simple regression established a baseline — a useful sanity check and a fallback for periods where market volatility was low and usage patterns were relatively stable. Prophet handled the time-series forecasting with seasonal decomposition, capturing day-of-week and time-of-day patterns in a way that was interpretable and maintainable by people without deep forecasting expertise. SARIMAX gave us the statistical machinery for more complex seasonal patterns.
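The "simple baseline first" point deserves emphasis. Before Prophet or SARIMAX earn their complexity, a seasonal-naive forecast — predict each step as the value one full season earlier — sets the bar any richer model has to beat. A minimal sketch of that baseline (illustrative only, not the production forecaster):

```python
def seasonal_naive_forecast(history, period, horizon):
    """Baseline forecaster: predicts each future step as the value observed
    one full season earlier (e.g. period=24 for hourly data with a daily
    cycle). Cheap, interpretable, and a sanity check for fancier models."""
    if len(history) < period:
        raise ValueError("need at least one full season of history")
    # For horizons longer than one season, the last observed season repeats.
    return [history[len(history) - period + (h % period)] for h in range(horizon)]
```

If Prophet cannot beat this on held-out data, the extra machinery is not paying its way for that series and horizon.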
Each of those approaches involved real tradeoffs. More accurate models require more data, more computational resources, and more expertise to maintain. Interpretability matters in production: a model that a platform engineer cannot interrogate when it produces an anomalous prediction is a model that will not be trusted, which means it will not be acted on. We made explicit tradeoffs between accuracy, complexity, and interpretability at each stage, and the tradeoff was different for different forecasting time horizons.
Integrating Non-Traditional Indicators
The forecasting work that was specific to crypto, and not just a standard infrastructure capacity problem, was the integration of market signals as leading indicators for platform load. Market volatility predicted traffic spikes. A major price move would drive a surge in user activity within minutes, well before the infrastructure metrics registered the change. Social media sentiment provided some signal. On-chain network metrics — congestion on the Bitcoin or Ethereum networks, gas prices, transaction backlog — correlated with user behavior in ways that conventional infrastructure monitoring did not capture.
Using these non-traditional indicators as features in the forecasting models gave us earlier warning than any purely infrastructure-based approach would have. The intuition is straightforward: users respond to market events, market events are observable before users respond, and infrastructure load follows users. Shortening the observation-to-action loop was the point.
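As a hedged illustration of the leading-indicator idea — the function, thresholds, and data shapes here are hypothetical, not BlockFi's production feature pipeline — a volatility signal can be as simple as flagging price moves that are large outliers against their own recent history:

```python
import math

def volatility_alerts(prices, window=12, z_threshold=3.0):
    """Flag indices in `prices` where the latest absolute return is a
    z_threshold-sigma outlier versus the trailing `window` of returns.
    In a load-forecasting pipeline, a flag here would lead the traffic
    surge by the minutes it takes users to react to the price move."""
    returns = [abs(b - a) / a for a, b in zip(prices, prices[1:])]
    alerts = []
    for t in range(window, len(returns)):
        trailing = returns[t - window:t]
        mean = sum(trailing) / window
        std = math.sqrt(sum((r - mean) ** 2 for r in trailing) / window)
        if std > 0:
            if (returns[t] - mean) / std > z_threshold:
                alerts.append(t + 1)  # index into `prices` where the move landed
        elif returns[t] > mean:
            alerts.append(t + 1)  # flat trailing window: any move is anomalous
    return alerts
```

The production version fed signals like this into the forecasting models as features rather than raw alerts, but the shape of the logic — observe the market, anticipate the users — is the same.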
Operationalizing Predictions
A forecast that is not connected to action is just a number. Operationalizing the predictions meant building early warning systems that surfaced actionable signals before problems developed, integrating with observability tooling so platform engineers saw the ML-derived signals alongside their existing metrics, implementing automated alerting with enough lead time to actually respond, and connecting forecasts to capacity planning workflows that could trigger scaling actions in advance of predicted load.
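The forecast-to-action link can be sketched concretely. This is a simplified illustration under assumed names and units, not the actual capacity-planning integration: given a predicted load series and a known provisioning lead time, schedule scale-ups early enough that capacity is live before the load arrives.

```python
def scaling_actions(forecast, capacity, lead_time, headroom=0.8):
    """Turn a load forecast into advance scale-up requests: if predicted
    load at step t exceeds `headroom` * capacity, request the scale-up at
    t - lead_time so new capacity is live before the load arrives.
    `forecast` is a list of predicted load values, one per time step."""
    actions = []
    for t, load in enumerate(forecast):
        if load > headroom * capacity:
            actions.append({
                "act_at": max(t - lead_time, 0),  # never schedule in the past
                "predicted_step": t,
                "predicted_load": load,
            })
    return actions
```

Notice that lead time is a first-class parameter: an accurate forecast delivered later than the infrastructure can react to is operationally equivalent to no forecast.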
The integration with observability was the piece that required the most organizational work. ML outputs need to earn their place alongside signals that engineers already trust. A new signal in the dashboard that fires alerts no one acts on will be hidden within a week. Building that trust required demonstrating accuracy over time, calibrating thresholds to minimize false positive alerts, and making the basis for each alert understandable — not just "ML says scale up" but "here is the market signal that triggered this and here is what we historically saw follow it."
Presentation Materials
Presented at Databricks Data & AI Summit 2022
Related Articles
Graph Analytics for Blockchain Forensics: Tracing $252M in Suspicious Transactions
How we built a crypto-specific graph analytics framework at BlockFi using Nvidia Rapids, Apache Arrow, Graphistry, and Neo4j to trace and flag hundreds of millions in suspicious blockchain transactions.
How BlockFi Is Using Machine Learning To Take Crypto Safety to the Moon!
Showcasing BlockFi's use of Splunk and machine learning for cryptocurrency security, including anomaly detection, fraud identification, and graph analytics for blockchain analysis.
Rethinking NFT Security: What Standard Tokens Actually Provide
A technical look at the security gaps in standard ERC-721 and ERC-1155 tokens, what attack vectors they leave open, and the architectural approaches a 2022 provisional patent addresses for financial-grade digital asset security.