The LLM Stack for Financial Analytics: Foundation, Small, or Both

Published by Alec Vishmidt

AI in financial services has crossed the adoption threshold, but most deployments still stall between pilot and production. The strategic choice between foundation models, small language models, and hybrid architectures that combine them determines which survive their first audit, meet their latency budget, and scale beyond the pilot. This long-read walks through the models available today, the use cases for which each is worth the cost, the technical approaches behind production deployment, and the regulatory and governance work that often determines outcomes more than the architecture choice itself.


Executive Summary

As of early 2026, the financial domain stands at a critical point in AI adoption. 72% of organisations now use AI, and 65% regularly deploy generative AI—that’s nearly double from ten months prior. However, most firms are still piloting: only 38% of AI projects in finance meet or exceed ROI expectations, and 60% experience significant implementation delays.

The strategic choice between foundation models and small language models determines whether AI investments drive measurable value or become another line item in the AI budget quietly wound down next year. This decision affects deployment costs, latency, regulatory compliance, and ultimately, competitive advantage.

Foundation models—large pretrained networks with billions or trillions of parameters—excel at complex document analysis, multi-step reasoning, research synthesis, and investment memo generation. They process unstructured data natively, understand context across lengthy documents, and handle tasks that require broad reasoning.

Small language models for financial services (those with under 10 billion parameters) dominate where speed, cost, and control matter most. They are known for cost savings on high-volume sentiment classification, millisecond response times in fraud detection, and on-premise LLM deployment for banking.

With such a dynamic, multifaceted market, the choice is never binary. Both options can meet the criteria that financial organisations require, yet most companies end up needing hybrid architectures or at least substantial customisation. Increasingly, firms combine small models fine-tuned for the domain with techniques such as retrieval-augmented generation (RAG) for financial services.

Success requires matching computational requirements to business constraints and understanding where architectural differences actually matter. On top of that, there are governance frameworks to build—ones that satisfy regulators demanding transparency, explainability, and audit trails. As a result, business outcomes become central to choosing the right model.

In the financial implementations we have shipped at BN Digital, the architectural decision rarely determines success on its own. The bigger predictor is whether the organisation has done the data and governance work before choosing a model. This article takes that lens.

Introduction: The AI Transformation in Financial Services

AI in financial services is experiencing a reality check. The research shows that the number of organisations using AI jumped from 50% to 72% in 2024. More telling: 65% of companies regularly employed generative AI—that’s nearly double the percentage from ten months earlier. 

The gap between those two numbers is the story this article tells. Enterprise AI in finance hit a plateau in 2025, but that plateau marked a major cultural shift in AI adoption.

Most firms are stuck piloting their AI initiatives. The analysis of 540 financial service leaders found that 46% classified themselves as “pioneers” with high generative AI expertise. 

The gap between experimentation and implementation isn't closing fast enough. Only 22% of firms overall planned to fully scale 40% or more of their experimental projects over the next three to six months, compared with nearly half of the pioneers. In other words, if your team hasn't committed yet, you're falling further behind every quarter.

Strip away the hype, and language models do three things that are significant for financial analytics: 

  • They process unstructured data at scale.
  • They automate repetitive analytical work.
  • They surface insights humans would miss in massive document sets.

Financial institutions are exploring how language models can enhance processes such as analysing financial reports, automating customer service, detecting fraud, and conducting market sentiment analysis. Notice what’s missing from that list: replacing analysts. The technology augments existing work, just faster and with fewer errors.

Organisations deploying the technology already report cost decreases and revenue increases. Still, these gains cluster in specific functions—marketing and sales, product development, IT—where the work is defined enough for AI to handle but complex enough to deliver real value through automation.

Every financial institution now faces a choice that matters more than which vendor to select: foundation models or small language models for your analytics stack. Neither option is self-evidently correct.

In this article, we examine the strategic and technical dimensions of choosing between foundation models and small language models for financial analytics with AI. We’ll cover: 

  • The particularities of the foundational models available today.
  • The domain-specific LLMs in financial services. 
  • Use cases for each model type. 
  • Technical approaches to implementation. 
  • Compliance and regulatory considerations. 
  • Balancing emerging trends with strategic decisions. 

By the end, you’ll have a decision framework grounded in business outcomes rather than technology features. Because the question isn’t which AI is most impressive—it’s which approach creates measurable value within your constraints.

Understanding the Model Spectrum: Definitions and Architecture

Model terminology creates confusion before the planning work begins. There are three types of solutions with distinct capabilities and cost structures:

  • Foundation models (FMs);
  • Large language models (LLMs);
  • Small language models (SLMs).

While LLMs are broadly familiar thanks to the widespread use of tools such as ChatGPT, Perplexity, and Claude, the other two categories, foundation models and SLMs, tend to cause more confusion.

To begin with, the relationship between these models is hierarchical, not interchangeable. Every LLM is a foundation model, but not every foundation model is an LLM. This distinction affects deployment decisions.

Foundation models are large pre-trained networks designed to serve multiple tasks and modalities. LLMs are a specialised subset of foundation models trained exclusively on text with language-only objectives and token prediction.

When a model processes graphs alongside transaction records, it's operating as a multimodal foundation model. Pure LLMs can't manage this without additional architectural layers. But the architectural differences run deeper than the types of inputs:

  • LLMs use decoder-only transformer setups optimised for generative tasks (code generation, report writing, etc.).
  • Foundation models employ encoder or encoder-decoder structures suited for representation learning and discrimination tasks.

The output format depends entirely on training objectives: foundation models can return embeddings or segmentation masks, while LLMs produce text tokens.

Meanwhile, small language models are at the opposite end of the scale. Microsoft defines SLMs as "compact AI systems"—typically sized at just under 10 billion parameters, making them five to ten times smaller than LLMs. Gartner defines the practical SLM range more precisely. Enterprises evaluating these models for production use are exploring systems with 500 million to 20 billion parameters.

The reduced parameter count creates multiple advantages, including:

  • Faster inference;
  • Lower memory requirements;
  • Reduced computational needs;
  • Improved data privacy and security;
  • Easier and cheaper fine-tuning.

Foundation models might have billions or even trillions of parameters, whereas SLMs typically sit in the hundreds of millions to low billions. With fewer parameters to process, SLMs generate responses substantially faster than their larger counterparts. Research also shows that smaller models can match larger ones on narrow tasks while consuming far less energy. In practice, this means lower deployment costs and fewer latency constraints.

The training approach also differs across model types:

  • LLMs train on massive text corpora (e.g. Common Crawl, GitHub, and books), creating a rich understanding of syntax, semantics, and context.
  • Foundation models ingest broader data, such as image-caption pairs, videos with narration, and tabular data from financial systems. The result is more general features, but also more noise and unexpected behaviours.
  • SLMs can be fine-tuned cost-effectively through repeated sampling to achieve high accuracy in limited domains. It is particularly valuable where high accuracy matters more than broad general knowledge.

LLM versus SLM is not a technology contest. It is a constraint-matching exercise. Finance analytics doesn't always need trillion-parameter models. Sometimes the smaller, faster, cheaper option delivers better ROI—if you know where the architectural differences actually matter.

Domain-Specific Financial LLMs: The Current Landscape

A language model that understands earnings calls is engineered differently from one that writes sonnets. The realisation that each task demands distinct capabilities and architectures led to a wave of domain-specific models that rewrote the economics of AI deployment.

Bloomberg made the first serious bet on finance-specific language modelling at scale with the launch of its BloombergGPT in March 2023. It is a 50-billion-parameter model trained on a hybrid corpus combining general and financial data. The training corpus exceeded 700 billion tokens: 363 billion from financial documents pulled from Bloomberg's 40-year archive, combined with 345 billion from public data.

The model uses a decoder-only causal language model architecture, optimised for few-shot learning, text generation, and conversational systems. Reportedly, the custom model outperformed similarly sized general-purpose models on financial benchmarks while preserving general-language performance.

BloombergGPT remains proprietary, but open-source alternatives have since emerged. One of those is FinGPT.

Released in 2023, FinGPT demonstrated that finance-specific models don't require Bloomberg-scale budgets. The training cost was under £300 per run. The model fine-tunes open-source base models (LLaMA, ChatGLM) using approximately 50,000 samples and Low-Rank Adaptation techniques.

FinGPT's evaluation focused on classification tasks: sentiment analysis, news headline classification, and named entity recognition. The accuracy scores across some tasks exceed 85%.

The framework provides an end-to-end pipeline for training and deploying FinLLMs. All code and model weights are publicly available. Yet, FinGPT demonstrates performance limitations in generative tasks: its question answering and summarisation capabilities lag behind larger proprietary models.

The two models above are the most discussed, but far from the only ones available.

In August 2024, Open-FinLLMs was introduced—the first open-source multimodal financial language model suite, developed through academic collaboration at The Fin AI. Built on Meta's LLaMA 3 architecture, it addresses a gap that proprietary models couldn't fill: handling text, tables, time-series data, and charts simultaneously.

FinLLaVA represents the multimodal breakthrough. It adds a CLIP vision encoder and trains on 1.43 million image-text instructions covering charts, tables, and financial documents. This makes it the first open-source financial LLM capable of interpreting financial charts and tabular data without first converting them to text.

Another option is FinBERT—the domain-adaptation pioneer that arrived in 2019. It is one of the first finance-adapted language models, predating the current LLM wave. It was built by fine-tuning BERT on financial corpora and established that domain-specific adaptation improves performance on financial text.

FinBERT has been evolving through several releases and versions. These models range around 110 million parameters—substantially smaller than modern LLMs but effective for classification tasks. For example, FinBERT-20 achieved 87% F1 on Financial PhraseBank sentiment analysis, competitive with models 50–100 times its size.

FinMA, another domain-specific AI option, emerged from the PIXIU framework as an instruction-tuned approach to financial language modelling. It now exists in two variants: FinMA-7B and FinMA-30B, both fine-tuned from Meta's LLaMA base models.

The training approach differs from BloombergGPT's from-scratch methodology. This LLM uses instruction tuning on the Financial Instruction Tuning (FIT) dataset, which comprises 136,609 samples covering news classification, named entity recognition, question answering, sentiment analysis, and stock prediction.

Performance shows task-dependent results. Benchmarks for LLM sentiment analysis in trading and news headline classification reached 93.9% and 97.5%, respectively. The weaknesses appear in tasks requiring numerical reasoning or multimodal understanding.

The more interesting comparison is training cost versus performance.

Cost is where the spectrum becomes concrete. BloombergGPT's 1.3M A100 hours represents one extreme. FinGPT's sub-£300 training runs represent the other. FinMA occupies the middle ground—accessible to research institutions but beyond hobbyist budgets.

Performance scales poorly with cost. FinGPT matches FinMA-30B in sentiment analysis despite a 1,000-fold lower training cost. BloombergGPT's internal benchmarks show clear advantages, but the lack of public evaluation prevents independent verification.

In other words, sentiment analysis and news classification don't require Bloomberg-scale investment. Question answering and complex reasoning still favour larger proprietary models or expensive fine-tuning of open alternatives. This is something to consider for mid-market finance firms.

Use Cases: Where Foundation Models Excel

Foundation models in finance demonstrate clear advantages in four specific areas where traditional analytics fall short. Based on our AI implementation experience, the clearest fit for foundation models is the long tail—the 15–20% of finance work that resists standardisation. For the other 80%, trillion-parameter models are over-engineering, both commercially and operationally.

Complex document analysis.

Banks process thousands of pages daily—SEC filings, earnings transcripts, credit memos, and so on. A survey of 44 institutions found that 27% are piloting generative AI finance for synthesising information in credit decisioning, with similar adoption for credit memo drafting. The appeal is evident: foundational models handle unstructured text natively.

Research analysing 84 studies on LLMs in equity markets confirms that they extract contextualised representations from lengthy documents, such as 100+ page 10-Ks and quarterly reports, without separate preprocessing pipelines. Traditional models require manual feature engineering for each document type, but foundation models adapt through prompting alone.

The efficiency gain compounds across document types. Foundation models learn to understand document structure implicitly. This matters a lot in credit analysis, where relevant information is often buried in footnotes, MD&A sections, and supplementary schedules.

Multi-step reasoning and scenario analysis.

Financial analysis often requires connecting disparate pieces of information across time and context. A study on stock market forecasting reveals that LLMs handle temporal dependencies and comparative analysis that overwhelm traditional methods.

They understand causal relationships in financial data and can thus reason through "what-if" scenarios. This matters greatly in credit decisions. When evaluating a borrower, analysts must weigh historical performance, current market conditions, industry trends, and forward-looking indicators. Foundation models process these connections without explicit programming for each relationship.

Foundation models trace these interconnections through a form of financial NLP—natural language reasoning rather than hardcoded decision trees. Hence, they perform better when a question requires synthesis rather than calculation.

Investment memo generation and client communications.

Financial institutions spend significant time drafting reports, explanations, and recommendations. A survey shows that content generation use cases, including credit memo drafting and data assessment, rank among the most piloted applications across institutions.

Junior analysts spend hours formatting investment committee memos—restating analysis in standardised templates, ensuring consistency across sections, and tailoring technical details for different audiences. Foundation models handle these mechanical steps in seconds, leaving analysts to focus on the actual judgment calls.

The same capability extends to client communications. Explaining why a credit line was approved or declined requires translating quantitative analysis into accessible language. Foundation models draft these explanations, maintaining appropriate tone and detail level for different stakeholders. Compliance teams still review. Legal still approves. But the initial draft no longer consumes half a day.

Research synthesis and market intelligence.

Extracting signals from financial news, analyst reports, and social media requires understanding context and sentiment. Research demonstrates that LLMs outperform traditional sentiment analysis methods: ChatGPT achieved 35% better performance than FinBERT in forex sentiment analysis, with 36% higher correlation to actual market returns.

The comprehensive equity markets review found that ChatGPT-4-based trading strategies achieved 650% cumulative returns over 26 months when applied to news sentiment. Banks are already testing this. Institutions explore generative AI for early-warning systems and customer engagement, both of which are heavily dependent on synthesising information from multiple sources.

Traditional keyword-based alerts generate false positives, but foundation models understand which context matters for specific portfolios. They also catch relevant news that doesn't match obvious keywords—regulatory commentary that implies future policy shifts, or management statements that signal strategy changes without explicit announcements.

Use Cases: Where Small Language Models Shine

Small language models solve a different set of problems—ones where speed, cost, and control outweigh the need for broad reasoning capability.

High-volume sentiment classification.

Financial institutions process millions of transactions, news items, and customer interactions daily. Each requires categorisation, risk scoring, or sentiment assessment. Foundation models handle this, but the economics break down at scale.

The calculations make it apparent. A bank processing 100,000 API calls daily with an off-the-shelf LLM spends roughly $3.2 million annually at $0.09 per 1,000 tokens. The same volume processed by a 7-billion-parameter small model like Mistral-7B costs under $15,000 yearly at $0.0004 per 1,000 tokens.
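The arithmetic behind those figures can be reproduced in a few lines. The token volume per call is an assumption (roughly 1,000 tokens here); the per-1,000-token prices are the ones quoted above.

```python
# Back-of-envelope annual inference cost. Volumes and tokens-per-call are
# hypothetical; prices per 1,000 tokens come from the comparison above.

def annual_cost(calls_per_day: int, tokens_per_call: int, price_per_1k_tokens: float) -> float:
    """Annual spend in dollars for a given daily call volume."""
    daily = calls_per_day * tokens_per_call / 1_000 * price_per_1k_tokens
    return daily * 365

llm_cost = annual_cost(100_000, 1_000, 0.09)     # off-the-shelf LLM API
slm_cost = annual_cost(100_000, 1_000, 0.0004)   # self-hosted 7B small model

print(f"LLM: ${llm_cost:,.0f}/yr, SLM: ${slm_cost:,.0f}/yr, ratio: {llm_cost / slm_cost:.0f}x")
```

At these assumptions the large-model bill lands near $3.3 million a year against under $15,000 for the small model, matching the figures above.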

Moreover, small models trained on domain data handle financial terms more accurately than general LLMs. This precision reduces fines, audits, and reputational damage. When errors in contracts, transactions, or risk models cause losses and regulatory penalties, accuracy matters more than versatility.

Named entity recognition (NER) at scale.

Extracting counterparty names, instrument identifiers, regulatory references, and risk factors from documents is mechanical work that doesn't require deep reasoning. And small models excel here.

Consider contract analysis. One large American bank processes roughly 10,000 customer contracts monthly. An SLM streamlines drafting, verifies legal clauses, flags risks, runs compliance checks, and automates monitoring through to closure, all without cloud round-trips.

The advantage isn't only cost but also control. Small Language Models trained on a bank's own records trace every answer back to a known source. Nothing leaves the firewall, ensuring LLM compliance with GDPR, HIPAA, the EU AI Act, and various financial regulations.

On-device deployment and edge computing.

Latency kills certain applications. Payment authorisation, fraud detection at point-of-sale, customer service chatbots—these need millisecond response times that cloud round-trips can't deliver. So here's one more choice to make: cloud vs on-premise LLM for banking.

Small models with fewer than 14 billion parameters can be deployed on-premises or at the edge. They run on Intel Xeon servers or even high-end CPUs rather than specialised GPUs. At consistent GPU utilisation above 60-70%, on-premises small models save 30-50% compared with cloud LLMs over three years.

Moreover, small models can run locally on phones, tablets, and edge devices—devices with limited memory that can't support the foundation model's reliance on external network resources. Many SLMs operate offline, configured for greater security than cloud-based alternatives. It is perfect for field operations and resource-constrained environments.

Real-time fraud detection and transaction monitoring.

Fraud detection is the clearest illustration of why latency determines architecture. The difference between flagging a suspicious transaction in real-time versus three seconds later is the difference between stopping fraud and explaining losses.

A mid-size lender handling 10,000 loans monthly could train a mortgage-centric small model across the entire value chain—lead generation, loan origination, underwriting, servicing, and closure. The fine-tuned SLM handles intake classification, underwriting triage, document validation, and closure monitoring without round-tripping to a cloud API.

Small Language Models work standalone or integrate with other models, fulfilling complementary roles. Foundation models might handle complex reasoning for underwriting exceptions. Small models process 95% of routine decisions that follow established patterns and can handle tasks through the entire customer lifecycle.

Technical Approaches: Fine-Tuning, RAG, and Hybrid Architectures

When it comes to customising language models for finance, three technical paths dominate: parameter-efficient fine-tuning, retrieval-augmented generation, and hybrid combinations. Each solves different problems.

Parameter-efficient fine-tuning in finance is mainly associated with LoRA and QLoRA.

Low-Rank Adaptation (LoRA) updates a tiny fraction of model parameters—typically under 1%—while preserving pre-trained knowledge. Instead of retraining all weights in a neural network, LoRA freezes the base model and trains two smaller matrices that approximate weight updates.
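The mechanics can be sketched in a few lines of NumPy. The dimensions and scaling factor below are illustrative, not taken from any particular financial model:

```python
import numpy as np

# Minimal sketch of the LoRA idea: freeze W and learn a low-rank update
# B @ A, scaled by alpha / r. Dimensions here are illustrative.
d, k, r, alpha = 4096, 4096, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection (zero-init)

W_adapted = W + (alpha / r) * (B @ A)    # effective weight at inference

full_params = d * k
lora_params = d * r + r * k
print(f"trainable fraction: {lora_params / full_params:.2%}")  # 0.39%
```

With B initialised to zero, the adapted weight starts identical to the frozen base, and only the two small matrices receive gradients, which is where the sub-1% trainable-parameter figure comes from.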

LoRA fine-tuning can cost £12-15 using four NVIDIA A5000 GPUs over 14-16 hours—considerably cheaper and resource-efficient compared to foundation models or BloombergGPT.

Quantised LoRA (QLoRA) reduces memory requirements further by loading the frozen base model in 4-bit precision rather than the usual 16-bit.

Parameter-efficient fine-tuning of LLMs for finance works when you have:

  • Domain-specific datasets;
  • Repetitive tasks that require consistent terminology;
  • A need for control over model behaviour and auditability;
  • Specific accuracy thresholds to meet.

It falls short when the knowledge base changes frequently or when responses must reflect real-time data.

Retrieval-augmented generation (RAG) for financial data addresses LoRA's blind spots: static knowledge and cutoffs in training data. Foundation models freeze their knowledge during training. In finance, that's problematic. Regulations change. Market conditions shift. Q3 earnings from last week don't exist in a model trained six months ago.

RAG connects models to live data sources during inference. When a question arrives, the system retrieves relevant documents from a knowledge base, passes them into the model's context window, and generates a response grounded in current information.
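A minimal sketch of that flow, with a deliberately naive word-overlap retriever standing in for the embedding search a production system would use (the documents are hypothetical):

```python
# Toy sketch of the RAG flow: retrieve relevant passages, then ground the
# prompt in the retrieved text. Scoring here is word overlap, not embeddings.

def retrieve(query: str, documents: list[str], top_k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Pass retrieved passages into the model's context window."""
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return f"Answer using only the sources below.\n{sources}\nQuestion: {query}"

docs = [  # hypothetical knowledge-base entries
    "Q3 earnings rose 12% on higher net interest income.",
    "The new capital adequacy rule takes effect in January.",
    "Board approved a share buyback programme last year.",
]
context = retrieve("What drove Q3 earnings growth?", docs)
print(build_prompt("What drove Q3 earnings growth?", context))
```

Because the generated answer is grounded in retrieved passages with identifiable sources, this pattern is what makes source attribution and lower hallucination rates possible.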

RAG in finance works when you need:

  • Access to documents not in training data;
  • Source attribution and explainability;
  • Answers grounded in specific, verifiable text;
  • Lower hallucination rates than pure generation.

Yet, it struggles when:

  • Queries require calculations or counting;
  • Answers demand synthesis across many document sections;
  • Response time must be measured in milliseconds (retrieval adds latency);
  • Questions need multi-step reasoning.

Optimal performance is achievable by combining SLMs with RAG. A domain-specific, low-latency small model works as the generator while RAG fetches real-time compliance rules.

With this, a global bank might fine-tune a small model on mortgage documentation to understand financial terminology, then augment it with RAG to access current interest rates and regulatory guidelines. A mortgage SLM trained on proprietary underwriting data handles the specialised language. RAG supplements with country-specific regulations that change too frequently to bake into the model.

The alternative, RAG plus LLM fine-tuning, takes a foundation model, calibrates it using bank-specific contract processes, then adds RAG for customer details from live databases. This approach balances specialised knowledge with dynamic access to information.

The pattern we see most often in mid-market banks looks like this: a fine-tuned 7-billion-parameter model handles classification and extraction on-premise, RAG layers current regulatory content on top, and a foundation model is reserved for the 5–10% of queries that genuinely require multi-document reasoning. The work is in deciding which queries route where, and proving out that routing logic to a compliance auditor.
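That routing logic can start as simply as the hypothetical sketch below. The escalation keywords and document-count threshold are illustrative placeholders; a production router would be tuned and audited against real query logs:

```python
# Hypothetical query router for the hybrid pattern described above: routine
# work goes to the on-premise SLM+RAG tier, genuinely cross-document
# questions escalate to a foundation model. Thresholds are illustrative.

ESCALATION_SIGNALS = ("compare", "across", "reconcile", "synthesise", "why")

def route(query: str, num_source_docs: int) -> str:
    """Return the tier that should serve this query."""
    needs_reasoning = any(s in query.lower() for s in ESCALATION_SIGNALS)
    if needs_reasoning or num_source_docs > 3:
        return "foundation-model"   # the 5-10% multi-document tail
    return "slm-with-rag"           # routine classification/extraction

print(route("Classify the sentiment of this headline", num_source_docs=1))
print(route("Compare covenant terms across these five loan agreements", num_source_docs=5))
```

The value of making the routing explicit is exactly the auditability point above: every decision about which model answered which query is a logged, inspectable rule rather than an opaque default.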

A cost comparison of small and large language models reveals clear patterns.

  • For high-volume, narrow tasks, fine-tuned small models work best. Sentiment classification, entity recognition, and document categorisation don't need broad reasoning or current data.
  • For complex document analysis, RAG demonstrates the best results. Proxy statements, 10-Ks, and regulatory filings contain information that wasn't in training data and require precise extraction with source citations.
  • For professional-level reasoning, hybrid architectures combine strengths. Pure RAG might retrieve the right sections, but it struggles to apply formula construction logic.

Mid-market institutions usually land here: fine-tune a small model for domain language, layer RAG for current data, deploy on-premises to control costs. Large institutions might justify foundation model APIs for breadth. Small firms with narrow use cases choose pure fine-tuning. But most financial institutions need both capabilities working together.

The Business Case: Cost, Latency, and ROI Considerations

The finance sector faces a particular tension. AI models must produce responses for trading decisions in milliseconds while processing nuanced regulatory language for compliance. These competing demands force companies to weigh trade-offs that directly affect their capabilities.

When it comes to training and inference cost, building a model from scratch remains out of reach for most organisations.

Training a GPT-3-equivalent model requires up to $12 million, and that figure excludes the latest architectures. Bloomberg's domain-specific GPT model consumed between $2.67 million and $10 million in training costs alone, using 1.3 million GPU hours across 53 days.

Pre-trained models shift the cost structure. Fine-tuning and inference become the primary expenses, measured in tokens processed. A mid-sized trading desk with 300 analysts making five daily requests totals $2,835 monthly.

The total cost of ownership calculation breaks along a consistent line in the projects we've worked on: above roughly 50,000 inference calls per day, fine-tuned small models on dedicated infrastructure begin beating API-only foundation models within 12 to 18 months. Below that volume, the economics favour APIs and the operational overhead of self-hosting rarely justifies itself. The threshold shifts with model size and GPU pricing, but the curve's shape remains stable.
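A rough breakeven model shows the shape of that curve. The monthly self-hosting cost and per-call prices below are placeholder assumptions chosen only to reproduce the order of magnitude discussed above, not vendor pricing:

```python
# Breakeven sketch: a fixed monthly self-hosting cost (GPUs + ops) versus a
# pure per-call API cost. All figures are placeholder assumptions.

def breakeven_calls_per_day(selfhost_monthly: float,
                            api_cost_per_call: float,
                            selfhost_cost_per_call: float) -> float:
    """Daily call volume above which self-hosting is cheaper per month."""
    per_call_saving = api_cost_per_call - selfhost_cost_per_call
    return selfhost_monthly / (per_call_saving * 30)

# e.g. $120k/month fixed infrastructure, $0.09 vs $0.01 per call (assumed)
threshold = breakeven_calls_per_day(120_000, 0.09, 0.01)
print(f"breakeven ≈ {threshold:,.0f} calls/day")
```

Under these assumed inputs the crossover lands around 50,000 calls per day; shifting any of the three inputs moves the threshold, but the curve keeps the same shape.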

The gap narrows when fine-tuning enters the equation. Large models handle general financial queries without additional training. Smaller models require dataset-specific calibration to match accuracy expectations. Still, fine-tuning costs scale with dataset size.

Another aspect is balancing latency requirements and analysis. Speed determines viability in different contexts. Algorithmic trading systems operate on microsecond timescales, but portfolio analysis and compliance reviews operate on seconds.

Large models activate their full parameter set with every query. Mixture-of-Experts architectures attempt to solve this. But even optimised large models rarely beat compact alternatives on raw inference speed.

As for the total cost of ownership, companies need to account for more than API integrations. Infrastructure, governance, and update cycles compound over time.

Large models demand specialised hardware even when accessed via API. Organisations targeting NVIDIA V100 or A100 GPUs spend up to $10,000 per processor. Smaller models run on standard enterprise servers or even edge devices, enabling greater hardware flexibility that reduces costs and shortens deployment timelines. Energy consumption and AI model governance overhead in banking follows the same pattern.

Meanwhile, boards want concrete evidence that AI investments generate returns. So far, the following three metrics are the most relevant and illustrative:

  • Cost efficiency: compares infrastructure spend, licensing fees, and training expenses between large and small models.
  • Speed to value: measures time from pilot to production.
  • Operational impact: tracks changes in manual work, decision turnaround time, and compliance accuracy.

Organisations should also measure governance readiness—the number of compliance checks passed and audit readiness scores. Quantifying risk reduction strengthens the business case when presenting to risk committees or external auditors.

Regulatory Compliance, Privacy, and Risk Management

Innovation is outpacing regulation. The financial sector operates under strict regulatory frameworks that predate AI by decades. Hence, banks can't simply deploy a model and iterate based on user feedback.

Most AI projects in finance do not fail on the modelling. They fail on the governance work that surrounds it. Regulators do not care how impressive the architecture is—they care whether a decision can be reconstructed, attributed to a specific model version, and traced back to the person who approved the change that produced it. When we build custom AI solutions, we treat this scaffolding as the first deliverable, not the last. The model follows the governance, not the other way around.

Every system that touches customer data, credit decisions, or trading operations faces scrutiny from multiple regulators that require documentation, explainability, and audit trails. Most AI vendors do not build for this by default.

For example, the EU AI Act classifies financial applications as high-risk systems. It triggers mandatory requirements. To name a few:

  • Transparency documentation;
  • Human oversight mechanisms;
  • Risk management procedures;
  • Ability to explain individual decisions to regulators.

GDPR adds another layer. Financial institutions must obtain explicit user consent before processing personal data through AI systems. Models can use data only for legitimate, predefined purposes, and organisations must demonstrate data minimisation—collecting no more information than necessary for the stated task.

Violating GDPR can cost up to 4% of global annual revenue. Figures of that size get board attention. For a mid-sized European bank with €2 billion in revenue, a single compliance breach could trigger an €80 million fine before accounting for reputational damage or customer attrition.

Data privacy considerations shape every aspect of the implementation. Data localisation rules determine where customer information can physically reside. In numerous jurisdictions, financial data must remain within national borders.

Public cloud services offer speed and scale. But sending sensitive transaction records or client portfolios to external providers introduces risks that compliance teams can't always accept.

On-premises deployment flips the trade-offs. Organisations maintain complete control over data flows and can prove to regulators that customer information never leaves internal systems. But that control comes with significant infrastructure costs and operational complexity.

Hybrid arrangements attempt to balance these constraints:

  • High-sensitivity functions—credit underwriting, wealth management, regulatory reporting—run on-premises or in private clouds.
  • Lower-risk tasks—general market analysis, anonymised trend identification—can use more economical public cloud services.
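
A sensitivity-based router for this split can be surprisingly small. The task categories and policy below are illustrative assumptions, not a regulatory standard:

```python
# Minimal sketch of sensitivity-based routing between deployment targets.
ON_PREM = "on-prem"
PUBLIC_CLOUD = "public-cloud"

# High-sensitivity functions always stay inside the perimeter.
HIGH_SENSITIVITY = {"credit_underwriting", "wealth_management", "regulatory_reporting"}

def route(task: str, contains_pii: bool) -> str:
    """Route high-sensitivity or PII-bearing work to on-prem infrastructure."""
    if task in HIGH_SENSITIVITY or contains_pii:
        return ON_PREM
    return PUBLIC_CLOUD

print(route("credit_underwriting", contains_pii=False))  # on-prem
print(route("market_analysis", contains_pii=False))      # public-cloud
```

In practice the PII flag would come from a data-classification service rather than a caller-supplied boolean, but the routing decision itself stays this simple.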

Financial regulators demand the ability to reconstruct how models reach decisions. That requirement creates immediate friction with large language models in finance, which operate as black boxes that obscure their reasoning processes.

Institutions are shifting toward adaptive governance strategies. These emphasise continuous monitoring and iterative validation post-deployment rather than one-time pre-launch approval processes.

Organisations must maintain records showing which dataset version trained each model iteration, which prompts generated which outputs, and who approved each change—for both internal oversight and external audits.
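
A minimal sketch of such a lineage record follows; the field names are hypothetical and would map onto an institution's own MLOps metadata store:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class LineageRecord:
    """One audit entry linking an output to its model version, data, and approver."""
    model_version: str
    dataset_version: str
    prompt: str
    approved_by: str

    def prompt_hash(self) -> str:
        # Store a hash rather than raw prompt text when prompts may contain PII.
        return hashlib.sha256(self.prompt.encode()).hexdigest()[:16]

rec = LineageRecord("credit-slm-1.4", "loans-2025-q3",
                    "Summarise applicant risk factors.", "j.doe")
# The row written to the audit log keeps the hash, not the prompt itself.
audit_row = {**asdict(rec), "prompt": rec.prompt_hash()}
```

The frozen dataclass makes records immutable once written, which is the property auditors actually check for.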

LLMs produce plausible-but-wrong output—hallucinations—at non-zero rates. In finance, a single fabricated risk factor or invented regulation can propagate into real losses. To address potential LLM hallucination risk in finance, organisations must:

  • Train models on relevant data;
  • Anchor outputs to verified source documents;
  • Control the noise in training data;
  • Run regular bias audits.
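
Anchoring outputs to verified sources can start with a crude lexical check: flag any figure in the model's answer that appears in no retrieved document. Production systems pair retrieval with model-based verification; this sketch only illustrates the principle:

```python
import re

def unsupported_figures(answer: str, sources: list[str]) -> list[str]:
    """Flag numbers in the model's answer that appear in no source document.
    A deliberately crude substring check, for illustration only."""
    corpus = " ".join(sources)
    numbers = re.findall(r"\d+(?:\.\d+)?%?", answer)
    return [n for n in numbers if n not in corpus]

sources = ["Q3 revenue rose 12% to EUR 4.1 billion."]
answer = "Revenue grew 12%, driven by a 30% jump in trading income."
print(unsupported_figures(answer, sources))  # ['30%'] -- the fabricated figure
```

Any flagged figure blocks the output from reaching a user until a human or a stronger verifier clears it.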

Organisations should establish formal incident response plans defining escalation pathways, isolation protocols, and rollback strategies. When a model produces unreliable outputs, teams need predetermined procedures to contain the damage quickly.

AI is moving from prompted response to autonomous action. This transition from assistive tools to autonomous agents changes how banks think about automation, governance, and competitive advantage. It is predicted that by 2028, 33% of enterprise software will include agentic AI, up from less than 1% in 2024.

Agentic AI doesn't retrieve data the way traditional AI systems do. It doesn't wait for instructions—it plans multi-step tasks, adapts based on real-time information, and coordinates across functions with minimal human oversight. Agentic AI rewrites operational flows in trading, compliance monitoring, and portfolio rebalancing.

Agentic AI is autonomous, adaptive, and coordinated. These qualities free teams to focus on high-value analytics rather than routine queries. Some implementations show that AI agents can take on 60% more tasks while cutting completion time by 30%.

Architecture is shifting too. Single-agent systems handle discrete tasks. Multi-agent architectures coordinate several specialised subsystems, each focused on its own domain: market analysis, risk calculation, transaction execution. Working together, they solve complex problems that no individual agent could manage alone.

There are three main architectural patterns for multi-agent systems in finance.

  • The workflow pattern uses sequential processing where each agent completes its task before passing results to the next.
  • The swarm pattern enables collaborative reasoning through distributed information sharing.
  • The graph pattern employs hierarchical structures where a coordinator agent delegates tasks to specialised subordinates.
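
The workflow pattern is the simplest to sketch: each agent is a function that transforms shared state and hands it to the next. Agent names and decision logic below are illustrative:

```python
from typing import Callable

State = dict  # shared state passed along the pipeline

def market_analysis(state: State) -> State:
    state["signal"] = "bullish" if state["price_change"] > 0 else "bearish"
    return state

def risk_check(state: State) -> State:
    # Approve only bullish signals within the exposure limit.
    state["approved"] = state["signal"] == "bullish" and state["exposure"] < 0.1
    return state

def execute(state: State) -> State:
    state["action"] = "buy" if state["approved"] else "hold"
    return state

# Sequential processing: each agent completes before the next starts.
pipeline: list[Callable[[State], State]] = [market_analysis, risk_check, execute]

state: State = {"price_change": 0.03, "exposure": 0.05}
for agent in pipeline:
    state = agent(state)
print(state["action"])  # buy
```

Swarm and graph patterns replace the linear list with shared memory or a coordinator that dispatches to subordinates, but the agent-as-function core stays the same.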

58% of leading CEOs expect AI to have a transformative impact on security and risk management. Autonomous fraud detection is among the highest-value applications for agentic systems, and so is compliance monitoring. In both cases, AI agents maintain oversight of data volumes that no human team could process and cross-match.

Finally, the role of SLMs in agent architectures becomes clearer. They serve as the execution layer—narrowly focused agents designed for specific tasks, with faster response times and lower resource requirements. This division of labour allows for scaling agent deployments without proportional increases in infrastructure costs.

Strategic Decision Framework: Choosing the Right Model

Finance leaders face a practical question: which model architecture fits the problem we're trying to solve? Getting this decision right determines whether AI delivers measurable value or becomes another expensive experiment. A structured AI implementation strategy matters more than any single model choice.

The decision matrix for model selection should include four factors:

  • Task complexity—it determines baseline requirements. Simple repetitive tasks don't require trillion-parameter models. Yet domain specificity matters: financial institutions need models that understand sector jargon, regulatory language, and market terminology. General-purpose models trained on broad internet datasets lack this depth unless extensively fine-tuned.
  • Latency requirements—they shape architecture choices. Trading systems operating on microsecond timelines can't wait for multi-second inference from large models. Meanwhile, compliance workflows can tolerate slightly longer processing times if accuracy improves.
  • Data privacy requirements—they often override performance considerations. Sometimes, on-premises deployments are mandatory regardless of model size, even for cases where cloud-based large models would perform better.
  • Cost structure—it differs significantly between architectures. Everything depends on the setup, fine-tuning, and long-term goals.
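
One way to operationalise the matrix is a weighted scorecard. The weights and 1-5 ratings below are hypothetical, sketched for a high-volume fraud-alert triage use case:

```python
# Hypothetical weights for the four factors above; they sum to 1.
WEIGHTS = {"task_complexity": 0.3, "latency": 0.25, "privacy": 0.25, "cost": 0.2}

def score(option: dict[str, int]) -> float:
    """Weighted sum of 1-5 ratings (higher is better for the buyer)."""
    return round(sum(WEIGHTS[k] * v for k, v in option.items()), 2)

# Illustrative ratings, not benchmark results.
foundation_api = {"task_complexity": 5, "latency": 2, "privacy": 2, "cost": 2}
on_prem_slm = {"task_complexity": 3, "latency": 5, "privacy": 5, "cost": 4}

print(score(foundation_api), score(on_prem_slm))
```

The point of the exercise is not the arithmetic but forcing stakeholders to commit to weights before vendors pitch.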

Single-model strategies rarely optimise for all constraints. Hybrid approaches suit companies that want to optimise the cost-quality trade-off or stay model-agnostic. Both goals lead organisations to select the best language model for financial analysis based on the requirements of each task.

When it comes to build-vs-buy considerations for LLMs in finance, the decision extends beyond cost comparisons alone. Organisations must evaluate capabilities, timelines, control requirements, and long-term flexibility.

Building custom solutions—whether training models from scratch or extensively fine-tuning open-source alternatives—requires significant internal expertise. Buying commercial solutions—either through API access to hosted models or licensed software—shifts responsibilities to vendors.

The vendor landscape currently splits into several categories, each serving different needs and deployment models.

Foundation model providers—OpenAI, Anthropic, Google, Meta—offer access to frontier large models through APIs or open-source releases. They invest heavily in research and training for the base model. Financial institutions use these platforms for broad reasoning tasks, complex analytics, and applications where generalisation matters more than specialisation.

Specialised financial AI vendors build domain-specific solutions on top of foundation models, but at additional expense that often puts them beyond the budget of smaller institutions.

Open-source communities provide alternatives to commercial offerings. Models like Llama, Mistral, Phi, Gemma, and IBM Granite give companies the flexibility to deploy on-premises, fine-tune extensively, and avoid vendor dependencies. These are practical options for companies with strong internal technical teams.

Microsoft's Phi-3 Mini exemplifies the specialised SLM category. Platform providers like Domino Data Lab and Hugging Face, as well as cloud infrastructure vendors, offer tools that simplify model deployment, monitoring, and lifecycle management.

Given all of the above, the decision isn't binary—large versus small, build versus buy. It is how those options combine.


Conclusion and Future Outlook

The choice between foundation and small language models isn't about picking superior technology. It's about matching capabilities to constraints. To handle this, companies need to consider a multitude of factors. Task complexity, latency requirements, data privacy mandates, and cost structures—each pulls organisations toward different architectures.

Meanwhile, successful implementations demonstrate three patterns:

  • Simple repetitive tasks requiring precision favour fine-tuned on-premise SLMs.
  • Complex reasoning across diverse data sources justifies the use of foundation model APIs.
  • Most production environments need hybrid approaches to balance investments and outcomes.

The model landscape is converging. Foundation models become more efficient through techniques such as mixture-of-experts architectures. Small models gain capabilities through better training data and parameter-efficient fine-tuning for finance. The performance gap narrows. The deployment options expand.

Organisations should prepare for the next wave of shifts in financial AI: agentic systems, maturing regulatory frameworks, and build-versus-buy calculations that keep changing as the gap between open-source and proprietary financial LLMs narrows.

The firms shipping LLMs successfully in finance share three habits. They treat data readiness as a prerequisite, not a parallel workstream. They calibrate human oversight to the consequence of the decision, not the novelty of the technology. And they pick the smallest model that clears the accuracy bar, not the largest one that impresses the board. Cost, latency, and compliance posture all fall out of those three choices—which is why, in the work we do at BN Digital, we tend to argue them first and discuss models second. The question is not whether to adopt LLMs in finance. Most organisations already have. The question is which architecture survives its first audit, holds its latency budget, and scales beyond the pilot. That is an engineering discipline question, not a model selection one.
