AI cost optimization for financial data analysis

Key results

  • Future-proof infrastructure allowing seamless model updates and data integration
  • Institutional-grade responsiveness through automated regional failover and load balancing

About the Client

Bridgewise is a global leader in financial intelligence, providing automated fundamental analysis and investment insights for global stocks. The platform partners with financial institutions to offer multilingual equity and fund analytics, making institutional-grade research accessible to both professional advisors and investors worldwide.

Business Challenge

The primary challenge was managing the cost and latency of a system processing millions of requests per minute. Each request required LLM-based processing, making infrastructure costs highly sensitive to traffic spikes. Relying on a single premium model provider created risks around rate limits, regional latency, and rising operational expenses.

The objective was to build a production system that could handle high-throughput data ingestion while keeping costs predictable and performance stable during peak demand.

Solution Overview

The system was built from scratch on AWS, following production-grade architecture patterns and ISO 27001 requirements. To manage hundreds of gigabytes of daily data, we implemented a dynamic model routing layer that selects the model, region, and provider for each task, with cost governance and infrastructure cost control as primary design constraints.

The architecture draws on a diverse set of models from multiple providers. Rather than using high-cost models for every operation, the system selects a model based on the specific pipeline stage, such as generation, translation, or internal routing.

Whenever quality allows, workloads are moved to open-source models to optimize for speed and price.
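
As a simplified illustration of how such a routing layer can work, a routing table can map each pipeline stage to the cheapest model that clears its quality bar. The stage names, model identifiers, and per-token prices below are hypothetical placeholders, not the production configuration.

    from dataclasses import dataclass

    @dataclass
    class ModelOption:
        name: str                   # model identifier (placeholder values below)
        provider: str               # hosting provider or API
        cost_per_1k_tokens: float   # blended price per 1K tokens, USD
        quality_tier: int           # 1 = highest quality, 3 = lowest acceptable

    # Hypothetical routing table: each pipeline stage lists its acceptable models.
    ROUTING_TABLE = {
        "generation": [
            ModelOption("oss-large", "provider-b", 0.0010, 2),
            ModelOption("premium-large", "provider-a", 0.0100, 1),
        ],
        "translation": [
            ModelOption("oss-small", "self-hosted", 0.0002, 3),
            ModelOption("premium-large", "provider-a", 0.0100, 1),
        ],
        "internal_routing": [
            ModelOption("oss-small", "self-hosted", 0.0002, 3),
        ],
    }

    def select_model(stage, required_tier):
        """Return the cheapest model for the stage that meets the quality bar."""
        candidates = [m for m in ROUTING_TABLE[stage] if m.quality_tier <= required_tier]
        return min(candidates, key=lambda m: m.cost_per_1k_tokens)

    # A translation task that tolerates tier-3 quality gets the open-source model;
    # one that requires tier-1 quality routes to the premium model instead.
    assert select_model("translation", required_tier=3).name == "oss-small"
    assert select_model("translation", required_tier=1).name == "premium-large"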

To maintain stability under peak load, we implemented several technical mechanisms:

  • Fallback Cascades: If a primary model becomes unavailable or exceeds its rate limit, the system automatically redirects requests to secondary models to maintain service continuity (see the first sketch after this list).
  • Tiered Execution (Live Calls vs Batch Inference): The system prioritizes tasks by urgency. Real-time advisor queries use live calls, while high-volume, non-latency-sensitive workloads are offloaded to batch inference, which offers roughly a 50% cost reduction and higher throughput quotas. This separation ensures that massive data processing jobs never compete for resources with live user sessions (see the second sketch below).
  • Live Benchmarking & Regional Failover: A dedicated module monitors regional performance and execution economics. Traffic is routed to the fastest-responding region, and the system automatically switches regions when quotas are exceeded or latency spikes occur.
  • Continuous Cost Control and Optimization: Ongoing processes evaluate token consumption and regional pricing, keeping the system cost-efficient as traffic patterns evolve.
  • Model Lifecycle Management: To ensure technical longevity and system reliability, we implemented automated handling of LLM deprecation. As new models are released, the system evaluates their performance, updates the routing logic, and retires inefficient models.
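
Below is a minimal sketch of the fallback cascade pattern. The error types, client objects, and model names are hypothetical stand-ins; an actual implementation would wrap the specific provider SDKs in use.

    import logging

    class RateLimitError(Exception):
        """Raised by a provider client when its quota is exhausted (hypothetical)."""

    class ProviderUnavailableError(Exception):
        """Raised when a provider endpoint cannot be reached (hypothetical)."""

    def call_with_fallback(prompt, cascade):
        """Try each (model_name, client) pair in order; fall through on failure.

        `cascade` is ordered by preference: the primary model first, then
        secondary models or alternative regions/providers.
        """
        last_error = None
        for model_name, client in cascade:
            try:
                return client.generate(prompt, model=model_name)
            except (RateLimitError, ProviderUnavailableError) as err:
                logging.warning("Model %s failed (%s); falling back", model_name, err)
                last_error = err
        raise RuntimeError("All models in the cascade failed") from last_error

The tiered-execution split can be expressed just as compactly. The task attributes and the live/batch interfaces here are illustrative assumptions, not the production system's actual APIs.

    from enum import Enum

    class Urgency(Enum):
        LIVE = "live"     # real-time advisor queries
        BATCH = "batch"   # bulk, non-latency-sensitive ingestion jobs

    def dispatch(task, live_client, batch_queue):
        """Send urgent tasks to a synchronous live call; queue the rest for batch inference."""
        if task.urgency is Urgency.LIVE:
            # Live path: lowest latency, full per-request price.
            return live_client.generate(task.prompt, model=task.model)
        # Batch path: accumulated and submitted as a batch job, trading latency
        # for roughly half the cost and higher throughput quotas.
        batch_queue.enqueue(task)
        return None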

The architecture also supports hybrid infrastructure options, including dedicated compute and alternative model-hosting solutions. Strategic self-hosting is planned as the next step to further reduce external dependencies and lower operational costs.

Value Delivered

The architecture allows the platform to handle millions of requests and process hundreds of gigabytes of data daily without service degradation. By implementing a dynamic, governed system, we achieved the following results:

  • Sub-linear Cost Scaling: Our continuous cost control and optimization processes decouple infrastructure costs from system load. Currently, a 2x increase in user requests results in only a 40% increase in infrastructure costs, rather than a linear doubling of expenses.
  • Operational Efficiency: In high-volume tasks, switching to open-source models provided savings of approximately 2-10x compared to premium model alternatives.
  • Predictable Performance: The system remains stable under peak load by shifting traffic across regions and providers, ensuring consistent, low-latency responsiveness.
  • Future-Proof Governance: With LLM deprecation management and cost governance, the client can integrate new data sources or update model versions without redesigning the infrastructure.

KPIs

  • 2-10x
    savings on high-volume tasks through open-source model optimization
  • 50%
    cost reduction for batch processing workloads


Location

  • USA

Industry

  • Financial Services

Services

  • AI Strategy & LLM Engineering
  • Data Engineering
  • Cloud Infrastructure & Architecture
  • FinOps & Cost Optimization

Technologies

  • Python
  • AWS
  • Kubernetes