
Benchmarking Google Embeddings 2 vs Open-Source Models for Multilingual RAG


In a newly released benchmark for multilingual dense retrieval and retrieval-augmented generation (RAG) systems, researchers evaluated Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with a 2,048-token context and explicit task-type conditioning, against five prominent open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet).

The evaluation spanned four BEIR subsets, a synthetic Italian RAG corpus (IT-RAG-Bench), a chunking ablation covering 5 chunk sizes (in tokens) across 3 chunking strategies, and per-query latency measurements on commodity CPU hardware. GE2 ranked first on every task, with a BEIR average nDCG@10 of 0.638 and an IT-RAG-Bench nDCG@10 of 0.282. That accuracy comes at a cost: GE2 registered a median latency of 231.6 ms, roughly 14 times slower than the fastest local models.
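For readers unfamiliar with the headline metric, the following is a minimal sketch of how nDCG@10 is typically computed for a single query. It assumes the common linear-gain formulation with a log2 rank discount; the benchmark's exact evaluation code is not described in the article, and the function names and the toy document IDs here are illustrative only.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_ids: document IDs in the order the retriever returned them.
    relevance:  graded relevance judgments; unjudged documents count as 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example (hypothetical IDs): the only relevant document is returned second.
score = ndcg_at_k(["d2", "d1", "d3"], {"d1": 1.0})
print(round(score, 4))
```

A benchmark-level figure such as the BEIR average reported above would then be the mean of per-query scores within each dataset, averaged across datasets, assuming the usual convention.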

The open-source Multilingual-E5-large (mE5-L) proved to be a strong alternative. On the Italian dataset, mE5-L scored within 0.003 nDCG@10 of GE2 at a latency of 31 ms. That makes mE5-L the preferred choice for real-time systems with sub-100 ms SLAs, a budget that GE2's 231.6 ms median latency exceeds. By contrast, the widely deployed LaBSE model disappointed, averaging only 0.188 nDCG@10 on BEIR and falling behind even the basic mMPNet model.
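The trade-off described above can be expressed as a simple selection rule: among the models that fit a latency budget, take the one with the best retrieval score. The sketch below is illustrative only. The `Candidate` class and `pick_model` function are hypothetical names, and the mE5-L score of 0.279 is an assumption derived from the "within 0.003" statement (0.282 minus 0.003), not a figure reported in the article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    ndcg_at_10: float   # IT-RAG-Bench nDCG@10
    latency_ms: float   # per-query latency on CPU


# GE2 figures are from the article; the mE5-L score is an assumed lower bound.
CANDIDATES: List[Candidate] = [
    Candidate("GE2", ndcg_at_10=0.282, latency_ms=231.6),
    Candidate("mE5-L", ndcg_at_10=0.279, latency_ms=31.0),
]


def pick_model(candidates: List[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Return the highest-scoring candidate within the latency budget, or None."""
    eligible = [c for c in candidates if c.latency_ms <= budget_ms]
    return max(eligible, key=lambda c: c.ndcg_at_10, default=None)


choice = pick_model(CANDIDATES, budget_ms=100.0)
print(choice.name if choice else "no model fits the budget")
```

With a 100 ms budget only mE5-L is eligible; relaxing the budget past GE2's latency lets the small accuracy advantage decide.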

The chunking ablation produced an unexpected result: retrieval performance for all six models saturated at a chunk size of just 32 tokens. Semantic chunking provided measurable gains only at the smallest chunk size, 16 tokens.
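To make the chunk sizes concrete, here is a minimal sketch of fixed-size chunking at 16 and 32 tokens. It assumes whitespace tokenization as a stand-in for a model's subword tokenizer, and `chunk_fixed` with its `overlap` parameter is a hypothetical helper; the article does not name the three strategies used in the study.

```python
from typing import List


def chunk_fixed(tokens: List[str], size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens; the final window may be shorter.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: whitespace tokens approximate the tokenizer used by the models.
text = " ".join(f"w{i}" for i in range(70))
tokens = text.split()
for size in (16, 32):
    chunks = chunk_fixed(tokens, size)
    print(size, len(chunks), [len(c) for c in chunks])
```

Smaller chunks mean more vectors to embed and index per document, which is part of why chunk size interacts with the latency figures discussed elsewhere in the article.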

[AgentUpdate Depth Analysis] Embeddings are the foundation of AI Agent memory and RAG workflows. Google Embeddings 2 sets a new accuracy benchmark, but its 231.6 ms median latency is a significant hurdle for multi-step Agent loops (such as ReAct or planning agents) that issue many retrieval calls per task. The near-parity of local models like mE5-L at 31 ms points to a growing shift toward local, high-throughput retrieval pipelines. The finding that performance saturates at 32-token chunks also challenges the convention of indexing large paragraphs. It suggests that future Agent architectures could pair fine-grained micro-chunking with fast local embedding models to shorten retrieval cycles and reduce context overhead.
