How We Test VLMs for Image Extraction

By The Recalli Team | 6 min read | June 28, 2026

Speed matters when you're scanning a syllabus or event flyer. We tested vision-language models (VLMs) from two providers, Google Gemini (Global) and SiliconFlow (China Region), to find the fastest one. This research helps us deliver the best experience for every user.

How We Tested

Each model received the same two test images, one in English and one in Traditional Chinese. We measured end-to-end time (network plus inference) from API request to response, and checked whether each model returned the correct data. All tests ran from California over a wired connection.

English test image: an event announcement
Chinese test image: an event announcement
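The measurement described above can be sketched roughly as follows. This is a minimal illustration, not our actual harness: `time_extraction`, `fake_model`, and the sample event fields are hypothetical names and placeholder data, and the stand-in function simply sleeps instead of calling a real provider API.

```python
import time
from typing import Callable, Dict, Tuple


def time_extraction(
    extract: Callable[[bytes], Dict[str, str]],
    image: bytes,
    expected: Dict[str, str],
) -> Tuple[float, bool]:
    """Return (elapsed seconds, whether the extracted data matched)."""
    start = time.perf_counter()
    result = extract(image)  # hypothetical wrapper around a provider's API call
    elapsed = time.perf_counter() - start
    return elapsed, result == expected


def fake_model(image: bytes) -> Dict[str, str]:
    """Stand-in for a real VLM request; sleeps to simulate latency."""
    time.sleep(0.05)
    return {"title": "Sample Event", "date": "2026-05-01"}


# Placeholder ground truth for the test image (not real test data).
expected = {"title": "Sample Event", "date": "2026-05-01"}

elapsed, ok = time_extraction(fake_model, b"image-bytes", expected)
print(f"{elapsed:.1f}s", "correct" if ok else "incorrect")
```

Timing the whole call with a monotonic clock captures network and inference time together, which is what a user waiting on a scan would experience.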

Google Gemini (Global)

| Model | English Image | Chinese Image |
|---|---|---|
| Gemini 3.1 Flash-Lite | 2.1s ✅ | 1.4s ✅ |
| Gemini 2.5 Flash-Lite | 1.4s ✅ | Failed ❌ |
| Gemini 2.5 Flash | 4.7s ✅ | 4.2s ✅ |
| Gemini 3.5 Flash | 6.0s ✅ | 6.2s ✅ |
| Gemini 3 Flash Preview | 6.1s ✅ | 8.5s ✅ |
| Gemini 2.5 Pro | 13.9s ✅ | 13.3s ✅ |
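One possible way to read these numbers is to drop any model that failed on either image and rank the rest by their slower result. The sketch below does that with the Gemini figures above; ranking by worst-case time is an assumption made for illustration, and `rank_by_worst_case` is a hypothetical helper, not necessarily how the final choice was made.

```python
from typing import Dict, List, Optional, Tuple

# (English seconds, Chinese seconds); None marks a failed result.
gemini_results: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "Gemini 3.1 Flash-Lite": (2.1, 1.4),
    "Gemini 2.5 Flash-Lite": (1.4, None),
    "Gemini 2.5 Flash": (4.7, 4.2),
    "Gemini 3.5 Flash": (6.0, 6.2),
    "Gemini 3 Flash Preview": (6.1, 8.5),
    "Gemini 2.5 Pro": (13.9, 13.3),
}


def rank_by_worst_case(
    results: Dict[str, Tuple[Optional[float], Optional[float]]],
) -> List[Tuple[str, float]]:
    """Keep models that passed both images, sorted by their slower time."""
    passed = {
        name: max(t for t in times if t is not None)
        for name, times in results.items()
        if None not in times
    }
    return sorted(passed.items(), key=lambda item: item[1])


ranking = rank_by_worst_case(gemini_results)
for name, worst in ranking:
    print(f"{name}: {worst}s")
```

Under this assumption Gemini 3.1 Flash-Lite ranks first, since Gemini 2.5 Flash-Lite is excluded for failing the Chinese image despite its faster English result.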

SiliconFlow (Hong Kong Server, China Region)

| Model | English Image | Chinese Image |
|---|---|---|
| Gemma 4 12B | 2.7s ✅ | 2.6s ✅ |
| Gemma 4 26B MoE | 3.7s ✅ | 7.2s ✅ |
| Qwen3.6 35B MoE | 9.9s ✅ | 7.5s ✅ |
| Qwen3-VL 8B | 4.4s ⚠️ | 8.2s ⚠️ |
| Gemma 4 31B | 6.5s ✅ | 9.3s ✅ |
| MiniMax M3 | 10.8s ✅ | 9.5s ✅ |
| Qwen3-VL 32B | 13.4s ✅ | 15.5s ✅ |
| Nex N2 Pro | 21.0s ✅ | 15.7s ✅ |
| GLM-5V Turbo | 9.1s ✅ | 19.8s ✅ |
| Qwen3.5 122B MoE | 124.3s ✅ | 22.7s ✅ |
| Qwen3.5 35B MoE | 75.7s ✅ | No items ❌ |
| Qwen3.5 397B MoE | 26.9s ✅ | 34.7s ✅ |
| Kimi K2.7 Code | 18.5s ✅ | 44.5s ✅ |
| Qwen3.5 9B | 60.7s ✅ | 48.8s ✅ |
| Qwen3.6 27B | 23.2s ✅ | 53.5s ✅ |
| Kimi K2.5 | 47.6s ✅ | 55.9s ✅ |
| Qwen3.5 27B | 29.1s ✅ | 77.3s ✅ |
| Kimi K2.6 | 74.5s ✅ | 78.8s ✅ |
| Qwen3-VL 32B Thinking | 110.5s ✅ | 105.7s ✅ |

⚠️ minor inaccuracy (wrong time format or value)