How We Test VLMs for Image Extraction

By The Recalli Team • 6 min read • June 28, 2026

Speed matters when you're scanning a syllabus or event flyer. We tested vision-language models (VLMs) from two providers, Google Gemini (Global) and SiliconFlow (China Region), to find the fastest ones. This research helps us deliver the best experience for every user.

How We Tested

Each model received the same two test images, one in English and one in Traditional Chinese. We measured the total time (network plus inference) from API request to response, and checked whether each model returned the correct data. All tests ran from California over a wired connection.
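For readers who want to run a similar measurement, here is a minimal sketch of one timed trial in Python. It is not our actual test harness: `run_trial`, `TrialResult`, and `fake_call_model` are hypothetical names, the stub stands in for a real provider API call, and the expected fields are made up for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class TrialResult:
    model: str
    seconds: float   # wall-clock time from request to response
    correct: bool    # True if the extracted data matched the expected data


def run_trial(
    model: str,
    call_model: Callable[[str, bytes], dict],
    image: bytes,
    expected: dict,
) -> TrialResult:
    """Time a single request and compare the response to the expected data."""
    start = time.perf_counter()
    extracted = call_model(model, image)       # network + inference happen here
    elapsed = time.perf_counter() - start
    return TrialResult(model, elapsed, extracted == expected)


def fake_call_model(model: str, image: bytes) -> dict:
    """Hypothetical stand-in for a real API call; sleeps to simulate latency."""
    time.sleep(0.01)
    return {"title": "Example Event", "start": "2026-07-01T18:00"}


if __name__ == "__main__":
    expected = {"title": "Example Event", "start": "2026-07-01T18:00"}
    result = run_trial("stub-model", fake_call_model, b"", expected)
    print(f"{result.model}: {result.seconds:.2f}s correct={result.correct}")
```

The timer wraps the whole call, so the number includes both network latency and model inference, matching the measurement described above. An exact dictionary comparison is the strictest possible check; a real harness might normalize values such as time formats before comparing.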

English test image: an event announcement

Chinese test image: an event announcement

Google Gemini (Global)

Model	English Image	Chinese Image
Gemini 3.1 Flash-Lite	2.1s ✅	1.4s ✅
Gemini 2.5 Flash-Lite	1.4s ✅	Failed ❌
Gemini 2.5 Flash	4.7s ✅	4.2s ✅
Gemini 3.5 Flash	6.0s ✅	6.2s ✅
Gemini 3 Flash Preview	6.1s ✅	8.5s ✅
Gemini 2.5 Pro	13.9s ✅	13.3s ✅

SiliconFlow (Hong Kong Server, China Region)

Model	English Image	Chinese Image
Gemma 4 12B	2.7s ✅	2.6s ✅
Gemma 4 26B MoE	3.7s ✅	7.2s ✅
Qwen3.6 35B MoE	9.9s ✅	7.5s ✅
Qwen3-VL 8B	4.4s ⚠️	8.2s ⚠️
Gemma 4 31B	6.5s ✅	9.3s ✅
MiniMax M3	10.8s ✅	9.5s ✅
Qwen3-VL 32B	13.4s ✅	15.5s ✅
Nex N2 Pro	21.0s ✅	15.7s ✅
GLM-5V Turbo	9.1s ✅	19.8s ✅
Qwen3.5 122B MoE	124.3s ✅	22.7s ✅
Qwen3.5 35B MoE	75.7s ✅	No items ❌
Qwen3.5 397B MoE	26.9s ✅	34.7s ✅
Kimi K2.7 Code	18.5s ✅	44.5s ✅
Qwen3.5 9B	60.7s ✅	48.8s ✅
Qwen3.6 27B	23.2s ✅	53.5s ✅
Kimi K2.5	47.6s ✅	55.9s ✅
Qwen3.5 27B	29.1s ✅	77.3s ✅
Kimi K2.6	74.5s ✅	78.8s ✅
Qwen3-VL 32B Thinking	110.5s ✅	105.7s ✅

⚠️ minor inaccuracy (wrong time format or value)
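
One way to turn tables like these into a shortlist is sketched below in Python. The ranking rule is an assumption for illustration, not necessarily how we pick a model: keep only models that returned fully correct data for both images, then sort by the slower of the two times. The rows are a small sample copied from the tables above, and `Row` and `rank_models` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    model: str
    english_s: Optional[float]   # None when the run produced no usable result
    chinese_s: Optional[float]
    fully_correct: bool          # True only if both images were marked correct


# A sample of rows taken from the tables above.
ROWS = [
    Row("Gemini 3.1 Flash-Lite", 2.1, 1.4, True),
    Row("Gemini 2.5 Flash-Lite", 1.4, None, False),   # Chinese image failed
    Row("Gemini 2.5 Flash", 4.7, 4.2, True),
    Row("Gemma 4 12B", 2.7, 2.6, True),
    Row("Qwen3-VL 8B", 4.4, 8.2, False),              # minor inaccuracy
]


def rank_models(rows: list[Row]) -> list[Row]:
    """Keep fully correct models and sort by worst-case (slower) time."""
    passing = [
        r for r in rows
        if r.fully_correct and r.english_s is not None and r.chinese_s is not None
    ]
    return sorted(passing, key=lambda r: max(r.english_s, r.chinese_s))


if __name__ == "__main__":
    for r in rank_models(ROWS):
        print(f"{r.model}: worst case {max(r.english_s, r.chinese_s)}s")
```

Sorting by the slower time favors models that are consistently fast in both languages. Sorting by the average instead would be an equally reasonable choice and may reorder models whose two times differ widely.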