First Benchmark Introduced for Evaluating Multimodal AI Models in Russian

The open Russian-language benchmark MWS Vision Bench is designed to assess the quality of multimodal artificial intelligence models (vision-language models, VLMs), which can analyze images and text simultaneously.

Generated by the Midjourney neural network

As emphasized by MTS Web Services, the new MWS Vision Bench is the first benchmark focused on evaluating multimodal models in real-world product scenarios that involve Russian-language documents. The tool makes it possible to test generative artificial intelligence on recognizing and understanding documents that contain visual data. The company explained:

Modern models can analyze contracts, invoices, forms, diagrams, and tables. However, existing international benchmarks, such as OCRBench, AI2D, and MMMU, only cover English and Chinese. There were no suitable benchmarks in Russian until now, which made it impossible to objectively evaluate such models when solving product tasks in Russian companies.

MWS Vision Bench includes 800 images and 2,580 tasks reflecting real-world document workflows in Russian organizations. The set covers office and personal documents, diagrams, handwritten notes, tables, drawings, charts, and graphs. The original dataset was randomly divided into two parts: a validation set (400 images, 1,302 tasks) and a test set (400 images, 1,278 tasks). The validation part of the benchmark is publicly available.

The benchmark's source code is published on GitHub, and the dataset is available on the Hugging Face platform, allowing companies to test both their own and third-party models. At present, the top results belong to Gemini 2.5 Pro, Claude Sonnet 4.5, and ChatGPT-4.1 mini, in that order; ChatGPT-5 and Qwen3-VL also took part in the comparison.
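Evaluating a model against such a benchmark boils down to comparing its answers with reference answers for each task. The article does not describe MWS Vision Bench's actual metrics, so the snippet below is only a minimal illustrative sketch of an exact-match style score with whitespace/case normalization; the field names, normalization rules, and scoring are assumptions, not the benchmark's real implementation.

```python
# Hypothetical sketch of answer scoring for a document-understanding benchmark.
# This is NOT the real MWS Vision Bench metric; see its GitHub repository
# for the actual evaluation code.

def normalize(text: str) -> str:
    """Case-fold and collapse runs of whitespace so trivial formatting
    differences are not counted as errors (an assumed normalization)."""
    return " ".join(text.lower().split())

def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of model answers that exactly match the reference
    after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

# Toy usage with made-up answers to two imaginary tasks:
preds = ["Договор №42", "итого: 1500 руб."]
refs = ["договор №42", "Итого: 1500 руб."]
print(exact_match_score(preds, refs))  # → 1.0
```

In a real run, the predictions would come from querying a VLM with each benchmark image and task prompt, and the aggregate score would be reported per task category.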

Sources:
MWS
