shuyuan-vl

A Vision-Language model specialised in comprehending the real world and operating photo-capturing devices. The following is a gallery of photographs taken across different scenarios.

📷 Photo Gallery

Sunset at Portsmouth Harbour, 2025

Bridge and Lake, Wuhan, 2024

🐦 Birds

Bird identification in the wild remains a challenge for latest VLMs. Resolution, lighting, and noisy background features are all limiting factors. Even the latest native multimodal VLMs would need web searching tools and (most beneficial) geographical information to make accurate predictions.