shuyuan-vl
A vision-language model (VLM) specialised in comprehending the real world and operating photo-capturing devices. The gallery below shows photographs taken in different scenarios.
📷 Photo Gallery
Sunset at Portsmouth Harbour, 2025
Bridge and Lake, Wuhan, 2024
🐦 Birds
Bird identification in the wild remains a challenge for the latest VLMs. Image resolution, lighting, and cluttered backgrounds are all limiting factors. Even the latest natively multimodal VLMs still need web search tools and, most usefully, geographical information to make accurate predictions.
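As a rough illustration of why geographical information helps, the sketch below re-weights a model's candidate species by whether each one is expected at the photo's location. It does not describe how shuyuan-vl works: the function name `rerank_with_location`, the `out_of_range_weight` parameter, the placeholder species names, and the scores are all hypothetical, chosen only to show the idea.

```python
from typing import Dict, Set


def rerank_with_location(
    scores: Dict[str, float],
    regional_species: Set[str],
    out_of_range_weight: float = 0.1,
) -> Dict[str, float]:
    """Down-weight candidates not expected in the region, then renormalise.

    `scores` maps candidate species to the model's confidence.
    `regional_species` is an assumed checklist for the photo's location.
    """
    weighted = {
        name: score * (1.0 if name in regional_species else out_of_range_weight)
        for name, score in scores.items()
    }
    total = sum(weighted.values())
    if total == 0:
        # Nothing to renormalise; fall back to the original scores.
        return dict(scores)
    return {name: weight / total for name, weight in weighted.items()}


# Hypothetical example: the top visual match is not expected in the region.
visual_scores = {"species_a": 0.5, "species_b": 0.4, "species_c": 0.1}
local_checklist = {"species_b", "species_c"}

reranked = rerank_with_location(visual_scores, local_checklist)
best_guess = max(reranked, key=reranked.get)
print(best_guess, reranked)
```

In this toy case the visually strongest candidate is demoted because it is absent from the assumed regional checklist. A soft weight is used instead of a hard filter so that a rare out-of-range bird can still win when the visual evidence is strong enough.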