Develop an open‑source multimodal version of GPT‑OSS (20B/120B) by integrating a lightweight vision encoder through projection‑based alignment. The system will support image captioning, visual question answering, and multimodal instruction following while preserving the strong textual reasoning abilities of the base model. All training scripts, dataset manifests, and benchmark results will be released publicly to enable reproducibility and community extensions.
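As a rough illustration of what projection-based alignment can look like, the sketch below maps vision-encoder patch features into the language model's embedding space with a single linear layer, then prepends the projected image tokens to the text token embeddings. The description does not specify the projector type or the fusion strategy, so the linear map, the prepend ordering, the toy dimensions, and every function name here (`make_projection`, `project_patches`, `build_multimodal_sequence`) are assumptions made for illustration. It uses plain Python lists to stay dependency-free; a real implementation would use a tensor library.

```python
import random


def make_projection(vision_dim, text_dim, seed=0):
    """Create a (text_dim x vision_dim) weight matrix and a zero bias.

    Hypothetical initialisation: Gaussian weights scaled by 1/sqrt(vision_dim).
    """
    rng = random.Random(seed)
    scale = vision_dim ** -0.5
    weight = [
        [rng.gauss(0.0, scale) for _ in range(vision_dim)]
        for _ in range(text_dim)
    ]
    bias = [0.0] * text_dim
    return weight, bias


def project_patches(patch_features, weight, bias):
    """Map each patch vector (vision_dim) to the text embedding size (text_dim)."""
    return [
        [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patch_features
    ]


def build_multimodal_sequence(patch_features, text_embeddings, weight, bias):
    """Prepend projected image tokens to the text token embeddings (assumed ordering)."""
    return project_patches(patch_features, weight, bias) + text_embeddings


# Toy sizes chosen for illustration only; real encoders and LLMs are far larger.
vision_dim, text_dim = 8, 16
num_patches, num_text_tokens = 4, 5

rng = random.Random(1)
patches = [[rng.random() for _ in range(vision_dim)] for _ in range(num_patches)]
text = [[rng.random() for _ in range(text_dim)] for _ in range(num_text_tokens)]

weight, bias = make_projection(vision_dim, text_dim)
sequence = build_multimodal_sequence(patches, text, weight, bias)

# Sequence length is patches + text tokens; every vector has the text embedding size.
print(len(sequence), len(sequence[0]))  # prints: 9 16
```

In a setup like this, one common choice is to train only the projection weights at first while the language model stays frozen, which is one way to keep the base model's text reasoning intact. Whether this project follows that schedule is not stated.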