Language and vision models have advanced remarkably in recent years. However, training these large-scale models demands substantial time and computational resources, in particular the powerful GPUs that deep learning requires. This has created a significant gap between Big Tech companies, which possess the resources to train such models, and academia, which often lacks the means to contribute significantly to the field.
To address this issue, we propose an innovative open-vocabulary framework called CaSED (Category Search from External Databases). Unlike traditional large-scale models, CaSED does not rely on extensive training. Instead, it classifies or tags images automatically by retrieving relevant entries from an image-text knowledge base. By leveraging existing knowledge bases rather than expensive training, CaSED eliminates the need for large computational resources.
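The retrieve-then-score idea can be illustrated with a minimal sketch. This is not the CaSED implementation: the toy bag-of-words encoder below stands in for a real shared vision-language encoder such as CLIP, the tiny caption list stands in for a large external image-text database, and the frequency-weighted scoring is an illustrative simplification (all assumptions for the sake of the example).

```python
import numpy as np
from collections import Counter

# Tiny fixed vocabulary standing in for the open set of category names.
VOCAB = ["dog", "cat", "park", "sofa", "grass"]

def embed_text(text):
    """Toy text encoder: L2-normalised bag-of-words vector.
    A real system would use a shared vision-language encoder
    such as CLIP (assumption; not the actual CaSED code)."""
    counts = Counter(text.lower().split())
    v = np.array([counts[w] for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def classify_by_retrieval(image_emb, caption_db, k=2):
    # 1) Retrieve the k captions most similar to the image embedding.
    cap_embs = np.stack([embed_text(c) for c in caption_db])
    top = np.argsort(cap_embs @ image_emb)[::-1][:k]
    retrieved = [caption_db[i] for i in top]
    # 2) Extract candidate category words from the retrieved captions.
    freq = Counter(w for c in retrieved
                   for w in c.lower().split() if w in VOCAB)
    # 3) Score candidates by frequency times image-text similarity.
    score = {w: f * float(embed_text(w) @ image_emb)
             for w, f in freq.items()}
    return max(score, key=score.get)

# Stand-in for an external image-text database.
caption_db = [
    "a dog playing in the park",
    "a happy dog on the grass",
    "a cat sleeping on a sofa",
]
# Stand-in for an image encoder output: pretend the photo shows a dog.
image_emb = embed_text("a dog in the park")
print(classify_by_retrieval(image_emb, caption_db))  # prints "dog"
```

No gradient updates happen anywhere: the "classifier" is just nearest-neighbour retrieval plus scoring in a shared embedding space, which is what lets the approach run without costly training.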
In this talk, we will demonstrate the potential of low-budget approaches in the context of Language and Vision.