NOVIC: Unconstrained Open Vocabulary Image Classification

NOVIC is...

Generative
Image classifications are produced directly as free-form text

Prompt-free
Does not require any input list of candidate classifications

Text-only Trained
No images are used at all during training, only synthetic text!

Efficient
Trains in 1-3 days on a single RTX A6000 (fast because no images are required)

Real-time
26ms inference time per single image, or 7ms per image if batched

Open Source
Detailed GitHub repository with commands to reproduce results

Open Vocabulary
Trained using a full English dictionary of nouns (no core or prioritized vocabulary)

Scalable
Performance scales well with the performance of the underlying CLIP model

Reliable
Achieves accurate fine-grained prediction scores up to 87.5% on arbitrary images

Live Demo

You can try out NOVIC on your own images using the Hugging Face Spaces Live Demo! Alternatively, you can run NOVIC inference on a local GPU using the pre-built NOVIC Docker image and the accompanying compact instructions.

Screenshot of live NOVIC model demo on Hugging Face Spaces

Approach

Model training and inference pipeline overview schematic

NOVIC is an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite being capable of open vocabulary classification, require an exhaustive list of potential class labels as a prompt, which restricts their application to images of known content or context. To address this, NOVIC uses an object decoder model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. NOVIC has been tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieves fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering that the model must work for any conceivable image and without any contextual clues.

Object decoder model architecture overview schematic

At the heart of the NOVIC architecture is the object decoder, which effectively inverts the CLIP text encoder and learns to map CLIP embeddings to object noun classification labels in the form of tokenized text. During training, a synthetic text-only dataset is used to train the object decoder to map the CLIP text embeddings of templated/generated captions to the underlying target object nouns. During inference, zero-shot transfer is used to map CLIP image embeddings (as opposed to text embeddings) to predicted object nouns. The ability of the object decoder to generalize from text embeddings to image embeddings is non-trivial, as all CLIP models exhibit a large modality gap between the two types of embeddings: text and image embeddings in fact occupy two completely disjoint regions of the embedding space, separated by a wide margin.
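To make the modality gap more concrete, the following is a minimal sketch of one simple way such a gap could be quantified: the distance between the centroids of the unit-normalized text and image embeddings. It is not taken from the NOVIC code base. The embeddings here are synthetic toy vectors rather than real CLIP outputs, and the helper names (`modality_gap`, `toy_embeddings`) are hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length, as is typically done with CLIP embeddings."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def modality_gap(embeds_a, embeds_b):
    """Euclidean distance between the centroids of two sets of unit-normalized embeddings."""
    center_a = centroid([normalize(v) for v in embeds_a])
    center_b = centroid([normalize(v) for v in embeds_b])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(center_a, center_b)))

def toy_embeddings(center, count, rng, noise=0.1):
    """ASSUMPTION: synthetic stand-in for real embeddings, Gaussian noise around a center."""
    return [[c + rng.gauss(0.0, noise) for c in center] for _ in range(count)]

rng = random.Random(0)
dim = 8
# Toy data only: the two "modalities" are clustered around different directions.
text_embeds = toy_embeddings([1.0] + [0.0] * (dim - 1), 200, rng)
image_embeds = toy_embeddings([0.0, 1.0] + [0.0] * (dim - 2), 200, rng)

gap_between = modality_gap(text_embeds, image_embeds)            # across modalities
gap_within = modality_gap(text_embeds[:100], text_embeds[100:])  # within one modality
print(f"between modalities: {gap_between:.3f}, within text: {gap_within:.3f}")
```

With real data, `text_embeds` and `image_embeds` would be replaced by the outputs of a CLIP text and image encoder for matching captions and images. A between-modality distance that is much larger than the within-modality one is what a decoder trained only on text embeddings has to bridge.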

Training Datasets

Example prompt-templated and LLM-generated caption-object text pairs

Despite being an image classification model, NOVIC is trained exclusively on synthetic textual data in the form of matching caption-object pairs. These are primarily generated using a multi-object prompt templating strategy based on the Object Noun Dictionary. The remaining caption-object pairs were generated once using an LLM, based on the same dictionary of nouns. The complete data required for training is available on Papers With Code as the NOVIC Caption-Object Data.
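As a rough illustration of what prompt templating into caption-object pairs can look like, here is a minimal sketch. It is not the actual NOVIC templating strategy: the templates, the tiny noun list, and the rule that the first noun in a caption is the target are all assumptions made for this example.

```python
import random

# ASSUMPTION: illustrative templates only, not the ones used to build the NOVIC data.
TEMPLATES = [
    "a photo of the {}",
    "a close-up picture of the {}",
    "a photo of the {} next to the {}",
]

# ASSUMPTION: a tiny stand-in for a full dictionary of object nouns.
NOUNS = ["dog", "bicycle", "teapot", "lighthouse", "violin"]

def make_caption_object_pairs(nouns, templates, count, seed=0):
    """Generate (caption, target noun) text pairs by filling templates with nouns.

    ASSUMPTION: for multi-object templates, the first noun is treated as the target.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        template = rng.choice(templates)
        num_slots = template.count("{}")
        chosen = rng.sample(nouns, num_slots)  # distinct nouns for each slot
        caption = template.format(*chosen)
        pairs.append((caption, chosen[0]))
    return pairs

pairs = make_caption_object_pairs(NOUNS, TEMPLATES, count=5)
for caption, target in pairs:
    print(f"{caption!r} -> {target!r}")
```

In a training setup along these lines, each caption would be passed through the CLIP text encoder and the decoder would be trained to output the paired target noun from the resulting embedding.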

Evaluation Datasets

In order to properly evaluate a totally novel generative image classification model like NOVIC, three new open vocabulary datasets were constructed and annotated. These datasets are generic, i.e. not specific to NOVIC, and are referred to as the OVIC Datasets. Refer to Papers With Code and the WACV Paper for more information. The three new evaluation datasets are as follows:

World (272 images)

Mostly never-online photos from 10 countries by 12 people, including diverse, atypical, deceptive, and indirect objects.

Wiki (1000 images)

Wikipedia article lead images that were sampled from a pool of 18K such scraped images.

Val3K (3000 images)

Images from the ImageNet-1K validation set, sampled uniformly across the classes.
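For readers who want to build a comparable class-balanced subset, the sketch below shows one way to sample a fixed number of images per class. It assumes that "sampled uniformly across the classes" means an equal number of images per class (3000 images over the 1000 ImageNet-1K classes would be 3 per class). The function name and the placeholder label list are hypothetical, and this does not reproduce the actual Val3K selection.

```python
import random
from collections import defaultdict

def sample_per_class(labels, per_class, seed=0):
    """Return sorted indices containing exactly `per_class` random samples of each label."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    selected = []
    for label in sorted(indices_by_class):
        selected.extend(rng.sample(indices_by_class[label], per_class))
    return sorted(selected)

# Placeholder labels shaped like the ImageNet-1K validation set:
# 1000 classes with 50 images each (50,000 images in total).
labels = [class_id for class_id in range(1000) for _ in range(50)]

subset = sample_per_class(labels, per_class=3)
print(len(subset))  # 3000
```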

BibTeX

If you use this project in your research, please cite the WACV 2025 paper, and optionally the GitHub repository as well:

@InProceedings{allgeuer_novic_2025,
    author    = {Philipp Allgeuer and Kyra Ahrens and Stefan Wermter},
    title     = {Unconstrained Open Vocabulary Image Classification: {Z}ero-Shot Transfer from Text to Image via {CLIP} Inversion},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    year      = {2025},
}

@Misc{github_novic,
    title = {{NOVIC}: {U}nconstrained Open Vocabulary Image Classification},
    author = {Philipp Allgeuer and Kyra Ahrens},
    url = {https://github.com/pallgeuer/novic},
}

Given an image and nothing else, NOVIC can generate an accurate fine-grained textual classification label in real-time, with coverage of the vast majority of the English language.