
Image Classification With Hugging Face's Transformers Library

Classify images using pretrained Vision Transformers with Hugging Face's transformers library

I’m sure most of us have heard by now of Transformer models advancing the field of NLP. In 2017, a team of researchers published a paper titled “Attention Is All You Need” that proposed the Transformer model and broke records for machine translation [1]. Since then, there have been consistent innovations, with state-of-the-art Transformer models now outperforming the human baseline on the General Language Understanding Evaluation (GLUE) benchmark [2].

Researchers have begun to apply Transformer models to computer vision. In this article, we'll discuss how to implement the model outlined in the paper published by Google Brain titled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" [3]. This paper shows that, when applied to image classification, Transformer models can achieve state-of-the-art performance while requiring fewer computational resources to train than previous state-of-the-art methods.

We’ll implement a Vision Transformer using Hugging Face’s transformers library. Hugging Face is best known for their NLP Transformer tools, and now they are expanding into Vision Transformers. By using Hugging Face's transformers library, we'll be able to run a pretrained Vision Transformer model with only a few lines of code.


First off, we need to install Hugging Face's transformers library.

pip install transformers

Then, we’ll install a library called Pillow. Pillow lets us load images as objects that we can pass to the transformers library.

pip install pillow



There are two classes we need to import from transformers: ViTFeatureExtractor and ViTForImageClassification. ViTForImageClassification is the class we’ll use to instantiate our model. ViTFeatureExtractor will be used to prepare the image.

from transformers import ViTForImageClassification, ViTFeatureExtractor


from PIL import Image
import requests

Model Instantiation

We’ll use ViTForImageClassification’s “from_pretrained” method to load a model from Hugging Face’s model distribution network. Google’s “google/vit-base-patch16-224” model is, at the time of writing, the most downloaded ViTForImageClassification model and will be used for this tutorial.

There are larger models you can use instead that trade higher computational requirements for better performance. For example, “google/vit-large-patch32-384” has higher accuracy and can run within a free Google Colab GPU instance. Other models can be found on the Hugging Face model hub.

model_name = 'google/vit-base-patch16-224'
model = ViTForImageClassification.from_pretrained(model_name)

Put the model in evaluation mode.

model.eval()
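As a side note, the checkpoint name tells us how the model sees an image. Here is a minimal sketch, assuming the usual ViT setup where a square image is cut into non-overlapping square patches and one extra classification token is prepended. The function name is my own and is not part of transformers.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT sees: one per patch plus one classification token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


# Checkpoint names follow the pattern vit-<size>-patch<P>-<resolution>.
for name, image_size, patch_size in [
    ("google/vit-base-patch16-224", 224, 16),
    ("google/vit-large-patch32-384", 384, 32),
]:
    print(name, vit_sequence_length(image_size, patch_size))
```

So the base model works on 16x16 pixel patches of a 224x224 image, which is where the "16x16 words" in the paper's title comes from.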


Loading an Image

We’ll use both requests and Pillow to load an image from the web to classify. I suggest you change the URL to play around with the model.

url = 'https://live.staticflickr.com/65535/51177932190_73fd0ce6f2_h.jpg'
image = Image.open(requests.get(url, stream=True).raw)

Feature Extractor

We’ll now create a feature extractor using the default settings.

feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)

From here, we can use the feature extractor to generate the encoding that will be passed to the model. The output is a BatchFeature object that contains a single field called “pixel_values.”

encodings = feature_extractor(images=image, return_tensors="pt")
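To give a feel for what happens to each pixel, here is a rough sketch in plain Python. It assumes the checkpoint's preprocessing rescales 8-bit values to [0, 1] and then normalizes each channel with a mean and standard deviation of 0.5. I believe those are the values this checkpoint ships with, but you can check feature_extractor.image_mean and feature_extractor.image_std to be sure. The function name and the sample pixel are made up for illustration.

```python
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Map an 8-bit channel value (0-255) to the range the model expects."""
    scaled = value / 255.0        # rescale to [0, 1]
    return (scaled - mean) / std  # shift and scale; [-1, 1] when mean = std = 0.5


# A hypothetical RGB pixel (assumption: values chosen only for illustration)
pixel = (0, 128, 255)
print([round(normalize_pixel(channel), 3) for channel in pixel])
```

The real feature extractor also resizes the image, so for this model "pixel_values" should have the shape (1, 3, 224, 224): one image, three color channels, 224x224 pixels.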

Calling The Model

We now have everything we need to call the model! Let’s provide the pixel values to the model.

pixel_values = encodings["pixel_values"]
outputs = model(pixel_values) 

Extracting the Answer

Run the following code to retrieve the prediction. To explain in layman's terms, the model outputs a score for each possible answer as “logits.” In line one, we extract these values. Then, in line two, we retrieve the index for the answer with the highest score. Finally, we convert the index into a label.

logits = outputs.logits
predicted_class_idx = logits.argmax(-1).item()
answer = model.config.id2label[predicted_class_idx]

Let’s print the result!

print("Predicted answer: " + answer)


Predicted answer: Blenheim spaniel
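If you also want to know how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below shows the idea in plain Python on a toy example. The three logits and labels are invented for illustration and are not real model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits and labels for a three-class toy model
toy_logits = [1.2, 4.7, 0.3]
toy_id2label = {0: "tabby cat", 1: "Blenheim spaniel", 2: "golden retriever"}

probs = softmax(toy_logits)
# Index of the highest probability, the same role argmax plays in the code above
best_idx = max(range(len(probs)), key=probs.__getitem__)
print(toy_id2label[best_idx], round(probs[best_idx], 3))
```

With the real model, calling outputs.logits.softmax(-1) should give you the same kind of probabilities across all 1,000 ImageNet classes.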

GPU Usage

There are a few additional steps you must follow if you wish to use a GPU.

First, we need to detect if a GPU is available and then move the model to it if it is.

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

You can then follow all of the steps outlined above, with one exception: before providing the pixel values to the model, first move them to the same device.

pixel_values = encodings["pixel_values"].to(device)
outputs = model(pixel_values) 

Now, you can follow the steps outlined in the Extracting the Answer section.


And that's it! Congratulations, you just learned how to implement a state-of-the-art image classification model. There are endless applications for image classification models, and hopefully, you'll be able to apply what you learned to help better the lives of others.




[1] https://arxiv.org/pdf/1706.03762.pdf

[2] https://gluebenchmark.com/leaderboard

[3] https://arxiv.org/pdf/2010.11929.pdf
