
How to Use Hugging Face's New Evaluate Library

Learn how to leverage Hugging Face's brand new library called Evaluate to evaluate your AI models with just a few lines of code

By now I’m sure we’ve all heard of Hugging Face, the company leading the way in open-source AI models with its Transformers library, which has over 64k stars on GitHub. Just a few days ago, Hugging Face released yet another Python library called Evaluate. This package makes it easy to evaluate and compare AI models. Upon its release, Hugging Face included 44 metrics, including accuracy, precision, and recall, which are the three metrics we’ll cover in this tutorial. Anyone can contribute new metrics, so I suspect there will soon be far more.

There are many other metrics that I suggest you explore. For example, they included a metric called perplexity, which measures the likelihood of a sequence under a model. They also included a metric called SQuAD, which is used to evaluate question answering models. The three metrics we'll cover (accuracy, precision, and recall) are fundamental and commonly used for many AI tasks, such as text classification. By reading this article, you'll gain a basic understanding of how to use Evaluate, which you can apply to quickly learn how to use other metrics.

Check out the code for this tutorial within Google Colab. Also check out this in-depth tutorial that covers how to apply Hugging Face's Evaluate Library to evaluate a text classification model.


Let’s first install the Python package from PyPI.

pip install evaluate


Now we can import the library.

import evaluate


We'll use a function called "load" to load each of the metrics. This function returns an EvaluationModule object.


Documentation for the accuracy metric

accuracy_metric = evaluate.load("accuracy")


Documentation for the precision metric

precision_metric = evaluate.load("precision")


Documentation for the recall metric

recall_metric = evaluate.load("recall")


One interesting feature of EvaluationModule objects is that their documentation is output when they are printed. Below is an example of printing the accuracy metric.

print(accuracy_metric)



EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
predictions (list of int): Predicted labels.
references (list of int): Ground truth labels.
normalize (boolean): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (list of float): Sample weights Defaults to None.

accuracy (float or int): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if normalize is set to False. A higher score means higher accuracy...


Let's create some data to use with the metrics. The predictions variable stores sample outputs from a model, and the references variable contains the correct (ground truth) labels. We’ll compare the predictions variable against the references variable in the next step.

predictions = [0, 1, 1, 1, 1, 1]
references =  [0, 1, 1, 0, 1, 1]
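Before handing this data to the library, it can help to see what these three metrics will actually compute. Here is the same arithmetic done by hand in plain Python (treating 1 as the positive class); no library is needed for this sketch:

```python
predictions = [0, 1, 1, 1, 1, 1]
references  = [0, 1, 1, 0, 1, 1]

pairs = list(zip(predictions, references))

# Accuracy: fraction of predictions that match the references.
accuracy = sum(p == r for p, r in pairs) / len(pairs)       # 5/6

# Precision: of the samples predicted positive (1), how many truly are.
true_positives = sum(p == 1 and r == 1 for p, r in pairs)   # 4
precision = true_positives / sum(p == 1 for p, _ in pairs)  # 4/5

# Recall: of the truly positive samples, how many were predicted positive.
recall = true_positives / sum(r == 1 for _, r in pairs)     # 4/4

print(accuracy, precision, recall)
```

These hand-computed values should match what the Evaluate metrics return in the steps below.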


For each of the metrics, we can use the metric's "compute" method to produce a result.


accuracy_result = accuracy_metric.compute(references=references, predictions=predictions)


Output: {'accuracy': 0.8333333333333334}

The output is a dictionary with a single key called accuracy. We can isolate this value as shown below.

accuracy_result["accuracy"]


Output: 0.8333333333333334


precision_result = precision_metric.compute(references=references, predictions=predictions)



Output: {'precision': 0.8}



recall_result = recall_metric.compute(references=references, predictions=predictions)



Output: {'recall': 1.0}



We just covered how to use Hugging Face's Evaluate library to compute accuracy, precision, and recall. I suggest you now follow along with this tutorial to apply what you learned to evaluate a text classification model. Be sure to subscribe to Vennify's YouTube channel and sign up for our email list.

Once again, here’s the code for this tutorial within Google Colab.

Book a Call

We may be able to help you or your company with your next NLP project. Feel free to book a free 15-minute call with us.