Jun 3, 2022 3 min read Transformers

How to Use Hugging Face's New Evaluate Library

Learn how to leverage Hugging Face's brand new library called Evaluate to evaluate your AI models with just a few lines of code

By now I’m sure we’ve all heard of Hugging Face, the company that leads the way for open-source AI models; its Transformers library has over 64k stars on GitHub. Just a few days ago, Hugging Face released yet another Python library called Evaluate. This package makes it easy to evaluate and compare AI models. At release, Hugging Face included 44 metrics, among them accuracy, precision, and recall, which are the three metrics we'll cover in this tutorial. Anyone can contribute new metrics, so I suspect there will soon be far more.

There are many other metrics that I suggest you explore. For example, they included a metric called perplexity, which measures how likely a model considers a given sequence of text. They also included a metric called SQuAD, which is used to evaluate question answering models. The three metrics we'll cover (accuracy, precision, and recall) are fundamental and commonly used for many AI tasks, such as text classification. By reading this article, you'll gain a basic understanding of how to use Evaluate, which you can then apply to quickly pick up other metrics.

Check out the code for this tutorial within Google Colab. Also check out this in-depth tutorial that covers how to apply Hugging Face's Evaluate Library to evaluate a text classification model.

Install

Let’s first install the Python package from PyPI.

pip install evaluate

Import

import evaluate

Metrics

We need to use a function called "load" to load each of the metrics. This function will create an EvaluationModule object.

Accuracy

Documentation for the accuracy metric

accuracy_metric = evaluate.load("accuracy")

Precision

Documentation for the precision metric

precision_metric = evaluate.load("precision")

Recall

Documentation for the recall metric

recall_metric = evaluate.load("recall")

Display

One interesting feature of EvaluationModule objects is that they display their own documentation when printed. Below is an example of printing the accuracy metric.

print(accuracy_metric)

Output:

EvaluationModule(name: "accuracy", module_type: "metric", features: {'predictions': Value(dtype='int32', id=None), 'references': Value(dtype='int32', id=None)}, usage: """
Args:
predictions (list of int): Predicted labels.
references (list of int): Ground truth labels.
normalize (boolean): If set to False, returns the number of correctly classified samples. Otherwise, returns the fraction of correctly classified samples. Defaults to True.
sample_weight (list of float): Sample weights Defaults to None.

Returns:
accuracy (float or int): Accuracy score. Minimum possible value is 0. Maximum possible value is 1.0, or the number of examples input, if normalize is set to True.. A higher score means higher accuracy...

Data

Let's create some data to use for the metrics. The predictions variable stores sample outputs from a model, and the references variable contains the correct (ground truth) labels. We’ll compare the predictions variable against the references variable in the next step.

predictions = [0, 1, 1, 1, 1, 1]
references =  [0, 1, 1, 0, 1, 1]

Results

For each of the metrics, we can use the metric's "compute" method to produce a result.

Accuracy

accuracy_result = accuracy_metric.compute(references=references, predictions=predictions)

print(accuracy_result)

Output: {'accuracy': 0.8333333333333334}

The output is a dictionary with a single key called accuracy. We can isolate this value as shown below.

print(accuracy_result['accuracy'])

Output: 0.8333333333333334
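
To see where this number comes from, here is a small sanity check in plain Python that doesn't use Evaluate at all. It's only a sketch of the arithmetic, and the variable names num_correct and manual_accuracy are my own, not part of the library.

```python
# Same sample data as in the tutorial.
predictions = [0, 1, 1, 1, 1, 1]
references = [0, 1, 1, 0, 1, 1]

# Count the positions where the prediction matches the reference.
num_correct = sum(p == r for p, r in zip(predictions, references))

# Accuracy is the fraction of predictions that are correct.
manual_accuracy = num_correct / len(references)

print(num_correct)      # 5
print(manual_accuracy)  # 0.8333333333333334
```

Five of the six predictions match, so the fraction is 5/6. The raw count of 5 is what the printed documentation describes for normalize set to False.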

Precision

precision_result = precision_metric.compute(references=references, predictions=predictions)

print(precision_result)
print(precision_result["precision"])

Output:

{'precision': 0.8}

0.8
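
The same kind of hand check works for precision. This sketch assumes label 1 is the positive class, which as far as I know is the default for the binary case; the names true_positives and false_positives are just for illustration.

```python
# Same sample data as in the tutorial.
predictions = [0, 1, 1, 1, 1, 1]
references = [0, 1, 1, 0, 1, 1]

pairs = list(zip(predictions, references))

# Predicted positive and actually positive.
true_positives = sum(1 for p, r in pairs if p == 1 and r == 1)
# Predicted positive but actually negative.
false_positives = sum(1 for p, r in pairs if p == 1 and r == 0)

# Precision: of everything predicted positive, how much was right?
manual_precision = true_positives / (true_positives + false_positives)

print(true_positives, false_positives)  # 4 1
print(manual_precision)                 # 0.8
```

The model predicted 1 five times and was right four times, which gives 4/5.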

Recall

recall_result = recall_metric.compute(references=references, predictions=predictions)

print(recall_result)
print(recall_result['recall'])

Output:

{'recall': 1.0}

1.0
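
Recall can be checked by hand in the same way. Again, this is a sketch that assumes label 1 is the positive class, and the names true_positives and false_negatives are my own.

```python
# Same sample data as in the tutorial.
predictions = [0, 1, 1, 1, 1, 1]
references = [0, 1, 1, 0, 1, 1]

pairs = list(zip(predictions, references))

# Actually positive and predicted positive.
true_positives = sum(1 for p, r in pairs if r == 1 and p == 1)
# Actually positive but predicted negative (missed positives).
false_negatives = sum(1 for p, r in pairs if r == 1 and p == 0)

# Recall: of everything actually positive, how much did we find?
manual_recall = true_positives / (true_positives + false_negatives)

print(true_positives, false_negatives)  # 4 0
print(manual_recall)                    # 1.0
```

Every reference labelled 1 was also predicted as 1, so nothing was missed and recall is perfect, even though the one false positive pulled precision down to 0.8.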

Conclusion

We just covered how to use Hugging Face's Evaluate library to compute accuracy, precision, and recall. I suggest you now follow along with this tutorial to apply what you learned to evaluate a text classification model. Be sure to subscribe to Vennify's YouTube channel and sign up for our email list.

Once again, here’s the code for this tutorial within Google Colab.