SetFitABSA (100M) outperforms BloombergGPT (50B) on FiQA SA

Moshe Wasserblat
Jan 14, 2024

BloombergGPT

In their paper BloombergGPT: A Large Language Model for Finance, Bloomberg presents the development of BloombergGPT, an LLM specialized for the financial domain.

The model is a 50B-parameter language model, trained on a mix of financial and general-purpose data:

  • A 363-billion-token dataset based on Bloomberg’s extensive data sources
  • 345 billion tokens from general-purpose datasets

The model was evaluated on various financial NLP tasks, including Aspect-Based Sentiment Analysis (ABSA). Bloomberg evaluated ABSA in an in-context manner using the FiQA_SA dataset. This dataset contains a test set of 235 finance-related sentences, each prefixed by 5 tagged sentences; together they form the input prompt to the model. Every tagged sentence in the prompt comes with a known aspect and a question asking whether the polarity toward that aspect is Positive, Negative, or Neutral, followed by the answer (the correct polarity). BloombergGPT is expected to predict the polarity of a given aspect in the test sentence based on the 5 tagged example sentences. This is an SB2 ABSA task, which means that the aspect is already given and the model only needs to predict the corresponding polarity.
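To make the prompt layout concrete, here is a minimal sketch of how a 5-shot prompt of this kind could be assembled. The template wording, the `build_prompt` helper, and the example sentences are all illustrative assumptions; they are not taken from FiQA_SA and are not guaranteed to match the exact prompt used in the paper.

```python
# Illustrative only: the template text and helper name are assumptions,
# not the exact prompt format used to evaluate BloombergGPT.
TEMPLATE = "{sentence}\nQuestion: what is the sentiment on {aspect}?\nAnswer:"


def build_prompt(shots, test_sentence, test_aspect):
    """Join tagged examples and one untagged test item into a single prompt."""
    parts = []
    for sentence, aspect, polarity in shots:
        # Each tagged example ends with its gold polarity.
        parts.append(TEMPLATE.format(sentence=sentence, aspect=aspect) + " " + polarity)
    # The test item ends with an open "Answer:" for the model to complete.
    parts.append(TEMPLATE.format(sentence=test_sentence, aspect=test_aspect))
    return "\n\n".join(parts)


# Made-up tagged examples (5 shots).
shots = [
    ("Shares of ACME jump after strong earnings", "ACME", "Positive"),
    ("Regulator fines Globex over disclosure failures", "Globex", "Negative"),
    ("Initech holds its annual meeting on Tuesday", "Initech", "Neutral"),
    ("Umbrella Corp cuts its full-year guidance", "Umbrella Corp", "Negative"),
    ("Hooli wins a major cloud contract", "Hooli", "Positive"),
]

prompt = build_prompt(shots, "Stark Industries misses revenue estimates", "Stark Industries")
print(prompt)
```

The model's completion after the final "Answer:" is then compared with the gold polarity of the test sentence.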

BloombergGPT achieves a weighted F1-score of 75.07 on this task, as reported in Table 8 of the paper.

SetFitABSA

SetFitABSA is a framework for few-shot training of domain-specific ABSA models. It provides an efficient and accurate technique to detect sentiment toward specific aspects of the text. The following graph shows the evaluation results of a SetFitABSA model that was trained on just k finance-related sentences, for small values of k, and tested on the same 235 FiQA_SA test sentences that were used to evaluate BloombergGPT. The size of a SetFitABSA model is determined by its underlying sentence transformer, which is usually considerably smaller than SOTA LLMs.

Specifically, in this work, we used the paraphrase-mpnet-base-v2 sentence transformer, which contains only 110M parameters compared with BloombergGPT's 50B.

We separated the tagged and non-tagged sentences from each prompt in FiQA_SA to create the train and test sets, respectively. The train set contains 646 unique sentences overall. The graph summarizes the evaluation results: it presents the weighted F1-score SetFitABSA achieves as a function of k, the number of sampled training sentences. The result for each k is averaged over 5 seeds. We can see that with only 50 training sentences SetFitABSA exceeds the BloombergGPT score. BTW, if we train the SetFitABSA model on the full training set, we get a weighted F1-score of over 86.
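The averaging protocol behind the graph can be sketched as follows. This is only an outline under stated assumptions: `average_over_seeds` and `placeholder_train_and_score` are hypothetical names, the k values are arbitrary, and the placeholder returns a synthetic number instead of training a model, so the printed values are not real F1 scores.

```python
from statistics import mean


def average_over_seeds(ks, seeds, train_and_score):
    """Return {k: mean score over all seeds} for each training-set size k."""
    return {k: mean(train_and_score(k, seed) for seed in seeds) for k in ks}


def placeholder_train_and_score(k, seed):
    # Hypothetical stand-in. A real version would sample k training sentences
    # with this seed, train SetFitABSA on them, and return the weighted F1
    # on the test set. The value below is synthetic, NOT a real F1 score.
    return float(k + seed)


results = average_over_seeds(
    ks=[10, 20],        # assumed values of k, for illustration only
    seeds=range(5),     # 5 seeds per k, as in the experiment described above
    train_and_score=placeholder_train_and_score,
)
print(results)
```

Averaging over several seeds matters in the few-shot regime, because the score can vary noticeably with which k sentences happen to be sampled.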

Demo Code

The following code demonstrates the training of a SetFitABSA model and its evaluation over the FiQA_SA financial dataset.

Specifically:

  • We sample k=24 sentences from the training set using an arbitrary seed
  • We train the SetFitABSA model using only these k samples
  • We evaluate the trained model over the entire test set, which is exactly the same set that was used to evaluate BloombergGPT. The evaluated task is SB2: aspects are given and the model needs to predict their corresponding polarity

With just 24 training sentences we get a weighted F1-score that is on par with or better than the 75.07 score achieved by BloombergGPT (try it yourself!).

Install SetFit with SetFitABSA option

We have to install SetFit and download a spaCy model, which is used for the initial selection of aspect span candidates. We will download en_core_web_lg.

!pip install -U "setfit[absa]"
!python -m spacy download en_core_web_lg

Import required packages

from setfit import AbsaTrainer, TrainingArguments, AbsaModel
from datasets import load_dataset
from sklearn.metrics import f1_score

Load the FiQA_SA dataset

dataset = load_dataset("ronenlap/SetFitAbsa_FiQA")
train_ds = dataset["train"]
test_ds = dataset["test"]

Simulate the few-shot regime by sampling k sentences for training

k = 24
seed = 35
experiment_ds = train_ds.shuffle(seed=seed).select(range(k))

Training a SetFitABSA Model

Initialize an ABSA model

We’ll initialize an AbsaModel using the strong paraphrase-mpnet-base-v2 base model.

model = AbsaModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
)

Setting the training arguments

args = TrainingArguments(
    num_epochs=1,
    batch_size=4,
    num_iterations=20,
    save_strategy="no",
    report_to="none",
)

Creating a trainer object and training the SetFitABSA model

We need to apply a column mapping as the AbsaTrainer expects "text", "span", "label" and "ordinal" columns. In the mapping, each key is a column name in our dataset and each value is the name the AbsaTrainer expects. The text refers to the sentence, whereas span is an aspect span in the sentence. The label is the corresponding label (e.g. "positive"), while ordinal is used to distinguish spans if they occur multiple times in the sentence. For example, if "stock" is the current aspect span, and it occurs 3 times in the sentence, an ordinal of 0 indicates that the sample is referring to the first occurrence.

trainer = AbsaTrainer(
    model,
    args=args,
    # to train over the entire train set, change experiment_ds to train_ds
    train_dataset=experiment_ds,
    column_mapping={
        "sentence": "text",
        "aspect": "span",
        "polarity": "label",
        "ordinal": "ordinal",
    },
)

trainer.train()
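To illustrate the ordinal column described above, here is a small sketch that computes the ordinal of one particular occurrence of a span. The `span_ordinal` helper and the example sentence are hypothetical, and the sketch assumes plain substring matching on characters, which is a simplification and may differ from how SetFitABSA locates spans internally.

```python
def span_ordinal(text, span, start):
    """Count the occurrences of `span` that begin before character index `start`.

    Assumption: occurrences are found by plain substring search.
    """
    ordinal = 0
    pos = text.find(span)
    while pos != -1 and pos < start:
        ordinal += 1
        pos = text.find(span, pos + 1)
    return ordinal


# Made-up sentence in which the aspect span "stock" occurs 3 times.
text = "The stock fell, but the stock recovered and the stock closed higher."

first = text.find("stock")    # character index of the first occurrence
last = text.rfind("stock")    # character index of the third occurrence

print(span_ordinal(text, "stock", first))  # 0
print(span_ordinal(text, "stock", last))   # 2
```

A training row that refers to the third occurrence would therefore carry the span "stock" together with an ordinal of 2.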

Evaluation

As we mentioned above, the evaluated ABSA task is SB2: aspects are given and the model just needs to predict their corresponding polarity.

SetFitABSA supports this: we pass the predict method a dataset in the same format as required for training, minus the label column.

# rename columns according to predict() requirements
test_ds = test_ds.rename_columns({"sentence": "text", "aspect": "span"})
# predict() adds a new column, "pred_polarity", which holds the predicted polarity
output = model.predict(test_ds)
output
Dataset({
    features: ['text', 'span', 'polarity', 'ordinal', 'pred_polarity'],
    num_rows: 235
})

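Then, we can compute the weighted F1 score from the gold polarity and the predicted polarity. Note that scikit-learn reports the score on a 0 to 1 scale, so multiply it by 100 to compare it with the 75.07 reported for BloombergGPT.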

weighted_f1_score = f1_score(output["polarity"], output["pred_polarity"], average="weighted")
print(f"weighted_f1_score: {weighted_f1_score}")
weighted_f1_score: 0.7819641265152422
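For readers unfamiliar with the metric, the following pure-Python sketch shows what a weighted F1 score computes: the per-class F1, averaged with each class weighted by its number of gold examples. The `weighted_f1` helper and the four toy labels are illustrative assumptions; in practice the scikit-learn call above is the one to use.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Average the per-class F1 scores, weighted by each class's gold count."""
    support = Counter(gold)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(gold)


# Toy labels, not taken from FiQA_SA.
gold = ["Positive", "Positive", "Negative", "Neutral"]
pred = ["Positive", "Negative", "Negative", "Neutral"]

score = weighted_f1(gold, pred)
print(round(score, 4))  # 0.75
```

Here Positive and Negative each get an F1 of 2/3 and Neutral gets 1; weighting by the gold counts (2, 1, 1) gives 0.75.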

Inference

Now that we’ve trained and evaluated the model, let’s try it out on a few examples to get an intuitive feel for it.

sentences = [
    "#Tesla: Model X Recall Adds To Reliability Issues $TSLA https://t.co/jVXQ4DoXnP",
    "$CIEN seems to have broken out of a major horizontal resistance. Targets $14.35.",
    "$AAPL I am big OUT from this. seems its falling towards 530.. :(",
]

model.predict(sentences)
[[{'span': 'Tesla', 'polarity': 'Negative'}],
[{'span': 'CIEN', 'polarity': 'Positive'}],
[{'span': 'AAPL', 'polarity': 'Negative'}]]
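Since predict returns one list of span dictionaries per input sentence, it is often convenient to flatten the result into one row per detected aspect. The sketch below does this with the sentences and the output shown above, copied in as literals so that it runs without a trained model; the `rows` name is just illustrative.

```python
sentences = [
    "#Tesla: Model X Recall Adds To Reliability Issues $TSLA https://t.co/jVXQ4DoXnP",
    "$CIEN seems to have broken out of a major horizontal resistance. Targets $14.35.",
    "$AAPL I am big OUT from this. seems its falling towards 530.. :(",
]

# Output of model.predict(sentences), copied from above.
predictions = [
    [{"span": "Tesla", "polarity": "Negative"}],
    [{"span": "CIEN", "polarity": "Positive"}],
    [{"span": "AAPL", "polarity": "Negative"}],
]

# One (sentence, span, polarity) row per detected aspect.
rows = [
    (sentence, item["span"], item["polarity"])
    for sentence, items in zip(sentences, predictions)
    for item in items
]

for _, span, polarity in rows:
    print(f"{span}: {polarity}")
```

A sentence with several detected aspects would simply contribute several rows.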

Notebook available here

Moshe Wasserblat

Natural Language Processing (NLP) and Deep Learning (DL) research group manager