Zero-cost, ≈Zero inference-time, Zero-shot Financial Sentiment Analysis

Moshe Wasserblat
5 min read · Mar 24, 2024

Hugging Face’s Moritz Laurer demonstrated in a great blog how to use pseudo-labeled data to achieve GPT-4-like performance at a fraction of the budget, carbon emissions, and inference time. In practice, Moritz distilled Mixtral’s knowledge into RoBERTa via a zero-shot text-classification prompt and achieved no accuracy loss. In this post, we take that work one step further and apply extreme compression, roughly 5–6 orders of magnitude in model size, without any accuracy loss. We will show how you can run a simple MLP (Multilayer Perceptron) neural network (almost zero inference time) on a low-resource device (zero cost) and achieve GPT-4-like performance without any labeled data (zero shot). We will demonstrate it on a financial sentiment analysis task.

Let's start by loading Mixtral’s pseudo-labeled data directly from Laurer’s GitHub.

import pandas as pd
from datasets import Dataset

test_df = pd.read_csv('https://raw.githubusercontent.com/MoritzLaurer/synthetic-data-blog/main/data/df_test_HF_2024-02-06-21-57_mixtral.csv')
test_df.head()

train_df = pd.read_csv('https://raw.githubusercontent.com/MoritzLaurer/synthetic-data-blog/main/data/df_train_HF_2024-02-06-21-57_mixtral.csv')
train_df.head()

test_dataset = Dataset.from_dict({"text": test_df['text'], "label": test_df['label_experts']})  # Expert (human) labels for evaluation
train_dataset = Dataset.from_dict({"text": train_df["text"], "label": train_df['label_llm_cot_multiple']})  # Mixtral pseudo labels (CoT + self-consistency)

Figure 1: Zero-shot financial sentiment detection performance as evaluated in Laurer’s blog: zero-shot for the 3 generative LLMs, and a fine-tuned RoBERTa based on the zero-shot pseudo-labeled data from Mixtral (CoT + SC) (~1,800 data rows/texts).

Train SetFit with Mixtral’s pseudo-labeled data.

Here we distill Mixtral’s knowledge into SetFit (in a similar way as Laurer demonstrated for RoBERTa). Please note that the ‘bge-small-en-v1.5’ model has only ~30M parameters, compared to RoBERTa-base’s ~125M.

from setfit import SetFitModel, Trainer, TrainingArguments

model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")  # the ~30M-parameter encoder mentioned above

args = TrainingArguments(
    num_iterations=5,  # Override the few-shot default; we have plenty of pseudo-labeled rows
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

metrics = trainer.evaluate()
print(metrics)
{'accuracy': 0.9405}

A Question

Is it possible to distill further into a tiny MLP model (~1M parameters, or even fewer) without any accuracy loss?

If yes, we can deploy a model that delivers GPT-4-level performance on tiny devices such as smartwatches.

Following my blog “Best Practices for Text Classification”, we first need to estimate the Word Order Sensitivity (WOS) of the dataset.

If the WOS is relatively small, it means the predictions are not sensitive to word order; hence a non-contextual model like an MLP should be sufficient.

What is Word Order Sensitivity?

Pham et al. showed a surprising phenomenon: between 75% and 90% of the correct predictions of Transformer-based classifiers trained on General Language Understanding Evaluation (GLUE) tasks remained unchanged when the input words were randomly shuffled! The authors further suggested a simple metric to measure a dataset’s sensitivity to word order:

WOS (Word Order Sensitivity) = (100 − p) / (100 − b), where p ∈ [50, 100] is the accuracy of the GLUE-trained model evaluated on the word-shuffled dev set (dev-s), b is the chance baseline (b = 50 for the binary tasks in the paper), and thus WOS ∈ [0, 1].

Empirically, we find that WOS is a good proxy for a dataset’s complexity on text-classification tasks. A low WOS value (< 0.2) will most likely allow no-loss distillation into tiny models.
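For illustration, here is a minimal sketch of how WOS could be estimated for a text-classification dataset; the clf, x_dev, and y_dev names are placeholders for a trained classifier and a labeled dev split, not objects from the original notebook:

import random
from sklearn.metrics import accuracy_score

def shuffle_words(text, seed=0):
    # Randomly permute the words of one sentence to build the "dev-s" set
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def word_order_sensitivity(clf, x_dev, y_dev, b=50.0):
    # p: accuracy (in %) of the trained classifier on the word-shuffled dev set
    p = 100.0 * accuracy_score(y_dev, clf.predict([shuffle_words(t) for t in x_dev]))
    # WOS = (100 - p) / (100 - b); b is the chance baseline (50 for the binary GLUE tasks)
    return (100.0 - p) / (100.0 - b)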

Train MLP

For FSA (financial sentiment analysis) we evaluated the WOS to be 0.197; see our full notebook.

This WOS implies that, on average, roughly 80.3% of the sentences in the FSA dataset are rather simple (insensitive to word order).

Great, we got a relatively small value of WOS (<0.2), so let’s try to train an MLP based on Mixtral’s pseudo-labeling.

Train the MLP using Mixtral’s pseudo-labeled data.

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report
# sklearn's MLP on bag-of-words counts; x/y splits come from the datasets built above
x_train, y_train = train_dataset['text'], train_dataset['label']
x_test, y_test = test_dataset['text'], test_dataset['label']
model_mlp = make_pipeline(CountVectorizer(ngram_range=(1,1), max_features=5000), MLPClassifier(random_state=1, early_stopping=True, hidden_layer_sizes=(64,))).fit(x_train, y_train)
predicted = model_mlp.predict(x_test)
print(accuracy_score(y_test, predicted))
print(classification_report(y_test, predicted))
0.8167770419426048
              precision    recall  f1-score   support

    negative       0.71      0.64      0.67        61
     neutral       0.84      0.97      0.90       265
    positive       0.80      0.59      0.68       127

    accuracy                           0.82       453
   macro avg       0.78      0.73      0.75       453
weighted avg       0.81      0.82      0.81       453

Whoops, not what we expected! 82% accuracy is nice, but it falls substantially short of GPT-4/Mixtral performance.

What went wrong?

Two major issues:

  1. The teacher’s (Mixtral’s) pseudo labels are hard decisions (integer class labels) rather than soft decisions (class probabilities), which limits the student MLP’s ability to mimic the teacher. In Moritz Laurer’s blog, Mixtral/GPT-4 produces only hard decisions; that was fine for relatively large students such as RoBERTa/SetFit, but it is not sufficient for a tiny model like an MLP (see the short sketch after this list).
  2. Distillation into tiny models usually requires much more pseudo-labeled data for training.
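To make the hard-vs-soft distinction concrete, here is a toy illustration (the probability values are made up):

import numpy as np

classes = ["negative", "neutral", "positive"]

hard_label = 1                               # hard decision: only the winning class index ("neutral")
soft_label = np.array([0.07, 0.85, 0.08])    # soft decision: the teacher's full probability distribution

# A student trained on hard labels only sees the argmax;
# a student regressed on soft labels also learns how confident the teacher is about each class.
assert int(np.argmax(soft_label)) == hard_label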

Solution: We could regenerate soft decisions with GPT-4/Mixtral, this time also augmenting with more of the available unlabeled data. But regenerating with GPT-4/Mixtral is time-consuming and requires careful prompt design, and on top of that, the quality of the probabilities GPT-4/Mixtral produce is doubtful. A much simpler solution is to utilize the distilled SetFit (or RoBERTa) model to generate the prediction probabilities.

Train MLP with SetFit predictions

Here are the 3 steps to distill into MLP:

  1. Utilize all available financial data (human labels are not required)
  2. Generate pseudo labels with SetFit
  3. Train the MLP
import numpy as np
from datasets import load_dataset
from sklearn.neural_network import MLPRegressor

# Utilize the full financial_phrasebank data available on Hugging Face's Hub.
dataset = load_dataset("financial_phrasebank", "sentences_50agree", split='train')

# Keep only the text. We don't need the human labels since SetFit will generate the pseudo prediction probabilities.
x_train_50agree = dataset['sentence']

# Predict probabilities (generated by the distilled SetFit model)
y_train_50agree_predict = model.predict_proba(x_train_50agree)

# Numeric test labels; classes assumed ordered negative=0, neutral=1, positive=2 (matching the report below)
label2id = {"negative": 0, "neutral": 1, "positive": 2}
y_test_num = [label2id[label] for label in y_test]

# Train the MLP on the predicted probabilities.
# Note that this time we use sklearn's MLPRegressor, since we need to mimic the teacher's soft decisions (not the integer labels).
model_mlp_x = make_pipeline(CountVectorizer(ngram_range=(1,1), max_features=5000), MLPRegressor(random_state=1, early_stopping=True, hidden_layer_sizes=(256,))).fit(x_train_50agree, y_train_50agree_predict)
predicted = model_mlp_x.predict(x_test)
print(accuracy_score(y_test_num, np.argmax(predicted, axis=1)))
print(classification_report(y_test_num, np.argmax(predicted, axis=1)))
0.9403973509933775
              precision    recall  f1-score   support

           0       0.89      0.93      0.91        61
           1       0.96      0.97      0.96       265
           2       0.93      0.89      0.91       127

    accuracy                           0.94       453
   macro avg       0.93      0.93      0.93       453
weighted avg       0.94      0.94      0.94       453

MLP’s Results Summary

MLP 128K: hidden_layer_sizes = 128, max_features = 1000, accuracy = 93%

MLP 1.28M: hidden_layer_sizes = 256, max_features = 5000, accuracy = 94%
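The 128K/1.28M names follow from a rough count of the trainable weights, which is dominated by the input-to-hidden layer: 1,000 features × 128 hidden units ≈ 128K parameters, and 5,000 × 256 = 1.28M. As a quick sanity check (a minimal sketch, reusing the model_mlp_x pipeline trained above), the exact count can be read off the fitted sklearn estimator:

# Count the trainable weights of the fitted regressor inside the pipeline
mlp = model_mlp_x.named_steps["mlpregressor"]
n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
print(f"MLP parameters: {n_params:,}")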

Nice, this time we got it right! We were able to achieve Mixtral/GPT-4 performance with a tiny MLP model :)

The MLP is extremely fast for both inference (≈zero inference time) and training on a CPU, it runs on your edge device (zero cost), and no human labels are needed (zero shot).
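To substantiate the near-zero inference time on your own hardware, here is a minimal, illustrative check (the mlp_fsa.joblib file name is arbitrary, and the numbers will vary by machine) that serializes the pipeline and times CPU inference:

import os
import time
import joblib

# Serialize the whole pipeline (vectorizer + MLP) and report its size on disk
joblib.dump(model_mlp_x, "mlp_fsa.joblib")
print(f"Model size on disk: {os.path.getsize('mlp_fsa.joblib') / 1e6:.1f} MB")

# Time CPU inference over the test sentences
start = time.perf_counter()
_ = model_mlp_x.predict(x_test)
elapsed = time.perf_counter() - start
print(f"{len(x_test)} sentences in {elapsed * 1000:.1f} ms "
      f"({elapsed / len(x_test) * 1e6:.0f} µs/sentence)")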

Figure 2: Zero-shot financial sentiment detection performance: zero-shot for the 3 generative LLMs, and a fine-tuned MLP-1.28M based on the zero-shot pseudo-labeled data from SetFit (~4,850 data rows/texts)

Figure 3: Performance Summary vs. Model Size (# of parameters)


Moshe Wasserblat

Natural Language Processing (NLP) and Deep Learning (DL) research group manager