Industry-sensitive Language Model for Business. The model is available on HuggingFace: https://huggingface.co/pborchert/BusinessBERT
from transformers import AutoModel
model = AutoModel.from_pretrained("pborchert/BusinessBERT")
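To obtain contextual embeddings, the tokenizer can be loaded in the same way. Below is a minimal usage sketch (the example sentence is illustrative):

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("pborchert/BusinessBERT")
model = AutoModel.from_pretrained("pborchert/BusinessBERT")

# Encode a business-related sentence and run a forward pass without gradients
inputs = tokenizer("The company reported strong quarterly earnings.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)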
- Pretrained Transformer: BERT-Base architecture
- Trained on business communication extracted from:
  - Management Discussion and Analysis statements (CaltechDATA | MD&A)
  - Company website content (This study | CompanyWeb)
  - Scientific business literature (Semantic Scholar | S2ORC)
- Additional pretraining objective: Industry classification (IC), predicting the standard industry classification that a textual document originates from
- SOTA performance on business-related text classification, named entity recognition, and question answering benchmarks
We introduce BusinessBERT, a new industry-sensitive language model for business applications. The key novelty of our model lies in incorporating industry information to enhance decision-making in business-related natural language processing (NLP) tasks. BusinessBERT extends the Bidirectional Encoder Representations from Transformers (BERT) architecture by embedding industry information during pretraining through two innovative approaches that enable BusinessBERT to capture industry-specific terminology: (1) BusinessBERT is trained on business communication corpora totaling 2.23 billion tokens, consisting of company website content, MD&A statements, and scientific papers in the business domain; (2) we employ industry classification as an additional pretraining objective. Our results suggest that BusinessBERT improves data-driven decision-making by providing superior performance on business-related NLP tasks. Our experiments cover 7 benchmark datasets that include text classification, named entity recognition, sentiment analysis, and question answering tasks. Additionally, this paper reduces the complexity of using BusinessBERT for other NLP applications by making it freely available as a pretrained language model to the business community.
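To illustrate the industry classification (IC) objective, the following is a minimal sketch of how a document-level classification head can be attached to the encoder's pooled [CLS] representation and trained jointly with masked language modeling. The head, the number of industry classes, and the loss combination are illustrative assumptions, not the released pretraining code:

import torch
import torch.nn as nn
from transformers import AutoModel

class IndustryClassificationHead(nn.Module):
    """Illustrative IC head: linear classifier over the pooled [CLS] vector."""
    def __init__(self, hidden_size, num_industries):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_industries)

    def forward(self, pooled_output):
        # pooled_output: (batch_size, hidden_size)
        return self.classifier(pooled_output)

encoder = AutoModel.from_pretrained("bert-base-uncased")
ic_head = IndustryClassificationHead(encoder.config.hidden_size, num_industries=10)

# During pretraining, the total loss would combine the MLM loss with the IC loss, e.g.
# loss = mlm_loss + nn.functional.cross_entropy(ic_head(pooled_output), industry_labels)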
The benchmark consists of business-related NLP tasks structured in the following categories:
Text classification
- Risk: Financial risk classification based on corporate disclosures. Link
- News: Topic classification based on news headlines. Link
Named Entity Recognition
- SEC filings: NER based on financial agreements. Link
Sentiment Analysis
- FiQA: Predict a continuous sentiment score for microblog messages, news statements, or headlines. Run data/fiqa/build_fiqa.py to combine the data parts and create data/fiqa/train.json (a sketch of this step follows the task list below). Link or Direct Download
- Financial Phrasebank: Sentiment classification based on financial news. Link
- StockTweets: Sentiment classification based on tweets using tags like "#SPX500" and "#stocks". Link
Question Answering
- FinQA: Generative question answering based on earnings reports of S&P 500 companies. Link
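As a reference for the FiQA preprocessing step, here is a minimal sketch of what data/fiqa/build_fiqa.py plausibly does: merge the two task 1 training files into a single train.json. This assumes each file is a JSON object keyed by example id; the actual script may apply additional preprocessing.

import json

parts = [
    "data/fiqa/task1_headline_ABSA_train.json",
    "data/fiqa/task1_post_ABSA_train.json",
]

combined = {}
for path in parts:
    with open(path, encoding="utf-8") as f:
        # Assumption: each part is a JSON object keyed by example id
        combined.update(json.load(f))

with open("data/fiqa/train.json", "w", encoding="utf-8") as f:
    json.dump(combined, f)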
Run makfolder.sh to create the folder structure below.
BusinessBERT
├───data
│ ├───finphrase # obsolete, load data directly from https://huggingface.co/datasets
│ ├───fiqa
│ │ task1_headline_ABSA_train.json
│ │ task1_post_ABSA_train.json
│ │ build_fiqa.py
│ │ train.json
│ │
│ ├───news # obsolete, load data directly from https://huggingface.co/datasets
│ ├───risk
│ │ groundTruth.dat
│ │
│ ├───secfilings
│ │ test.txt
│ │ train.txt
│ │ valid.txt
│ │
│ └───stocktweets
│ tweets_clean.csv
│
└───tasks
finphrase.py
fiqa.py
news.py
risk.py
secfilings.py
stocktweets.py
__init__.py
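For reference, a Python equivalent of makfolder.sh might look as follows, assuming the script only creates the empty directory layout shown above (the data files themselves are added separately):

import os

folders = [
    "data/finphrase",
    "data/fiqa",
    "data/news",
    "data/risk",
    "data/secfilings",
    "data/stocktweets",
    "tasks",
]
for folder in folders:
    os.makedirs(folder, exist_ok=True)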
The business NLP benchmark results can be replicated using the run_benchmark.sh script. Note that the FinQA dataset and corresponding code are available at https://github.com/czyssrs/finqa
for task in "risk" "news" "secfilings" "fiqa" "finphrase" "stocktweets"
do
for model in "pborchert/BusinessBERT" "bert-base-uncased" "ProsusAI/finbert" "yiyanghkust/finbert-pretrain"
do
for seed in 42
do
python businessbench.py \
--task_name $task \
--model_name $model \
--seed $seed
done
done
done
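A single configuration can also be run directly, for example fine-tuning BusinessBERT on the risk classification task with the default seed:

python businessbench.py \
    --task_name risk \
    --model_name pborchert/BusinessBERT \
    --seed 42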
The batch size and gradient accumulation parameters are chosen so that the experiments fit on an NVIDIA RTX 4000 (8 GB) GPU.
This work is licensed under a Creative Commons Attribution 4.0 International License.