Industry-sensitive Language Model for Business. The model is available on HuggingFace: https://huggingface.co/pborchert/BusinessBERT
from transformers import AutoModel
model = AutoModel.from_pretrained("pborchert/BusinessBERT")
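To obtain contextual embeddings, the tokenizer can be loaded in the same way. Below is a minimal usage sketch (the example sentence is illustrative):

from transformers import AutoModel, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("pborchert/BusinessBERT")
model = AutoModel.from_pretrained("pborchert/BusinessBERT")

# Encode a business-related sentence and run a forward pass without gradients
inputs = tokenizer("The company reported strong quarterly earnings.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)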
- Pretrained Transformer: BERT-Base architecture
- Trained on business communication extracted from:
  - Management Discussion and Analysis statements (CaltechDATA | MD&A)
  - Company website content (This study | CompanyWeb)
  - Scientific business literature (Semantic Scholar | S2ORC)
- Additional pretraining objective: Industry classification (IC), predicting the standard industry classification that a textual document originates from
- SOTA performance on business-related text classification, named entity recognition, and question answering benchmarks
We introduce BusinessBERT, a new industry-sensitive language model for business applications. The key novelty of our model lies in incorporating industry information to enhance decision-making in business-related natural language processing (NLP) tasks. BusinessBERT extends the Bidirectional Encoder Representations from Transformers (BERT) architecture by embedding industry information during pretraining through two innovative approaches that enable BusinessBERT to capture industry-specific terminology: (1) BusinessBERT is trained on business communication corpora totaling 2.23 billion tokens, consisting of company website content, MD&A statements, and scientific papers in the business domain; (2) we employ industry classification as an additional pretraining objective. Our results suggest that BusinessBERT improves data-driven decision-making by providing superior performance on business-related NLP tasks. Our experiments cover 7 benchmark datasets that include text classification, named entity recognition, sentiment analysis, and question answering tasks. Additionally, this paper reduces the complexity of using BusinessBERT for other NLP applications by making it freely available as a pretrained language model to the business community.
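To illustrate the industry classification (IC) objective, the following is a minimal sketch of how a document-level classification head can be attached to the encoder's pooled [CLS] representation and trained jointly with masked language modeling. The head, the number of industry classes, and the loss combination are illustrative assumptions, not the released pretraining code:

import torch
import torch.nn as nn
from transformers import AutoModel

class IndustryClassificationHead(nn.Module):
    """Illustrative IC head: linear classifier over the pooled [CLS] vector."""
    def __init__(self, hidden_size, num_industries):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_industries)

    def forward(self, pooled_output):
        # pooled_output: (batch_size, hidden_size)
        return self.classifier(pooled_output)

encoder = AutoModel.from_pretrained("bert-base-uncased")
ic_head = IndustryClassificationHead(encoder.config.hidden_size, num_industries=10)

# During pretraining, the total loss would combine the MLM loss with the IC loss, e.g.
# loss = mlm_loss + nn.functional.cross_entropy(ic_head(pooled_output), industry_labels)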
The benchmark consists of business-related NLP tasks structured in the following categories:
Text classification
- Risk: Financial risk classification based on corporate disclosures. Link
- News: Topic classification based on news headlines. Link
Named Entity Recognition
- SEC filings: NER based on financial agreements. Link
Sentiment Analysis
- FiQA: Predict a continuous sentiment score for microblog messages, news statements, or headlines. Run data/fiqa/build_fiqa.py to combine the data parts and create data/fiqa/train.json (a sketch of this step follows the task list below). Link or Direct Download
- Financial Phrasebank: Sentiment classification based on financial news. Link
- StockTweets: Sentiment classification based on tweets using tags like "#SPX500" and "#stocks". Link
Question Answering
- FinQA: Generative question answering based on earnings reports of S&P 500 companies. Link
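As a reference for the FiQA preprocessing step, here is a minimal sketch of what data/fiqa/build_fiqa.py plausibly does: merge the two task 1 training files into a single train.json. This assumes each file is a JSON object keyed by example id; the actual script may apply additional preprocessing.

import json

parts = [
    "data/fiqa/task1_headline_ABSA_train.json",
    "data/fiqa/task1_post_ABSA_train.json",
]

combined = {}
for path in parts:
    with open(path, encoding="utf-8") as f:
        # Assumption: each part is a JSON object keyed by example id
        combined.update(json.load(f))

with open("data/fiqa/train.json", "w", encoding="utf-8") as f:
    json.dump(combined, f)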
Run makfolder.sh to create the folder structure below.
BusinessBERT
├───data
│ ├───finphrase # obsolete, load data directly from https://huggingface.co/datasets
│ ├───fiqa
│ │ task1_headline_ABSA_train.json
│ │ task1_post_ABSA_train.json
│ │ build_fiqa.py
│ │ train.json
│ │
│ ├───news # obsolete, load data directly from https://huggingface.co/datasets
│ ├───risk
│ │ groundTruth.dat
│ │
│ ├───secfilings
│ │ test.txt
│ │ train.txt
│ │ valid.txt
│ │
│ └───stocktweets
│ tweets_clean.csv
│
└───tasks
finphrase.py
fiqa.py
news.py
risk.py
secfilings.py
stocktweets.py
__init__.py
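For reference, a Python equivalent of makfolder.sh might look as follows, assuming the script only creates the empty directory layout shown above (the data files themselves are added separately):

import os

folders = [
    "data/finphrase",
    "data/fiqa",
    "data/news",
    "data/risk",
    "data/secfilings",
    "data/stocktweets",
    "tasks",
]
for folder in folders:
    os.makedirs(folder, exist_ok=True)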
The business NLP benchmark results can be replicated using the run_benchmark.sh script. Note that the FinQA dataset and corresponding code are available at https://github.com/czyssrs/finqa
for task in "risk" "news" "secfilings" "fiqa" "finphrase" "stocktweets"
do
for model in "pborchert/BusinessBERT" "bert-base-uncased" "ProsusAI/finbert" "yiyanghkust/finbert-pretrain"
do
for seed in 42
do
python businessbench.py \
--task_name $task \
--model_name $model \
--seed $seed
done
done
done
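A single configuration can also be run directly, for example fine-tuning BusinessBERT on the risk classification task with the default seed:

python businessbench.py \
    --task_name risk \
    --model_name pborchert/BusinessBERT \
    --seed 42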
The batch size and gradient accumulation parameters are chosen so that the experiments fit on an NVIDIA RTX 4000 (8 GB) GPU.
This work is licensed under a Creative Commons Attribution 4.0 International License.