Financial News Classifier and Sentiment Analysis (LLM & FinBERT)
Text cleaning includes the following steps (but is not limited to them); a minimal sketch covering several of these appears after the list:
- Lowercasing
- Removal of punctuation
- Removal of stopwords
- Removal of frequent words
- Removal of very rare words
- Stemming
- Lemmatization
- Removal of emojis
- Removal of emoticons
- Conversion of emoticons to words
- Conversion of emojis to words
- Use of regular expressions (removal of URLs, HTML tags, phone numbers, email IDs, etc.)
- Chat words conversion
- Spelling correction
- Removal of non-English words
Example Notebook
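Below is a minimal sketch of a few of these steps using Python's `re`, `string`, and NLTK. The cleaning order, the sample headline, and the `clean_text` helper are illustrative assumptions, not a prescribed pipeline.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Illustrative helper covering a subset of the steps listed above."""
    text = text.lower()                                # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"<.*?>", "", text)                  # remove HTML tags
    text = re.sub(r"\S+@\S+", "", text)                # remove email IDs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]          # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                # lemmatization
    return " ".join(tokens)

# Made-up headline, used only to exercise the function.
print(clean_text("Stocks RALLIED after the Fed's statement! Details: https://example.com <b>read more</b>"))
```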
For topic modeling we can use the following (a plain-LDA sketch follows this list):
- LDA
- Guided (Seeded) LDA
- Anchored CorEx
Example Notebook
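As a starting point, here is a minimal sketch of plain LDA with scikit-learn; Guided (Seeded) LDA and Anchored CorEx need their own libraries and are not shown. The toy corpus of financial headlines and the two-topic setting are assumptions made purely for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of made-up financial headlines.
docs = [
    "fed raises interest rates to fight inflation",
    "tech stocks rally as earnings beat expectations",
    "oil prices climb on supply concerns",
    "central bank signals further rate hikes",
    "chipmaker shares surge after strong quarterly earnings",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# n_components=2 is arbitrary for this tiny example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {idx}: {', '.join(top)}")
```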
- Text summarization with a Seq2Seq model
- Text summarization with transformer architectures
- Text summarization with Hugging Face Transformers
Example Notebooks
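For the Hugging Face route, a minimal sketch using the `transformers` summarization pipeline is shown below. The checkpoint `sshleifer/distilbart-cnn-12-6` and the sample article are illustrative assumptions; any summarization checkpoint works.

```python
from transformers import pipeline

# Model choice is an assumption; swap in any summarization checkpoint.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Made-up article used only for demonstration.
article = (
    "The central bank left its benchmark rate unchanged on Wednesday, "
    "citing cooling inflation, but warned that further hikes remain "
    "possible if price pressures return later in the year."
)

print(summarizer(article, max_length=40, min_length=10, do_sample=False)[0]["summary_text"])
```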
Official Libraries
First-party cool stuff made with love by Hugging Face.
- transformers – State-of-the-art natural language processing for Jax, PyTorch and TensorFlow.
- datasets – The largest hub of ready-to-use NLP datasets for ML models with fast, easy-to-use, and efficient data manipulation tools.
- tokenizers – Fast state-of-the-art tokenizers optimized for research and production.
- knockknock – Get notified when your training ends with only two additional lines of code.
- accelerate – A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision.
- autonlp – Train state-of-the-art natural language processing models and deploy them in a scalable environment automatically.
- nn_pruning – Prune a model while finetuning or training.
- huggingface_hub – Client library to download and publish models and other files on the huggingface.co hub.
- tune – A benchmark for comparing Transformer-based models.
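As a taste of how two of these libraries fit together, here is a minimal, hedged sketch that tokenizes a slice of a dataset with `datasets` and `transformers`. The `imdb` dataset and `distilbert-base-uncased` checkpoint are illustrative choices, not project requirements.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Dataset and checkpoint names are assumptions for illustration only.
dataset = load_dataset("imdb", split="train[:100]")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate to a fixed length so examples can be batched.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)
print(tokenized[0]["input_ids"][:10])
```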
Tutorials
Learn how to use Hugging Face toolkits, step-by-step.
- Official Course (from Hugging Face) – The official course series provided by Hugging Face.
- transformers-tutorials (by @nielsrogge) – Tutorials for applying multiple models on real-world datasets.
NLP Toolkits
NLP toolkits built upon Transformers. Swiss Army!
- AllenNLP (from AI2) – An open-source NLP research library.
- Graph4NLP – Enabling easy use of Graph Neural Networks for NLP.
- Lightning Transformers – Transformers with PyTorch Lightning interface.
- Adapter Transformers – Extension to the Transformers library, integrating adapters into state-of-the-art language models.
- Obsei – A low-code AI workflow automation tool that performs various NLP tasks in the workflow pipeline.
- Trapper (from OBSS) – State-of-the-art NLP through transformer models in a modular design and consistent APIs.
Text Representation
Converting a sentence to a vector.
- Sentence Transformers (from UKPLab) – Widely used encoders computing dense vector representations for sentences, paragraphs, and images.
- WhiteningBERT (from Microsoft) – An easy unsupervised sentence embedding approach with whitening.
- SimCSE (from Princeton) – State-of-the-art sentence embedding with contrastive learning.
- DensePhrases (from Princeton) – Learning dense representations of phrases at scale.
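To make "sentence to vector" concrete, here is a minimal sketch with Sentence Transformers. The `all-MiniLM-L6-v2` checkpoint and the example sentences are assumptions chosen for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# Checkpoint name is an assumption; any Sentence Transformers model works.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Shares fell sharply after the earnings miss.",
    "The stock dropped following disappointing results.",
    "The weather was sunny all weekend.",
]
embeddings = model.encode(sentences)

# Cosine similarity: the first two sentences should score highest.
print(util.cos_sim(embeddings[0], embeddings[1]))
print(util.cos_sim(embeddings[0], embeddings[2]))
```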
Inference Engines
Highly optimized inference engines implementing Transformers-compatible APIs.
- TurboTransformers (from Tencent) – An inference engine for transformers with a fast C++ API.
- FasterTransformer (from Nvidia) – A script and recipe to run the highly optimized transformer-based encoder and decoder component on NVIDIA GPUs.
- lightseq (from ByteDance) – A high-performance inference library for sequence processing and generation, implemented in CUDA.
- FastSeq (from Microsoft) – Efficient implementations of popular sequence models (e.g., BART, ProphetNet) for text generation, summarization, translation, etc.
Model Scalability
Parallelizing models across multiple GPUs.
- Parallelformers (from TUNiB) – A library for model parallel deployment.
- OSLO (from TUNiB) – A library that supports various features to help you train large-scale models.
- DeepSpeed (from Microsoft) – DeepSpeed ZeRO scales to any model size with little to no change to the model. Integrated with the HF Trainer.
- fairscale (from Facebook) – Also implements the ZeRO protocol. Integrated with the HF Trainer.
- ColossalAI (from Hpcaitech) – A Unified Deep Learning System for Large-Scale Parallel Training (1D, 2D, 2.5D, 3D and sequence parallelism, and ZeRO protocol).
Model Compression/Acceleration
Compressing or accelerating models for improved inference speed.
- torchdistill – PyTorch-based modular, configuration-driven framework for knowledge distillation.
- TextBrewer (from HFL) – State-of-the-art distillation methods to compress language models.
- BERT-of-Theseus (from Microsoft) – Compressing BERT by progressively replacing the components of the original BERT.
Adversarial Attack
Conducting adversarial attacks to test model robustness.
- TextAttack (from UVa) – A Python framework for adversarial attacks, data augmentation, and model training in NLP.
- TextFlint (from Fudan) – A unified multilingual robustness evaluation toolkit for NLP.
- OpenAttack (from THU) – An open-source textual adversarial attack toolkit.
Style Transfer
Transfer the style of text! Now you know why it's called a transformer?
- Styleformer – A neural language style transfer framework to transfer text smoothly between styles.
- ConSERT – A contrastive framework for self-supervised sentence representation transfer.
Sentiment Analysis
Analyzing the sentiment and emotions of human beings.
- conv-emotion – Implementation of different architectures for emotion recognition in conversations.
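Since this project targets financial news sentiment (see the title), a minimal FinBERT sketch may be useful. `ProsusAI/finbert` is one publicly available FinBERT checkpoint on the Hugging Face Hub, chosen here purely for illustration, and the headlines are made up.

```python
from transformers import pipeline

# Checkpoint choice is an assumption; ProsusAI/finbert is one public FinBERT.
classifier = pipeline("text-classification", model="ProsusAI/finbert")

# Made-up headlines used only to exercise the classifier.
headlines = [
    "Company beats earnings expectations and raises full-year guidance.",
    "Regulator fines the bank over compliance failures.",
]
for h in headlines:
    print(h, "->", classifier(h))
```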
Grammatical Error Correction
You made a typo! Let me correct it.
- Gramformer – A framework for detecting, highlighting, and correcting grammatical errors on natural language text.
Translation
Translating between different languages.
- dl-translate – A deep learning-based translation library based on HF Transformers.
- EasyNMT (from UKPLab) – Easy-to-use, state-of-the-art translation library and Docker images based on HF Transformers.
Knowledge and Entity
Learning knowledge, mining entities, connecting the world.
- PURE (from Princeton) – Entity and relation extraction from text.
Speech
Speech processing powered by HF libraries. Need for speech!
- s3prl – A self-supervised speech pre-training and representation learning toolkit.
- speechbrain – A PyTorch-based speech toolkit.
Multi-modality
Understanding the world from different modalities.
- ViLT (from Kakao) – A vision-and-language transformer without convolution or region supervision.
Reinforcement Learning
Combining RL magic with NLP!
- trl – Fine-tune transformers using Proximal Policy Optimization (PPO) to align with human preferences.
Question Answering
Searching for answers? Transformers to the rescue!
- Haystack (from deepset) – End-to-end framework for developing and deploying question-answering systems in the wild.
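Haystack wraps retrieval, readers, and deployment end to end; the sketch below shows only the extractive-QA core using a plain `transformers` pipeline (not the Haystack API). The checkpoint and context passage are illustrative assumptions.

```python
from transformers import pipeline

# A plain transformers QA pipeline; model choice is an assumption.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

# Made-up context passage for demonstration.
context = (
    "The company reported quarterly revenue of 4.2 billion dollars, "
    "up 12 percent year over year, driven by its cloud division."
)
print(qa(question="What drove revenue growth?", context=context))
```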
Recommender Systems
I think this is just right for you!
- Transformers4Rec (from Nvidia) – A flexible and efficient library powered by Transformers for sequential and session-based recommendations.
Evaluation
Evaluating NLP outputs powered by HF datasets!
- Jury (from OBSS) – An easy-to-use tool for evaluating NLP model outputs, specifically for NLG (natural language generation), offering various automated text-to-text metrics.
Neural Search
Search, but with the power of neural networks!
- Jina Integration – Jina integration of Hugging Face Accelerated API.
- Weaviate Integration (text2vec) (QA) – Weaviate integration of Hugging Face Transformers.
- ColBERT (from Stanford) – A fast and accurate retrieval model, enabling scalable BERT-based search over large text collections in tens of milliseconds.
Cloud
Cloud makes your life easy!
- Amazon SageMaker – Making it easier than ever to train Hugging Face Transformer models in Amazon SageMaker.
Hardware
The infrastructure enabling the magic to happen.
- Qualcomm – Collaboration on enabling Transformers in Snapdragon.
- Intel – Collaboration with Intel for configuration options.
NOTE: This list of resources is taken entirely from the Hugging Face repository; I am just sharing it here as a wiki. Please visit the repository for more.