Business runs on communications. Customers reach out when they need something.
Colleagues connect to get work done. At Re:infer, our mission is to
fundamentally change the economics of service work in the enterprise—to unlock
the value in every interaction and make service efficient and scalable. We do
this by democratising access to state-of-the-art natural language processing (NLP)
and natural language understanding (NLU).
Specifically, Re:infer models use deep learning architectures called
Transformers.
Transformers facilitate huge improvements in NLU performance. However, they are
also highly compute-intensive, both when training the models to learn new concepts
and when using them to make predictions. This two-part series will look at multiple
techniques to increase the speed and reduce the compute cost of using these
large Transformer architectures.
This post will:
- Present a brief history of embedding models in NLP.
- Explain why the Transformer’s self-attention mechanism has a high
computational workload.
- Review modifications to the traditional Transformer architecture that make it more
computationally efficient to train and run without significantly compromising
performance.
The next post will look at additional computational and approximation techniques
that yield further efficiency gains. Specifically, it will:
- Explore distillation techniques, where smaller models are trained to
approximate the performance of larger models.
- Explain efficient fine-tuning techniques, where updates are restricted to a
subset of the model's parameters.
- Provide our recommendations for when to use each of these methods.