Researchers Boost AI Training Speeds by Optimizing Processor Use

Key Takeaways

1. Training large language models with reinforcement learning takes significant computing power, and the rollout phase, in which the model generates candidate answers, can take up to 85% of execution time.
2. The “Taming the Long Tail” (TLT) system uses an adaptive drafter model that trains on otherwise idle processors and quickly predicts the larger model's outputs.
3. TLT keeps updating its drafter throughout training without requiring extra computational resources.
4. An integrated adaptive rollout engine improves efficiency by keeping a memory-efficient collection of previously captured graphs and choosing the best decoding method for each new set of inputs.
5. TLT speeds up end-to-end training by 70-110% compared with leading systems, while preserving accuracy and yielding a high-quality draft model as a by-product.


Developing large language models that can reason, write advanced code, and plan over multiple steps takes a lot of computing power. In the usual reinforcement learning setup, these models generate many candidate answers to find the best one. This phase, known as rollout, can take up to 85% of total execution time. It becomes a bottleneck because processors that finish short answers must sit idle until the processors working on the longest answers are done.
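A toy calculation shows why a few long answers dominate the rollout phase. The sketch below is purely illustrative: the function name and the response lengths are invented, and it assumes each processor generates one answer and all of them wait for the slowest.

```python
from typing import Sequence


def idle_fraction(response_lengths: Sequence[int]) -> float:
    """Share of total processor time spent waiting for the longest response.

    Assumes one response per processor and that generation time is
    proportional to response length (both are simplifications).
    """
    longest = max(response_lengths)
    total_time = longest * len(response_lengths)  # every processor is held until the end
    busy_time = sum(response_lengths)             # time actually spent generating
    return (total_time - busy_time) / total_time


# Hypothetical batch: four short answers and one very long one.
lengths = [500, 500, 500, 500, 4000]
print(idle_fraction(lengths))  # 0.7
```

In this made-up batch, 70% of the available processor time is spent waiting, which is the kind of slack TLT aims to put to use.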

New System to Fix Delays

To tackle this problem, a team from the Massachusetts Institute of Technology, along with partners from industry and academia, has created a system called “Taming the Long Tail” (TLT). The method uses an adaptive drafter model that keeps training on processors that would otherwise sit idle. This smaller model quickly predicts the upcoming outputs of the larger target model, which then verifies all of those predictions at once, a technique known as speculative decoding.
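The draft-then-verify idea can be sketched in a few lines. This is a minimal illustration, not the TLT implementation: the two "models" are hypothetical toy functions, acceptance is a simple greedy match (production systems typically use a probabilistic acceptance rule), and the target's single parallel verification pass is simulated with a loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The cheap drafter proposes k tokens one after another.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target checks every proposed position. A real system does this
    #    in one batched forward pass; the loop here only simulates it.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # keep the target's token at the first mismatch
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the same pass yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(context: List[Token], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> List[Token]:
    """Generate n_tokens after the context using repeated speculative steps."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(context) + n_tokens]


# Hypothetical toy models: the target counts upward modulo 10,
# and the drafter agrees except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(generate([1], toy_draft, toy_target, k=4, n_tokens=12))
```

With greedy matching, the output is identical to what the target would have produced on its own; the drafter only changes how many target passes are needed. This also suggests why a stale drafter hurts: the more often its guesses diverge from the target, the fewer tokens each round accepts.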

Continuous Improvement

Typical speculative decoding relies on a fixed drafter, which quickly becomes outdated as training updates change the target model. TLT instead keeps adjusting the drafter throughout training, without needing extra computational resources. An integrated adaptive rollout engine improves the process further by keeping a memory-efficient collection of previously captured graphs and choosing the best decoding method for each new set of inputs.

Tests on several reasoning models show that TLT speeds up end-to-end training by 70-110% compared with leading systems. It preserves the original accuracy and yields a high-quality draft model as a free by-product, making the approach an efficient way to reduce the energy use and cost of building advanced artificial intelligence systems.
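To put the reported range in terms of wall-clock time, the short sketch below converts a percentage speedup into the share of baseline training time that remains. It assumes that "X% faster" means throughput of (1 + X/100) times the baseline, which is an interpretation rather than something the article states; the function name is illustrative.

```python
def remaining_time_fraction(speedup_percent: float) -> float:
    """Fraction of baseline time needed, assuming X% faster means (1 + X/100)x throughput."""
    return 1.0 / (1.0 + speedup_percent / 100.0)


for pct in (70, 110):
    print(f"{pct}% faster -> {remaining_time_fraction(pct):.0%} of baseline training time")
```

Under that assumption, a training run would take roughly 59% to 48% of the baseline time.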



 
