
Enhance model outputs by generating multiple responses at higher temperatures to encourage creativity, then using your scoring system to select the best candidates. This "search-then-prune" approach turns model output variance from a weakness into a strength. Higher temperature broadens the sample space, producing strong new candidates alongside many noisy ones; by then selecting only the best, you keep control over quality while still benefiting from the broader exploration. The technique effectively converts incremental compute into incremental performance: sampling more candidates increases the likelihood of seeing a better result, with the tradeoff being increased latency and cost.
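To make the loop concrete, here is a minimal sketch in Python. The `generate` and `score` callables are assumptions standing in for your model call and your scoring system; they are not any particular library's API.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],  # assumed model call: (prompt, temperature) -> text
    score: Callable[[str], float],          # assumed scoring system: higher is better
    n: int = 8,
    temperature: float = 1.0,
) -> str:
    """Search then prune: sample n candidates at a creative temperature,
    then keep only the highest-scoring one."""
    candidates = [generate(prompt, temperature) for _ in range(n)]
    return max(candidates, key=score)
```

Because the candidates are independent, the n generation calls can also be issued concurrently, which trades extra cost for latency rather than paying both.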
The technical name for this technique is "rejection sampling," and it belongs to a broader family of "search-then-prune" techniques that are becoming increasingly important in AI systems: first expand the solution space, then efficiently prune it down to the best candidates. It is an excellent way to raise response quality without retraining the model, and understanding these principles opens the door to a whole class of advanced techniques built on inference-time scaling and active learning.
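The "rejection" in the name is literal: instead of keeping the best of a fixed batch, you can reject any candidate that falls below a quality threshold and resample until one passes. A sketch of that variant, reusing the assumed `generate` and `score` callables from above:

```python
from typing import Callable, Optional

def rejection_sample(
    prompt: str,
    generate: Callable[[str, float], str],
    score: Callable[[str], float],
    threshold: float,            # assumed minimum acceptable quality score
    temperature: float = 1.0,
    max_attempts: int = 20,      # cap on compute spent searching
) -> Optional[str]:
    """Resample until a candidate clears the quality bar, pruning everything below it."""
    best, best_score = None, float("-inf")
    for _ in range(max_attempts):
        candidate = generate(prompt, temperature)
        s = score(candidate)
        if s >= threshold:
            return candidate           # accepted: quality bar met
        if s > best_score:             # track the best rejected candidate as a fallback
            best, best_score = candidate, s
    return best  # none passed; fall back to the best candidate seen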
Apply this technique when output quality is critical and you can afford additional latency or compute costs. It's particularly valuable in creative tasks where you want to encourage novel outputs while maintaining quality standards. For example, in an AI copywriting system, you might generate 10 headlines with higher temperature (encouraging creativity) and then select the one that best balances originality with brand voice according to your scoring system. Similarly, for complex reasoning tasks, generating multiple solution attempts and selecting the most logically sound one can significantly improve reliability. Start with small ensemble sizes (n=5-10) and scale up based on your quality requirements and latency constraints.
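A usage sketch for the copywriting example, built on the `best_of_n` helper above. The headline generator and brand-voice scorer here are hypothetical placeholders, not a real client or scoring API; swap in your own.

```python
import random

# Hypothetical stand-ins for illustration only: replace with your real
# model client and your real originality/brand-voice scorer.
def generate_headline(prompt: str, temperature: float) -> str:
    return f"Draft headline {random.randint(1, 999)}: {prompt}"

def brand_voice_score(headline: str) -> float:
    return random.random()  # replace with your scoring system

best = best_of_n(
    "Announce the launch of our new espresso machine",
    generate=generate_headline,
    score=brand_voice_score,
    n=10,             # the ensemble size from the example above
    temperature=1.2,  # higher than default to encourage creativity
)
print(best)
```

Starting at n=10 matches the guidance above; raise it only if the score of your best candidate is still improving as you sample more.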