
Not all Llama 2s are Created Equal

Llama 2 models from different providers like Together, Anyscale, and Perplexity may look identical on paper, but the same query against the same Llama 2 model can yield different responses depending on the inference provider.
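If you want to see this for yourself, here's a rough sketch of sending the identical prompt to two OpenAI-compatible Llama 2 endpoints and printing the outputs side by side. The base URLs, model slugs, and environment variable names below are my assumptions, so swap in whatever your providers actually expose.

```python
# Rough sketch: run one prompt against two OpenAI-compatible Llama 2 endpoints
# and eyeball the differences. Endpoints, model slugs, and env var names are
# assumptions, not anything official.
import os
from openai import OpenAI

PROMPT = "Explain the difference between a mutex and a semaphore in two sentences."

providers = {
    "together": {
        "base_url": "https://api.together.xyz/v1",            # assumed endpoint
        "api_key": os.environ["TOGETHER_API_KEY"],             # assumed env var
        "model": "meta-llama/Llama-2-70b-chat-hf",             # assumed model slug
    },
    "anyscale": {
        "base_url": "https://api.endpoints.anyscale.com/v1",   # assumed endpoint
        "api_key": os.environ["ANYSCALE_API_KEY"],             # assumed env var
        "model": "meta-llama/Llama-2-70b-chat-hf",             # assumed model slug
    },
}

for name, cfg in providers.items():
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # keep sampling as deterministic as possible so provider-side differences stand out
        max_tokens=128,
    )
    print(f"--- {name} ---\n{resp.choices[0].message.content}\n")
```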

Why? It often comes down to how the provider serves the model. For instance, some may use quantization to make the model run faster and consume fewer resources, but this can subtly alter the quality of the output.
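For anyone curious what that looks like in practice, here's a minimal sketch using Hugging Face transformers with bitsandbytes: it loads the same checkpoint once in fp16 and once in 4-bit NF4, then generates from both. It assumes a GPU, the accelerate and bitsandbytes packages, and access to the gated Llama 2 weights; the prompt and settings are just placeholders.

```python
# Illustrative only: how a host *could* serve the same checkpoint in full
# precision vs. 4-bit quantized. This is not a description of any specific
# provider's stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-precision (fp16) load -- closest to the "vanilla" reference behavior.
full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 4-bit NF4 quantized load -- smaller and faster, but the logits shift slightly,
# which is exactly the kind of subtle output drift discussed above.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "Summarize the theory of relativity in one sentence."
for name, model in [("fp16", full), ("4-bit", quantized)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {name} ---\n{tokenizer.decode(out[0], skip_special_tokens=True)}\n")
```

Diffing the two outputs on a handful of prompts you care about is a quick way to check whether quantization actually changes anything that matters for your use case.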

I recently read this blog post by Together AI about their inference engine, and, more importantly, their chart comparing their inference performance to the vanilla HuggingFace implementation.

Here's a snippet from their blog:
"The improvements to performance with the Together Inference Engine come without any compromise to quality. These changes do not involve techniques like quantization which can change the behavior of the model, even if in a modest way."

This got me thinking:
  • How do these subtle differences impact our work?
  • How does this affect your choice of provider for Llama 2 models?
Would love to hear your thoughts on this!
Attachment: 6552504f864523290a4fd7ac_Quality.png
2 comments
Is it safe to assume the vanilla version is the one on HuggingFace?