
Not all Llama 2s are Created Equal

Llama 2 models from different providers like Together, Anyscale, and Perplexity may look identical on paper, but the same query against the same Llama 2 model can yield different responses depending on the inference provider.
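If you want to see this for yourself, here's a rough sketch of sending the identical prompt to two OpenAI-compatible Llama 2 endpoints and printing the outputs side by side. The base URLs, model slugs, and environment variable names below are my assumptions, so swap in whatever your providers actually expose.

```python
# Rough sketch: run one prompt against two OpenAI-compatible Llama 2 endpoints
# and eyeball the differences. Endpoints, model slugs, and env var names are
# assumptions, not anything official.
import os
from openai import OpenAI

PROMPT = "Explain the difference between a mutex and a semaphore in two sentences."

providers = {
    "together": {
        "base_url": "https://api.together.xyz/v1",            # assumed endpoint
        "api_key": os.environ["TOGETHER_API_KEY"],             # assumed env var
        "model": "meta-llama/Llama-2-70b-chat-hf",             # assumed model slug
    },
    "anyscale": {
        "base_url": "https://api.endpoints.anyscale.com/v1",   # assumed endpoint
        "api_key": os.environ["ANYSCALE_API_KEY"],             # assumed env var
        "model": "meta-llama/Llama-2-70b-chat-hf",             # assumed model slug
    },
}

for name, cfg in providers.items():
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # keep sampling as deterministic as possible so provider-side differences stand out
        max_tokens=128,
    )
    print(f"--- {name} ---\n{resp.choices[0].message.content}\n")
```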

Why? It often comes down to how the provider serves the model. For instance, some may use quantization to make the model run faster and consume fewer resources, but this can subtly alter the quality of the output.
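For anyone curious what that looks like in practice, here's a minimal sketch using Hugging Face transformers with bitsandbytes: it loads the same checkpoint once in fp16 and once in 4-bit NF4, then generates from both. It assumes a GPU, the accelerate and bitsandbytes packages, and access to the gated Llama 2 weights; the prompt and settings are just placeholders.

```python
# Illustrative only: how a host *could* serve the same checkpoint in full
# precision vs. 4-bit quantized. This is not a description of any specific
# provider's stack.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-precision (fp16) load -- closest to the "vanilla" reference behavior.
full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 4-bit NF4 quantized load -- smaller and faster, but the logits shift slightly,
# which is exactly the kind of subtle output drift discussed above.
quant_cfg = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
quantized = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_cfg, device_map="auto"
)

prompt = "Summarize the theory of relativity in one sentence."
for name, model in [("fp16", full), ("4-bit", quantized)]:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(f"--- {name} ---\n{tokenizer.decode(out[0], skip_special_tokens=True)}\n")
```

Diffing the two outputs on a handful of prompts you care about is a quick way to check whether quantization actually changes anything that matters for your use case.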

I recently read this blog post by Together AI about their inference engine, and, more importantly, their chart comparing their inference performance to the vanilla HuggingFace implementation.

Here's a snippet from their blog:
"The improvements to performance with the Together Inference Engine come without any compromise to quality. These changes do not involve techniques like quantization which can change the behavior of the model, even if in a modest way."

This got me thinking:
  • How do these subtle differences impact our work?
  • How does this affect your choice of provider for Llama 2 models?
Would love to hear your thoughts on this!
Attachment: 6552504f864523290a4fd7ac_Quality.png
2 comments
Is it safe to assume the vanilla version is the one on HuggingFace?