Hey guys, let's say I want to serve a self-hosted Llama 3.1 405B on 2 different VMs with H100s. I use vLLM + an ngrok alternative, btw.
I have tested load-balancing mode and it works fine with this config:
{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "provider": "openai",
      "custom_host": "https://llama-1.tunnels-dev.io.systems/v1",
      "api_key": "Bearer dummy-api-key-vm1k",
      "weight": 0.4,
      "override_params": {
        "model": "meta-llama/Llama-3.1-405B-FP8"
      }
    },
    {
      "provider": "openai",
      "custom_host": "https://llama-2.tunnels-dev.io.systems/v1",
      "api_key": "Bearer dummy-api-key-vm2",
      "weight": 0.6,
      "override_params": {
        "model": "meta-llama/Llama-3.1-405B-FP8"
      }
    }
  ],
  "cache": {
    "mode": "simple",
    "max_age": 60000
  },
  "retry": {
    "attempts": 3,
    "on_status_codes": [
      404,
      429,
      500,
      520
    ]
  }
}
But I want to achieve one more goal here: in case all retries to llama-1 (target 0) fail, I need to fall back to target 1. Basically I want something like a loadbalance-fallback mode. From the docs it looks like there is a hack for this (see my sketch below), but are there any better ways?
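To make the question concrete, here is roughly the hack I had in mind, assuming the gateway allows a target to carry its own nested strategy and targets (I haven't verified this against my gateway version, so treat it as a sketch): target 0 becomes a nested fallback group that tries llama-1 first and llama-2 second, while target 1 stays a plain llama-2 target. The hosts, keys, and model name are just the ones from my config above; I dropped the cache block to keep the example short.

{
  "strategy": {
    "mode": "loadbalance"
  },
  "targets": [
    {
      "strategy": {
        "mode": "fallback"
      },
      "weight": 0.4,
      "targets": [
        {
          "provider": "openai",
          "custom_host": "https://llama-1.tunnels-dev.io.systems/v1",
          "api_key": "Bearer dummy-api-key-vm1k",
          "override_params": {
            "model": "meta-llama/Llama-3.1-405B-FP8"
          }
        },
        {
          "provider": "openai",
          "custom_host": "https://llama-2.tunnels-dev.io.systems/v1",
          "api_key": "Bearer dummy-api-key-vm2",
          "override_params": {
            "model": "meta-llama/Llama-3.1-405B-FP8"
          }
        }
      ]
    },
    {
      "provider": "openai",
      "custom_host": "https://llama-2.tunnels-dev.io.systems/v1",
      "api_key": "Bearer dummy-api-key-vm2",
      "weight": 0.6,
      "override_params": {
        "model": "meta-llama/Llama-3.1-405B-FP8"
      }
    }
  ],
  "retry": {
    "attempts": 3,
    "on_status_codes": [
      404,
      429,
      500,
      520
    ]
  }
}

If nested targets work the way I assume, the weights still split traffic 40/60 at the top level, and the fallback only kicks in for requests that were routed to the llama-1 group and exhausted their retries. The obvious downside is that llama-2 appears twice in the config, which is why I'm asking whether there's a cleaner way.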