portkey-python-sdk/examples/loadbalance_...

Hello @Vrushank | Portkey not using the SDK, directly using the Models in portkey, is it not supported for it?

Yes, it's supported!

Here's how you can do it:

Step 1: Go to the Configs page on the Portkey dashboard and create new config

Step 2: Add this snippet there, and change the API keys with your OpenAI API keys for both the different accounts:

Plain Text

{
    "retry": 5,
    "cache": "simple",
    "mode": "loadbalance",
    "options": [
        {
            "provider": "openai",
            "api_key": "sk-xxx",
            "weight": 0.5
        },
        {
            "provider": "openai",
            "api_key": "sk-xxx",
            "weight": 0.5
        }
    ]
}

Step 3: In your API request, add another header "x-portkey-config" and add the config key there. It would look something like this:

Plain Text

curl -X POST \
  https://api.portkey.ai/v1/prompts/<PROMPT_ID>/generate \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -H "x-portkey-config: pc-xxx-000"

Configs are a simple way of setting all the Portkey settings you want - so you can enable load balancing, fallbacks, caching etc through the Configs and then pass the config id header "x-portkey-config: <CONFIG_ID>" while making your request.

Check out the detailed docs on Configs: https://docs.portkey.ai/portkey-docs/portkey-features/ai-gateway/configs

Thanks @Vrushank | Portkey Actually one issue I am facing is, that we dont require cache, how can we remove that parameter as it shows missing argument

cache & retry are required currently - that's a bug and the team is pushing a fix for that. Will update you here once it's resolved!

Okay, how does simple cache work?

if lets say my prompt changes a bit, would it still retrieve the same result from cache or generate a new one?

simple cache only serves cached responses for exactly verbatim similar request body & queries.

if there's any change in the prompt (including stuff like extra space, punctuation marks etc.), it will fetch a new resposne

got it, makes sense, so the entire prompt which could be lets say 2000-3000 tokens need to be matched, if found similar, it will use the already cached response, also does it affect the latency?

Exactly! It has no impact on the latency. We compare hashes which makes it extremely fast, and for the requests that do have caches stored, they are served 20x times faster compared to fetching them from OpenAI

Got it, makes sense

Also, for the costs/user metrics - you can filter on the user param in the Analytics page right now. But it would also be interesting to show more viz/data related to costs.

@deepanshu_11 curious to know what metrics around cost would be helpful for you

@Vrushank | Portkey is there a way to test if the loadbalance is working correctly?

Yes! Look for the "x-portkey-last-used-option-index" header in the JSON view of Response logs.

The value corresponds to the position of the LLM object as defined in the Config

Attachment

You can confirm that loadbalance is working based on how the values in that header cycle!

okay nice, awesome

Would also love to know if we have plans for keyword search on the generated response?

That's a great idea. It is a part of our roadmap, including querying the logs in natural language. Maybe a few weeks I think before we get to it ⚡

Hello @Vrushank | Portkey the 3rd step you mentioned here https://discord.com/channels/1143393887742861333/1159035076345073724/1159058739643555840 is somehow not sending the request with the key x-portkey-config and not giving the response with x-portkey-last-used-option-indexkey in headers

ah strange. taking to dm

Hey @deepanshu_11 . Apologies for the delay. Thanks for pointing out. We were not sending the header for models response. We will be adding the x-portkey-last-used-option-index header in the next couple of hours.

@visarg thanks for the prompt response, can you confirm if the functionality is working?

@visarg @Noble can you check the load balance issue as I have tried it but it does not seem to work yet?

@deepanshu_11 apologies for the delay here - can confirm that the fix should be live shortly.

Thanks so much, really appreciate it

Hey @deepanshu_11 - We have pushed a fix for the issue. You should now be able to do load balance in models using configs. You will also get the x-portkey-last-used-option-index header in the response itself. And it will also be visible in the logs (JSON format). Please try it out and let us know. Thanks for the patience.

PS: We are also modifying our logs UI soon to increase the visiblity of loadbalance, fallback, etc. and make it more readable. Post that, it will become easy to find details in the logs section.

Awesome, so good to hear. Yeah I had just noticed it few sec ago, amazing feature

One question I have is, currently one account has 40k tokens and another has 10k tokens limit, usually the OpenAI error we get is regarding token usage, now if I am load balancing, still I am seeing those Rate limits error, should I add some fallback mechanism along with load balance?

What is your loadbalance split in the above case?

Currently 60:40

I'm now thinking would fallback be a better solution in this case, meaning if rate limit error on first account occurs, can we move to another account?

Few points that might help in solving this with minimal efforts:

Is the higher limit key ever hitting the rate limit? If not, then you can try increasing the split and give more weightage to the higher limit key.

All these rates limits are mostly per minute limits. You can try increasing the retry count by 1 and let portkey do the heavylifting.

Makes sense, wanted to understand if fallback approach would be better than loadbalance here and can fallback work within same provider if OpenAI gives Ratelimit error?

Yes, you can tackle rate limits with fallbacks as well!

Currently, fallbacks + load balancing both together isn’t doable

Fallbacks and Load balancing also take “retry” values where you can set individual retry count for specific errors

Okay I think I will change the config to fallback, would we be able to know if fallback was used?

Same logic as load balancing - see the index in last-used-option-index

Quite interested in seeing if this helps solve your issue - please do keep sharing feedback!