A few of our users occasionally spin up pods that do a lot of number crunching. The front end is a web app that queries the pod and waits for a response.
Some of these queries exceed the ingress's default 30s timeout, so I added an annotation to raise the timeout to 60s. Users still report occasional timeouts.
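For context, the change looks roughly like this (a sketch assuming the ingress-nginx controller; the ingress name and namespace are placeholders, and I'm showing it through the Kubernetes Python client purely for illustration):

```python
# Patch the ingress with longer proxy timeouts (ingress-nginx annotations, in seconds).
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
api = client.NetworkingV1Api()

patch = {
    "metadata": {
        "annotations": {
            "nginx.ingress.kubernetes.io/proxy-read-timeout": "60",
            "nginx.ingress.kubernetes.io/proxy-send-timeout": "60",
        }
    }
}

api.patch_namespaced_ingress(
    name="number-cruncher",   # placeholder ingress name
    namespace="default",      # placeholder namespace
    body=patch,
)
```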
I asked how long they need the timeout to be. They requested 1 hour.
This seems excessive. My gut feeling is that this will cause problems, but I don't know enough about ingress timeouts to know what will break. So, what is the worst-case scenario of 3-10 pods having 1-hour ingress timeouts?
I asked how long they need the timeout to be. They requested 1 hour.
That’s outright insane. Does this mean that if the connection has any kind of hiccup, all their work is lost?
Instead of having the web app work directly within the request-response cycle, these long-running jobs need to be treated as proper separate tasks: each one gets a record in their database and can be queried for its result later.
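Something in the shape of this sketch (the Flask app, the in-memory job store, and the endpoint paths are all placeholders for whatever their stack actually uses; a real setup would hand the work to a task queue and persist job records in their database):

```python
# Submit-then-poll pattern: the HTTP request returns immediately with a job id,
# and the client polls for the result instead of holding a connection open.
import threading
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # job_id -> {"status": ..., "result": ...}; stands in for a database table

def crunch_numbers(job_id, payload):
    # Placeholder for the long-running computation.
    result = sum(payload.get("numbers", []))
    jobs[job_id] = {"status": "done", "result": result}

@app.route("/jobs", methods=["POST"])
def submit_job():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    # In production this would be a task-queue worker, not a bare thread.
    threading.Thread(
        target=crunch_numbers, args=(job_id, request.get_json()), daemon=True
    ).start()
    # Respond right away; the request-response cycle stays short.
    return jsonify({"job_id": job_id}), 202

@app.route("/jobs/<job_id>", methods=["GET"])
def get_job(job_id):
    job = jobs.get(job_id)
    if job is None:
        return jsonify({"error": "unknown job"}), 404
    return jsonify(job)
```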
It depends. If it’s an internal-facing cluster with little other traffic, then it’s probably fine. If it’s a public-facing cluster behind NAT, you risk exhausting the available ports for open connections.
If the frontend reliably closes connections when it’s done, it’s probably fine to just set a 1-hour timeout. If you run into the problem of clients leaving idle connections open, you may want to set an idle timeout instead and have the client send keepalive packets to the backend, websocket-style.
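On the client side that could look roughly like this (a sketch using the `websockets` Python library; the URL and intervals are invented):

```python
# The library sends a ping frame every `ping_interval` seconds, which keeps an
# idle-timeout-based proxy from cutting the connection while the backend is
# still crunching.
import asyncio
import websockets

async def wait_for_result():
    async with websockets.connect(
        "wss://example.internal/jobs/stream",  # placeholder URL
        ping_interval=20,                      # keepalive ping every 20s
        ping_timeout=20,                       # give up if a pong never arrives
    ) as ws:
        result = await ws.recv()               # backend sends the result when done
        print(result)

asyncio.run(wait_for_result())
```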
There are a lot of smaller and bigger possible problems with that, but I think there’s only one way to find out whether they become actual problems.
Try it, and report back.