We're using grpc-java in production for some of our backend systems, slowly replacing an older netty/jackson-based system that used JSON over HTTP/1.1.
The performance is good, and it's nice to have proto files with messages and services, which act both as documentation and as a way to generate client and server code. Protobuf is much faster, produces less garbage, and is easier to work with than JSON/jackson. The generated stubs are very good, and it's easy to switch between blocking and asynchronous requests, which still require only a single TCP/IP connection.
We've had two performance problems with it:
1. Connections can die in a somewhat unexpected way. This turned out to be caused by HTTP/2.0, which only allows about 1 billion streams over a single connection. Maybe not a common issue, but it hurt us because a few of our processes hit this limit at the same time, breaking our redundancy. It's easy to work around, and I believe the grpc-java team has plans for a fix that would make this invisible behind a single channel.
2. Mixing small/low-latency requests with large/slow requests caused very unstable latency for the low-latency requests. Our current workaround is to start two grpc servers (still within the same java process, sharing the same resources). The difference is huge, with p99 latency going from 22ms to 2.4ms just by using two different ports. Our old JSON-over-HTTP/1.1 code, implemented with jackson and netty, didn't suffer this latency instability, so I suspect grpc is doing too much work inside a netty worker or something. I haven't yet tested with grpc-java 1.0, which I see has gotten a few optimizations.
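For reference, the two-server workaround looks roughly like this with grpc-java (the service classes and port numbers here are made up for illustration; both servers live in the same JVM and share its resources):

```java
import io.grpc.Server;
import io.grpc.ServerBuilder;

public class SplitServers {
    public static void main(String[] args) throws Exception {
        // Hypothetical service implementations; the names are illustrative only.
        // Small/low-latency RPCs get their own server and port...
        Server latencySensitive = ServerBuilder.forPort(8081)
                .addService(new FastLookupService())
                .build()
                .start();

        // ...while large/slow RPCs go to a second server on another port,
        // so their work no longer sits in front of the fast requests.
        Server bulk = ServerBuilder.forPort(8082)
                .addService(new BulkTransferService())
                .build()
                .start();

        latencySensitive.awaitTermination();
        bulk.awaitTermination();
    }
}
```

Nothing else changes for clients except which port they dial; the isolation comes purely from the two servers having separate connections and separate netty pipelines.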
Still, these have been minor issues, and we're happy so far. The grpc-java team is doing a good job taking care of things, both with code and communication.
> This turned out to be caused by HTTP/2.0 which only allows 1 billion streams over a single connection.
Hilarious. People called this issue out as an obvious flaw when HTTP/2.0 was first proposed, got ignored, and here the issue is.
For those unfamiliar:
HTTP/2.0 uses an unsigned 31-bit integer to identify individual streams over a connection.
Server-initiated streams must use even identifiers.
Client-initiated streams must use odd identifiers.
Identifiers are not reclaimed once a stream is closed. Once you've initiated (2^31)/2 streams, you've exhausted the identifier pool and there's nothing you can do other than close the connection.
For comparison, SSH channels use a 32-bit arbitrary channel identifier, specified by the initiating party, creating an identifier tuple of (peer, channel). Channel identifiers can be re-used after an existing channel with that identifier is closed.
As a result, SSH doesn't have this problem, or the need to divide the identifier space into even/odd (server/client) channel space.
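To put the limit in concrete numbers (the request rate below is hypothetical, purely for illustration): a client only gets the odd half of the 2^31 identifier space, i.e. 2^30 streams per connection, and each unary gRPC call consumes one stream.

```java
public class StreamIdExhaustion {
    // Client-initiated streams available on one HTTP/2 connection:
    // 2^31 identifiers total, odd-only for the client side -> 2^30.
    static long clientStreamsPerConnection() {
        return (1L << 31) / 2;
    }

    // Seconds until a connection runs out of stream IDs at a given
    // steady request rate (one unary gRPC call = one stream).
    static long secondsToExhaust(long requestsPerSecond) {
        return clientStreamsPerConnection() / requestsPerSecond;
    }

    public static void main(String[] args) {
        System.out.println(clientStreamsPerConnection()); // 1073741824
        // At a hypothetical steady 10,000 requests/second, the pool
        // lasts a bit under 30 hours before the connection must be closed.
        System.out.println(secondsToExhaust(10_000) / 3600 + " hours");
    }
}
```

So a busy process can plausibly burn through a connection's entire identifier pool in about a day, which matches the "processes reaching this limit at the same time" failure mode described above.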
Well, that's half of the downside presented; the other half is that the identifier space is split between server-initiated and client-initiated streams. I assume this was done because it simplifies tracking the next stream identifier: you can just keep a counter and increment it, rather than maintaining a table of used streams to check a new random identifier against?
So, that's the other side of the argument? I assume there was at least a reason they specced it this way originally, even if under comparison those reasons wouldn't have held up. Was there any justification, or was it literally ignored?