Java JPA Paging Fiasco

I am a programmer by trade, but I have many job titles including software engineer, software architect, software developer, etc… In a nutshell, I write code, and I do it very well. I have a couple of design principles that always come into play when I write my code, and 99% of the time they hold up well. This is a story about the other 1% of the time where they completely backfire and nearly take down our production service.

I received a new task: ingest a file, map it to a database table, and provide an API so the frontend team could query the data through our service. After doing a bit of research, I found out that another team was already ingesting the same file. There wasn’t a whole lot of communication across teams, but this was an amazing opportunity to reduce duplicated effort. It meant I didn’t have to create the ingestion pipeline or the table mapping, because they were already doing both.

I reached out to the team to ask if they were willing to create an API to make the data accessible, and they replied that they wouldn’t be able to get to it for about 6 months. That was too much time, so I asked if they would let one of our engineers develop in their codebase to create the API so we could accomplish our tasks. They agreed on the condition that one of their engineers approved the code changes before they were merged into the master branch. It’s worth noting that this was the data team: the database that held this data also held a lot of other data backing many other applications.

For a bit of context, their application was running in a Kubernetes environment. A production backend web service likely shouldn’t be backed by a single computer. A report on Google’s energy usage in 2011 suggested that they had 900,000 computers backing their data centers. Kubernetes, based on an internal Google project called Borg, is an application that Google built and released publicly. It lets you deploy multiple instances of your application across multiple computers, and it handles application orchestration: if one instance goes down unexpectedly, it attempts to start it back up, and if client usage gets heavy, it attempts to create additional application pods to handle the influx of traffic.

For the API design, it was expected that two subsequent requests from the same client would go to two separate application pods in the Kubernetes environment. In addition, the data in the table amounted to more than 18 gigabytes. Access to this data needed to be both distributed and paginated. Pagination means that each request returns only a slice of the data. Having a client download 18 gigabytes in a single request would take an excessive amount of bandwidth, their browser likely couldn’t hold that much data, and it would make for a very poor user experience. With pagination, a user could request page 1 of about 50 records from pod 1, and the same user could request page 2 of the next 50 records from pod 2 on a subsequent request.
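To make the arithmetic concrete, here’s a minimal sketch of offset-based paging with hypothetical names: page N of size 50 simply maps to records N × 50 through N × 50 + 49.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Minimal illustration of offset-based paging (hypothetical names):
// page N of size 50 corresponds to records [N * 50, N * 50 + 50).
public class PagingSketch {
    static <T> List<T> page(List<T> records, int pageNumber, int pageSize) {
        int from = Math.min(pageNumber * pageSize, records.size());
        int to = Math.min(from + pageSize, records.size());
        return records.subList(from, to);
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.range(0, 200).boxed().collect(Collectors.toList());
        System.out.println(page(data, 1, 50)); // elements 50 through 99
    }
}
```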

For the implementation, I decided to use Spring Data JPA’s PagingAndSortingRepository interface. It wraps up most of the implementation logic involved in querying the database; you can expose paginated access to the data with essentially a single line of code. It’s a generally well-accepted way of paginating data, and without a doubt one of the best ways of implementing an API like this. I wasn’t required to do any write operations against the database, since that data pipeline was already set up; I just needed to read from it. In no time, the API was implemented, tested, reviewed, approved, and shipped to production.
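As a rough sketch of what that looks like (the entity, field, and endpoint names here are hypothetical, not the project’s real ones), the repository is a one-line interface and the controller just hands Spring a page number and page size:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import org.springframework.data.domain.Page;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.repository.PagingAndSortingRepository;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical entity standing in for the ingested file's rows.
// (jakarta.persistence on Spring Boot 3; older versions use javax.persistence.)
@Entity
class IngestedRecord {
    @Id
    Long id;
    String payload;
}

// The "single line of code": Spring generates the paged query implementation.
interface IngestedRecordRepository extends PagingAndSortingRepository<IngestedRecord, Long> {
}

@RestController
@RequestMapping("/records")
class IngestedRecordController {
    private final IngestedRecordRepository repository;

    IngestedRecordController(IngestedRecordRepository repository) {
        this.repository = repository;
    }

    // GET /records?page=1&size=50 returns a single 50-record slice of the table.
    @GetMapping
    Page<IngestedRecord> list(@RequestParam int page,
                              @RequestParam(defaultValue = "50") int size) {
        return repository.findAll(PageRequest.of(page, size));
    }
}
```

The appeal is that a single interface declaration gives you paging, sorting, and query generation for free.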

Two months later, the frontend team was ready to flip the feature flag on and start querying this data through the API I had designed. After about 2 hours of being active, the database being queried was reporting very high CPU usage, very high memory usage, and slowness across the entire platform. None of this was visible to me, because I neither maintained the application I had developed the API in nor the database it was accessing. The team saw very high query volume pointing at the queries against the table I had built the API for. They flipped the feature flag back off, noticed that all of the resource pressure went away, and rightfully concluded that the issue was the API I had developed.

I developed the API to specification, almost textbook. I tested the API to ensure the pagination worked correctly, and to ensure the queries on the backend were working correctly. The code was reviewed by the team that owned the project. The feature underwent QA testing by multiple teams before being released. What went wrong?

Well, there are a few things that went wrong. First, the data warehouse was not running a typical database. The driver underneath JPA could make successful connections to it, but the database is built to handle large volumes of data, distributing it by sharding out portions of the database in ways unlike ordinary database tables. You cannot create indexes or primary keys on it. The other problem is that we queried the data through a chain of proxying services. The request would come from the website and hit service 1, which queried service 2, which queried service 3, which queried the database. Typically, when you query a service you open a connection. Opening a connection has a bit of overhead, like validating certificates and establishing a secure channel; subsequent calls on the same connection don’t pay that overhead. This is useful in high-performance service meshes to cut down on latency. In our case, the connection was being cut and re-established along this execution path.
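As an aside on the connection-reuse point, here’s a small sketch using the JDK’s built-in HTTP client (not the actual services from this story): a single client instance keeps a connection pool, so repeated requests to the same host can skip the handshake, whereas tearing the connection down between hops forces each request to pay that setup cost again.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConnectionReuseSketch {
    public static void main(String[] args) throws Exception {
        // One client instance maintains an internal connection pool.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/")).build();

        // The first call pays for DNS, TCP, and TLS setup.
        HttpResponse<String> first = client.send(request, HttpResponse.BodyHandlers.ofString());

        // A second call to the same host can typically reuse the pooled connection,
        // skipping the handshake overhead entirely.
        HttpResponse<String> second = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(first.statusCode() + " " + second.statusCode());
    }
}
```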

One of the byproducts of cutting the connection along the execution path was that the database would drop the cursor position for the query, and this was ultimately the biggest performance hindrance of the implementation. I hadn’t seen it because during local testing I established a direct connection to the database, so my cursor position was never dropped between paginated queries. When the cursor is dropped, the database has to re-establish a cursor position before it can select a specific page. For example, if you query page 500 without a cursor position, the database effectively has to work through pages 0–500 and only return page 500. That is roughly 500 times as expensive as it should be, and it was ultimately what caused the performance issues on the database. With a little bit of restructuring, we were able to get the cursor to hold its position and re-enable the feature.
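The restructuring itself isn’t detailed here, but for a sense of the kind of change that sidesteps this problem, below is a hypothetical sketch of keyset (seek-style) pagination, a common alternative to offset paging: instead of asking for page 500, the client sends back the last key it saw, so the database can seek straight past it rather than re-scanning everything before it. All names are assumptions, not the project’s actual code.

```java
import java.util.List;
import org.springframework.data.domain.Pageable;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.Repository;
import org.springframework.data.repository.query.Param;

// Hypothetical seek-style repository; reuses the IngestedRecord entity sketched above.
interface IngestedRecordSeekRepository extends Repository<IngestedRecord, Long> {

    // The client passes the last id it received; the database seeks past that key
    // instead of re-walking pages 0..N to find its place again.
    @Query("select r from IngestedRecord r where r.id > :lastSeenId order by r.id")
    List<IngestedRecord> findNextPage(@Param("lastSeenId") long lastSeenId, Pageable pageable);
}
```

A caller would pass something like PageRequest.of(0, 50) as the Pageable and thread the highest id from each response into the next request.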