Here at Clio we’ve spent over a year working on a project to modernize our database infrastructure. Now that we’ve wrapped it up and we’re in the tidying-up phase, it seems like a good time to start talking about how it went and reflecting on the hard-won lessons we’ve learned. We did a lot of things right and some things wrong, so hopefully others can learn from our experiences.

## The problem

For the past ten years the entire Clio app has run on a single database. Sure, there were some additional data stores - memcached for caching, Redis for some small things, Elasticsearch for search queries, S3 for large blobs - but the vast majority of our data lived in a single MySQL cluster. This was simple and convenient: you knew where everything was supposed to go, you could join any data to any other data, and monitoring and maintenance were fairly straightforward.

But by early 2017 it was clear that we couldn’t scale this approach any further. MySQL has a fairly simple concurrency model: each query is processed by a single OS-level thread. This is fine for most circumstances, but under heavy loads you’ll end up with a very long run queue on the database server. Our database was handling around 12,000 queries per second in early 2018, and the concurrency was already becoming problematic.

This sort of load would be workable if it were 12,000 reasonably uniform queries per second, but our queries tend to vary widely in terms of processing time, result set size, input/output time, and temporary space required. The result was poor performance, as a batch of the slower queries could cause the faster queries to pile up behind them, waiting for a chance at one of the CPU cores. (Our customers range from small single-person law offices to huge firms with gigabytes of data, so the pain we feel from a particular query depends a lot on who’s asking.) Contention for I/O time was also a cause of issues, since multiple I/O-bound queries hitting different pages from the disk at once would often leave other queries just waiting for a chance to read from the disk, causing further pileups.

When you have a single database you only have two options for scaling: either reduce the number and latency of the queries you’re making, or buy bigger server hardware. Over the past few years we’d already gotten to the point of diminishing returns on query optimization, where all the low-hanging fruit had been properly massaged, indexed, or removed, so any further query optimization would be a lot of work for comparatively small gains. And hardware-wise we were nearing the limit of what we could achieve with commodity hardware, since there are only so many CPUs you can fit into a single machine.

We decided, therefore, that the best course of action would be to spread the load out over multiple databases - still a lot of work, but at least this would be for a huge long-term gain. We proposed to split our database into multiple shards, each of which would store a subset of our data. We’d start with N=2, with two clusters each storing half of our data, but thereafter could easily scale out to an arbitrary number of databases as needed.
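To make that concrete, here’s a minimal sketch of what tenant-keyed shard routing could look like, assuming each customer’s data lives entirely on one shard. The names here (`SHARD_DSNS`, `shard_for`, `dsn_for`) are illustrative only, not our actual implementation:

```python
# Illustrative sketch only - not our production code.
# Assumes every row belongs to exactly one tenant (law firm), so any
# tenant's queries can be answered entirely by that tenant's shard.

SHARD_DSNS = [
    "mysql://shard-0.db.internal/clio",  # N=2 to start: half the tenants here...
    "mysql://shard-1.db.internal/clio",  # ...and the other half here.
]

def shard_for(tenant_id: int, num_shards: int = len(SHARD_DSNS)) -> int:
    """Map a tenant to a shard index.

    Modulo keeps the sketch short; a directory table (tenant_id -> shard)
    is friendlier in practice, since it lets you move one tenant at a time
    when adding shard N+1 instead of reshuffling everyone.
    """
    return tenant_id % num_shards

def dsn_for(tenant_id: int) -> str:
    """Pick the connection string for the shard that owns this tenant."""
    return SHARD_DSNS[shard_for(tenant_id)]
```

With routing like this in place, scaling out amounts to standing up another cluster, adding its DSN, and migrating a subset of tenants over - which is why a lookup-table scheme, rather than plain modulo, is what makes growing to an arbitrary number of databases practical.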