Does Yelp Need Geospatial Sharding?
Or would storing the data in "conventional" RDBMSes just do the job?
This question came up during one of my off-the-record events: What’s geospatial sharding good for?
Before I dive in, let me share a few things up front:
Geospatial sharding is a great algorithmic idea.
It also is a useful idea when scaling distributed systems.
In data-heavy large scale systems, such as Uber, it’s a must.
However, is it as useful a technique for simpler problems, e.g. Yelp?
I can totally buy the argument that Uber or Lyft do need some form of geospatial sharding. Although, even in their case, manually breaking the world into pieces, perhaps by license, by regulation, or by city/state, might well be enough.
I can somewhat buy the argument that features such as "nearby friends" do need geospatial sharding. Maintaining a large number of pairs of coordinates and quickly searching for the close ones in this dynamic space is a special kind of problem indeed.
But I doubt that Yelp et al. would win much by employing geospatial sharding, simply because the win it provides may well not be worth the cost of maintaining such an index.
Let me set things straight: if you are interviewing for L4/L5, and are asked about Yelp in a system design interview, chances are, you would do just fine by attacking the problem from the geospatial angle up front. Moreover, even if you solve the problem perfectly, but choose to not mention geospatial indexes, an interviewer may well hold it against you; so, by all means, make sure to at least drop the name of this technique.
But, seriously, why not just shard restaurants by the hashes of their IDs, store the data on each restaurant in a primary / secondary [maybe even tertiary] fashion over a [few] dozen or so Postgres instances, build indexes on the lat-s and long-s in those Postgreses, and just SQL-query each shard before returning the result to the end user?
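To make the "shard by the hashes of their IDs" part concrete, here is a minimal sketch. The shard count, replication factor, and function name are my own illustrative choices, not anything prescribed above:

```python
import hashlib

NUM_SHARDS = 25  # "a [few] dozen or so" Postgres instances; an assumed constant
REPLICATION = 3  # primary / secondary / tertiary copies per restaurant

def shards_for_restaurant(restaurant_id: str) -> list[int]:
    """Map a restaurant ID to the shards that hold its row.

    We hash the ID, not the coordinates, so placement is uniform and
    independent of geography; replicas go on consecutive shards.
    """
    digest = hashlib.sha256(restaurant_id.encode()).digest()
    primary = int.from_bytes(digest[:8], "big") % NUM_SHARDS
    return [(primary + i) % NUM_SHARDS for i in range(REPLICATION)]
```

Note that because placement ignores geography, every shard holds a uniform slice of the world, which is exactly why a query later has to fan out to all shards.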
From the distributed data structures point of view, I think Yelp, as a system design interview question, has the following properties:
Restaurants are queried often, but added and removed infrequently.
When querying the data, the user hardly needs more than ~50 results.
Hotspots exist, and do move around dynamically.
Most queries are localized to a narrow region (returning ~10s of results).
Some queries are super broad (and need only the very best, cacheable, hits).
And we have a corner case to support: broad area, tight filters.
The latter case is probably the trickiest one. Imagine a query that:
Covers all of North America,
Requests a Nepalese food place,
That is open 8am next Monday,
And the owner of which is a black woman (yes, a valid filter as of 2022).
For such an extreme example, I can easily imagine there would literally be one hit over the whole continent.
(It would likely be a mistake in data entry, and a good candidate should point this out, but this detail is no excuse to not design a good data storage & retrieval system.)
Still, if all our restaurants are stored, even with 2x / 3x duplication, as rows in ~25 Postgres databases, with longitude, latitude, and all other queryable attributes, plus some static rank, as columns, won’t a simple “select id from restaurants_shard where [lat-long constraints] and [other constraints] … order by static_rank desc limit” just do the job? Over native Postgres, with no PostGIS whatsoever?
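For concreteness, here is roughly what that per-shard query could look like, using in-memory SQLite as a stand-in for one Postgres shard (the query shape is the point, not the engine; the table layout and column names are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE restaurants_shard (
        id TEXT PRIMARY KEY,
        lat REAL, lng REAL,
        cuisine TEXT,
        static_rank REAL
    )
""")
# Plain B-tree index on the coordinates; no PostGIS, no geospatial types.
conn.execute("CREATE INDEX idx_lat_lng ON restaurants_shard (lat, lng)")
conn.executemany(
    "INSERT INTO restaurants_shard VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", 37.77, -122.42, "nepalese", 4.8),
        ("r2", 37.78, -122.41, "thai", 4.6),
        ("r3", 40.71, -74.00, "nepalese", 4.9),  # outside the bounding box
    ],
)

# Bounding-box query with an attribute filter, ranked and capped:
rows = conn.execute("""
    SELECT id FROM restaurants_shard
    WHERE lat BETWEEN 37.0 AND 38.0
      AND lng BETWEEN -123.0 AND -122.0
      AND cuisine = 'nepalese'
    ORDER BY static_rank DESC
    LIMIT 50
""").fetchall()
```

A bounding box is a crude substitute for a true radius search, but for a "restaurants near me" page it is usually close enough, and it keeps the query on ordinary B-tree indexes.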
By simply SQL-querying each Postgres shard, of course, and merging the results in some custom application code, after all N (or, perhaps, N-2) Postgres DBs have returned their results (because we are duplicating data 3x, with primary / secondary / tertiary DB locations per restaurant, for redundancy reasons).
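That scatter-gather merge step could be sketched like this; the function name and the (rank, id) tuple shape are assumptions for illustration:

```python
import heapq

def merge_shard_results(per_shard_results, limit=50):
    """Merge per-shard hit lists into one ranked page.

    Each shard returns (static_rank, restaurant_id) pairs already sorted
    by rank descending; heapq.merge does a lazy k-way merge across them.
    Duplicates can appear because each restaurant lives on ~3 shards,
    so we de-duplicate by ID, keeping the first (highest-ranked) copy.
    """
    seen = set()
    merged = []
    for rank, rid in heapq.merge(*per_shard_results, key=lambda p: -p[0]):
        if rid not in seen:
            seen.add(rid)
            merged.append(rid)
            if len(merged) == limit:
                break
    return merged
```

Since every shard already sorts and caps its own results, the application code only ever touches N small, pre-sorted lists, which is why this merge stays cheap even when the underlying tables are large.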
Why do I even bring this up? Because it sure is not a solution most interviewers would expect. Both manually partitioning data (as opposed to using MongoDB / Cassandra / DynamoDB / consistent hashing) and using Postgres DBs are big red flags in “interview-grade” system design these days.
But, unless I’m missing an elephant in the room, I still think it’s a great design. It also happens to be what I can see myself putting together end-to-end over a weekend, in a way that would remain maintainable five years later, by the folks who have not seen our solution before. Brutally effective is what comes to mind first.
And, as an interviewer, as long as the candidate can explain the pros and cons of the above solution, I’d give them a “strong hire”. Even if they neglect to mention the geospatial solution (although I would then ask what if we are talking about food trucks, which move and go offline and online far more often than brick-and-mortar restaurants).
Bonus questions if you do like the design: Name the top one, or top three, problems with running it in production, and suggest ways to deal with them.
Note: I deliberately avoided talking about the algorithmic techniques of geospatial sharding in this post, simply because it would make the post unnecessarily long, while my goal is to keep the thought focused on system design.