August-September Rankings Incident Postmortem
From August 22 to 23, prestigehunt.com experienced an incident that resulted in inaccurate ranking data for the next 19 days. As of yesterday, we’ve officially resolved the incident, and our rankings should now be back to normal.
We at Prestigehunt would like to take this opportunity to sincerely apologize for the impact this incident has had on each and every one of you. Prestige rankings are an irreplaceable part of our everyday lives. We rely upon them when communicating with recruiters, making career decisions, and collaborating with colleagues from other companies. You placed your trust in us to deliver accurate, unbiased rankings… and we failed you. And while we cannot undo the past, we can share the events that led to this incident and the lessons we’ve learned along the way to make sure this doesn’t happen again.
Background
Prestige rankings are stored in a free-tier Firebase database with no backups. It’s important to note that there is a manual backup workflow in place where engineers will ad hoc export the entire contents of our database to their local machine, but this workflow is largely undocumented and commonly forgotten.
Prestige rankings are crowdsourced by users like you via “head-to-head matches”, referred to internally as “H2Hs”. If you visit prestigehunt.com/h2h, you’ll be shown two random companies and asked to choose which is more prestigious. When you do so, that data is posted to our backend API, implemented as a Netlify function. This function performs some mediocre validation and then stores the data in our database. At the same time, a new H2H is displayed to you on our website. You’re also given the opportunity to skip the H2H and be shown a new one.
The posted data might look something like this:
{
"id": "id-of-head-to-head-match",
"a": "id-of-company-a",
"b": "id-of-company-b",
"winner": "a",
"timestamp": 1662857535
}
It’s fairly trivial to learn what the IDs of various companies are by examining our client-side code. It’s also fairly trivial to construct fake H2H data and post it to our backend. As mentioned earlier, we have some backend validation in place to determine whether the posted data is “real”, but it’s not particularly robust because it was written by one of our summer interns.
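For illustration only, here is a minimal sketch of what sturdier validation could look like. It is not our actual Netlify function. It assumes the server signs each H2H it serves with a secret key and that the browser sends that token back alongside the payload. The names `SECRET_KEY`, `KNOWN_COMPANIES`, `MAX_AGE_SECONDS`, `sign_match`, and `validate_h2h` are all hypothetical.

```python
import hashlib
import hmac
import time
from typing import List, Optional

# Hypothetical values: none of these appear in the real system.
SECRET_KEY = b"replace-with-a-real-secret"
KNOWN_COMPANIES = {"id-of-company-a", "id-of-company-b"}
MAX_AGE_SECONDS = 10 * 60  # assumed freshness window for a submitted H2H


def sign_match(match_id: str, a: str, b: str) -> str:
    """Token the server would attach when it serves an H2H to the browser."""
    message = f"{match_id}|{a}|{b}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def validate_h2h(payload: dict, token: str, now: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the payload passed."""
    now = time.time() if now is None else now
    missing = [f for f in ("id", "a", "b", "winner", "timestamp") if f not in payload]
    if missing:
        return [f"missing field: {f}" for f in missing]

    errors = []
    a, b = payload["a"], payload["b"]
    if not all(isinstance(c, str) and c in KNOWN_COMPANIES for c in (a, b)):
        errors.append("unknown company")
    if a == b:
        errors.append("company matched against itself")
    if payload["winner"] not in ("a", "b"):
        errors.append("winner must be 'a' or 'b'")

    ts = payload["timestamp"]
    if isinstance(ts, bool) or not isinstance(ts, (int, float)) or abs(now - ts) > MAX_AGE_SECONDS:
        errors.append("timestamp missing or stale")

    # Only matches that the server itself issued carry a valid token.
    expected = sign_match(payload["id"], a, b)
    if not (isinstance(token, str) and hmac.compare_digest(token.encode(), expected.encode())):
        errors.append("bad token")
    return errors
```

A signed token of this kind only stops requests for matches the server never issued. A script that requests real H2Hs and always picks the same winner would still pass, so rate limiting would have to be handled separately.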
Prestige rankings are not updated every time an H2H is submitted. They are updated once a day by a batch job that processes all new H2Hs in one go. This batch job runs on a hot suite of free-tier services.
Between August 22 and 23, we received a flood of spam H2H data. We know this was spam data because:
- We received about 17,000 H2Hs over the course of two days, which represented approximately a 5,000% increase in daily H2Hs.
- These H2Hs overwhelmingly had a targeted subset of companies as winners and losers. For example, IBM was a winner in a vast majority of these H2Hs. Companies chosen in H2Hs should be uniformly distributed, as H2Hs are selected at random.
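Both signals in the list above can be checked mechanically. The sketch below is an assumption-laden illustration, not something we actually run. The thresholds `max_increase_pct` and `max_winner_share` are made up. The baseline of roughly 200 H2Hs per day is only inferred from the figures in this postmortem (6,500 H2Hs being about a 3,150% increase) and is not a measured value.

```python
from collections import Counter
from typing import Dict, List


def daily_increase_pct(todays_count: int, baseline_daily_count: float) -> float:
    """Percent increase of today's H2H volume over a typical day."""
    return (todays_count - baseline_daily_count) / baseline_daily_count * 100


def winner_share(h2hs: List[dict]) -> Dict[str, float]:
    """Fraction of the given H2Hs won by each company."""
    # h["winner"] is "a" or "b", so h[h["winner"]] is the winning company's ID.
    winners = Counter(h[h["winner"]] for h in h2hs)
    total = sum(winners.values())
    return {company: n / total for company, n in winners.items()}


def looks_like_spam(
    h2hs: List[dict],
    baseline_daily_count: float,
    max_increase_pct: float = 500.0,  # hypothetical threshold
    max_winner_share: float = 0.2,    # hypothetical threshold
) -> bool:
    """Flag a day's H2Hs when volume spikes AND one company wins far too often."""
    if not h2hs:
        return False
    spike = daily_increase_pct(len(h2hs), baseline_daily_count) > max_increase_pct
    skew = max(winner_share(h2hs).values()) > max_winner_share
    return spike and skew


# With the inferred baseline of 200 H2Hs per day:
print(daily_increase_pct(6500, 200))  # 3150.0
```

Requiring both signals keeps a quiet day with a handful of lopsided results from being flagged. Sensible thresholds would depend on how many companies are in the pool.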
This incident triggered a series of events that led to about 19 days of inaccurate ranking data. Rankings were eventually restored by manually removing all H2H data between August 22 and 23 and replaying all H2Hs to reconstruct ranking and time-series data.
Incident timeline
2022-08-22
Over 6,500 H2Hs were submitted, representing approximately a 3,150% increase in daily H2Hs. These H2Hs were manually constructed to have companies such as IBM and Amazon as the winners and companies like Jane Street and Google as the losers, resulting in some companies unexpectedly surging to the top of the rankings, and others falling to the bottom.
This information was logged to a Google sheet that our engineering team occasionally looks at to see if ranking data looks normal. No Prestigehunt engineers looked at this sheet.
2022-08-23
Over 10,500 H2Hs were submitted, largely repeating the same strategy as on 2022-08-22. Again, this information was logged to a Google sheet, but no Prestigehunt engineers opened it.
2022-08-25
The following support ticket was emailed to our Customer Support team:
I think someone manipulated the rankings. Jane street should be >> ibm.
Unfortunately, this email account is checked for new messages about once a month, and the team had rather recently checked the account. Thus, no one saw this message.
2022-08-26
An engineering lead at Prestigehunt visited the website over the weekend and noticed that IBM was at the top of the rankings. The individual quickly notified the Data Integrity department.
The Data Integrity team immediately read the incoming report, but decided against taking action until the following Monday. This decision was largely made because the team didn’t want to work over the weekend.
2022-08-28
The first end-user report was observed by the Community Success team. The following was posted on the Blind app under Tech Industry:
PrestigeHunt wtf? Amazon #2 on the prestige list, IBM #1. We gotta get rid of this website once and for all.
After some careful analysis, the Community Success team determined that this post was an existential risk to their team. They again notified the Data Integrity team of the issue.
The Data Integrity team chose not to respond to this incoming report.
2022-09-10
A few weeks later, the Data Integrity team finally decided to take action. They identified that the quickest way to get the website back up and running was to purge all data from August 22 to 23 and replay the H2H matches from the beginning of time. After doing so, they marked the incident as resolved.
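The purge-and-replay step can be sketched roughly as follows. This postmortem does not say how rankings are computed from H2Hs, so the Elo-style update here is purely an assumption for illustration, as are the constants `k` and `initial` and the choice to interpret the August 22 to 23 window in UTC.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Assumed spam window: all of August 22 and 23, 2022, interpreted in UTC.
SPAM_START = int(datetime(2022, 8, 22, tzinfo=timezone.utc).timestamp())
SPAM_END = int(datetime(2022, 8, 24, tzinfo=timezone.utc).timestamp())  # exclusive


def purge_window(h2hs: List[dict], start: int = SPAM_START, end: int = SPAM_END) -> List[dict]:
    """Drop every H2H whose Unix timestamp falls inside [start, end)."""
    return [h for h in h2hs if not (start <= h["timestamp"] < end)]


def replay(h2hs: List[dict], k: float = 32.0, initial: float = 1500.0) -> Dict[str, float]:
    """Rebuild ratings from scratch by replaying H2Hs in timestamp order.

    Uses an Elo-style update as a stand-in for the real ranking algorithm.
    """
    ratings: Dict[str, float] = {}
    for h in sorted(h2hs, key=lambda h: h["timestamp"]):
        a, b = h["a"], h["b"]
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        expected_a = 1 / (1 + 10 ** ((rb - ra) / 400))
        score_a = 1.0 if h["winner"] == "a" else 0.0
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb + k * ((1 - score_a) - (1 - expected_a))
    return ratings


def rebuild_rankings(all_h2hs: List[dict]) -> List[str]:
    """Company IDs from most to least prestigious after purging the spam window."""
    ratings = replay(purge_window(all_h2hs))
    return sorted(ratings, key=ratings.get, reverse=True)
```

Replaying from the beginning of time, rather than patching the current ratings, avoids having to reason about how each spam H2H influenced every update that came after it.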
What went well
- The Data Integrity team took excellent, timely action on the incident, responding and resolving the issue in its entirety within 19 days, and staying well within their SLO of 1 month for P0 incidents.
- The Customer Support team was pleased to eventually see that at least one user attempted to notify them of the issue. This presumably meant that users actually cared about the website.
- The Google sheet infrastructure was an outstanding tool for understanding the root cause of the inaccurate rankings.
What went poorly
TODO
Next steps
- We’re strongly considering upgrading to Firebase’s Blaze plan to set up daily backups. However, Prestigehunt is already operating at a monthly loss, so there’s debate to be had here.
- We’ve decided to adopt even more Google sheets as critical, operational infrastructure given its success in mitigating this incident.
- We will not be extending a full-time offer to the summer intern who wrote our backend H2H validation code.
- Our Fall intern class is working on implementing additional backend validation to flag and discard fake H2Hs.
Closing remarks
Our postmortem would not be complete without some baseless speculation regarding the motivation behind the attack.
According to our rankings, IBM is not a particularly prestigious company (rank 91 at the time of writing). Was this attack carried out by a disgruntled IBM radicalist, looking to “stick it to the community” once and for all? Or was this attack orchestrated by IBM itself, a tasteless ploy to recruit hapless developers in a time of market uncertainty? Who knows how many additional job applications IBM received in the 19 days they were at the top of the rankings.
To IBM leadership and engineers, we say this: The path to prestige is a long and arduous one. It’s clear that IBM’s prestige isn’t what it used to be. But change comes from within, from developing high-quality software and building strong engineering teams. Attacks like these are but small bandaids on a festering wound. Focus your efforts not on manipulating public perception, but on building a more prestigious company from within.