r/ProductManagement Aug 13 '21

"The worst day of my life."

That's how one of our developers summed up the day's main "achievement." My team had just accidentally crashed the entire tech platform powering 700+ retail units in 10+ countries.

For some time, the system was down, and we couldn't accept orders from customers. Which is, as you can imagine, a big deal for a retail company.

Were we rolling out some major update? Nope.

My small team was just trying to add a small feature to the company website: live counters for certain items we're selling.

We wanted to kill two birds with one stone here:

1️⃣ show some of our best products on the company website;

2️⃣ promote the proprietary tech that powers the business, by showing how many of those featured products we've already sold today or this week.

The coolest part: visitors were supposed to see the number tick up with every new order placed.

But… we messed up the integration of our API with RabbitMQ (the message broker that carries all the events in the system), and as a result, orders started going to our website instead of our order tracker.

What a mess! It took us some time to roll it back and fix it.
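The post doesn't say exactly what was misconfigured. One common way for orders to end up at the wrong service is a new consumer attaching to a queue that another service already reads from: RabbitMQ hands out a queue's messages round-robin among its consumers, so the newcomer takes a share of them away. A read-only feature like a counter would normally get its own queue bound to the exchange, so it receives copies instead. Here's a minimal sketch of that difference using a toy in-memory broker, not real RabbitMQ; the class, the queue names and the exchange name are all made up for illustration.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker (hypothetical, not RabbitMQ)."""

    def __init__(self):
        self.bindings = defaultdict(list)   # exchange name -> bound queue names
        self.consumers = defaultdict(list)  # queue name -> consumer callbacks
        self.next_index = defaultdict(int)  # queue name -> round-robin position

    def bind(self, exchange, queue):
        # Binding the same queue twice must not duplicate deliveries.
        if queue not in self.bindings[exchange]:
            self.bindings[exchange].append(queue)

    def consume(self, queue, callback):
        self.consumers[queue].append(callback)

    def publish(self, exchange, message):
        # Fanout behaviour: every bound queue gets its own copy.
        for queue in self.bindings[exchange]:
            consumers = self.consumers[queue]
            if not consumers:
                continue  # toy broker: drop messages nobody is consuming
            # Within one queue, each message goes to exactly one consumer.
            index = self.next_index[queue] % len(consumers)
            self.next_index[queue] += 1
            consumers[index](message)


def run(counter_queue):
    """Publish four orders; return what the tracker and the website received."""
    broker = ToyBroker()
    tracker, website = [], []
    broker.bind("orders", "order_tracker")
    broker.bind("orders", counter_queue)
    broker.consume("order_tracker", tracker.append)
    broker.consume(counter_queue, website.append)
    for order_id in range(1, 5):
        broker.publish("orders", order_id)
    return tracker, website


# Website consumes from the tracker's own queue: the two compete for orders.
shared = run("order_tracker")
# Website gets a dedicated queue: both services see every order.
dedicated = run("website_counter")
```

With the shared queue, the tracker only sees half of the orders and the website swallows the rest. With the dedicated queue, the tracker keeps getting everything and the website just listens in.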

The root cause here has a lot to do with a lack of proper communication. Since our team was working with RabbitMQ for the first time, we should have talked to the owners of this product to double-check everything.

Instead, we (my guys) mostly relied on internal documentation (which wasn't entirely clear and up-to-date).

What I've learned from this: talking doesn't come naturally to many developers, but they should still talk to each other more. Not just write stuff on Slack (where it's super easy to get things wrong or miscommunicate something), but get in one room and talk face to face. Especially when they touch critical parts of the system. Especially when they do it for the first time.

Never again!

I wrote this mostly for myself but decided to share it here so someone can learn from my mistakes. Hope it's helpful.

u/jetf Aug 13 '21

What you should learn from this is that skipping staging environment validation is an absolutely terrible practice. You should be more vocal about wanting to validate and test releases.

u/maxthescribbler 28d ago

Actually, we do have staging environments. Since my team works on a side project that is only supposed to read data, this was an unusual case. It basically fell through the cracks.