I Shipped an Algorithm That Could Lose People Real Money — Here's How I Slept at Night
On building a portfolio allocation engine for ₹300Cr+ in live money, the shadow mode that saved us, and the estimation disaster I own.
There's a specific kind of fear that hits you when you're about to deploy code that controls how real money moves. Not "oops, the page crashed" money. Hundreds of crores. Thousands of accounts. Options trades where a partial execution doesn't just lose money — it leaves someone with an unhedged position and theoretically unlimited risk.
I built the portfolio allocation engine at a Y Combinator-backed fintech that automated options trading. This is the story of how I shipped it without blowing up anyone's account — and the estimation disaster that almost derailed the whole thing.
The Problem: One Portfolio Wasn't Enough
Users on our platform could only invest in a single trading strategy. If that strategy had a bad week, their entire capital took the hit. No diversification. Churn was climbing.
The product ask was straightforward: let users invest across multiple model portfolios simultaneously. Spread risk. Reduce churn.
The engineering reality was anything but straightforward.
The Constraint That Changed Everything
Options trading has a property that makes it fundamentally different from, say, allocating money across mutual funds. Most options strategies involve hedged positions — a spread, for example, where you buy one contract and sell another. The two legs protect each other.
If you only execute one leg, the user is left holding a naked position. Depending on the market, that exposure can be unlimited.
This meant the allocation algorithm had a hard constraint: either a user trades ALL their assigned strategies for the week, or they trade NONE. There's no "let's do 3 out of 5 and skip the rest." Partial allocation is financially dangerous.
I remember sitting with this constraint and thinking — this isn't a software problem anymore. This is a "someone's life savings" problem.
The Two-Pass Algorithm
If you squint at this problem from an algorithms perspective, it's a variant of the 0/1 Knapsack problem with an all-or-nothing constraint. Each user has a "knapsack" (their available capital) and a set of "items" (strategies, each with a capital cost). The twist: unlike classic knapsack where you maximize value by selecting a subset, here the constraint is you must fit ALL items or take NONE. There's no partial selection. This simplifies the optimization (no need to find the optimal subset) but makes the feasibility check critical — get it wrong, and someone has an unhedged position.
More formally, this is a constraint satisfaction problem (CSP): given a user's capital state and a set of strategy requirements, determine if a valid assignment exists where all strategies are funded simultaneously. Pass 1 is the feasibility check. Pass 2 is the assignment.
The allocation engine ran as a Cloud Scheduler-triggered daily job. Every morning before market open, it processed all 2,500+ accounts and decided: who trades today, and how much capital goes where.
Pass 1 — Eligibility (Constraint Satisfaction)
For each user, compute projected available capital:
projected_available = current_balance
                    - capital_locked_in_open_positions
                    - margin_reserved_for_pending_orders
                    + capital_releasing_from_today's_expiries

Then sum the capital required across ALL assigned strategies for the week. If projected_available < total_required_across_all_strategies, the user is marked ineligible. They sit out the entire week. No partial bets.
This is the CSP feasibility check — can all constraints (strategy capital requirements) be satisfied simultaneously given the available resources (user capital)? If not, the entire assignment is rejected. No greedy "fit as many as possible."
The tricky term is capital_releasing_from_today's_expiries. A position expiring today could settle in profit (releasing more capital than expected) or in loss (consuming more). At the time the algorithm runs — early morning, before market open — we don't know the settlement outcome yet. I initially used last closing price to estimate this. Shadow mode later revealed this was insufficient (more on that below).
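Pass 1 can be sketched in a few lines. This is an illustrative reconstruction, not the production code — the field and function names (`UserCapital`, `is_eligible`) are mine, and the expiry term is the same pre-market estimate described above:

```python
from dataclasses import dataclass

@dataclass
class UserCapital:
    balance: float
    locked_in_open_positions: float
    margin_reserved: float
    releasing_from_expiries: float  # estimated pre-market; settlement is unknown

def projected_available(u: UserCapital) -> float:
    return (u.balance
            - u.locked_in_open_positions
            - u.margin_reserved
            + u.releasing_from_expiries)

def is_eligible(u: UserCapital, strategy_requirements: list[float]) -> bool:
    # All-or-nothing feasibility: the user must afford EVERY assigned
    # strategy for the week, or they trade none of them.
    return projected_available(u) >= sum(strategy_requirements)
```

The key point is the single comparison against the sum: there is no loop trying to fit a subset, because a subset is exactly the unhedged-position risk the constraint exists to prevent.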
Pass 2 — Capital Distribution (Round-Robin Scheduling)
For eligible users, capital is distributed across their assigned strategies using a pattern borrowed from round-robin scheduling — the same algorithm OS schedulers use to allocate CPU time across processes, adapted here for capital allocation across strategies with temporal constraints:
for user in eligible_users:
    # Sort by execution window so capital is committed in trade order
    strategies = sorted(user.assigned_strategies, key=lambda s: s.execution_window)
    remaining_capital = user.projected_available
    for strategy in strategies:
        required = strategy.compute_capital_requirement(user)
        if remaining_capital >= required:
            allocate(user, strategy, required)
            remaining_capital -= required
        else:
            # This shouldn't happen — Pass 1 already verified total affordability.
            # If it does, it's a bug. Log, alert, skip the user entirely.
            rollback_all_allocations(user)
            break

The round-robin ordering matters because strategies have different execution windows. A strategy that trades at 9:30 AM needs capital allocated before one that trades at 2:00 PM. The sort ensures capital is committed in execution order — same principle as priority scheduling with deadlines, where time-constrained jobs get resources first.
The entire job — Pass 1 + Pass 2 for 2,500 users — completes in under a minute. It's CPU-light; the work is mostly database reads (balances, positions, strategy configs) and in-memory computation. Writes are the allocation decisions themselves, persisted to the database and picked up by the execution pipeline when trading windows open.
The algorithm itself wasn't the hard part. The hard part was trusting it.
I Didn't Trust My Own Code
We were managing ₹300Cr+ in real transaction volume across 2,500+ accounts. I'd written the algorithm, tested it, reviewed it. But I couldn't bring myself to flip the switch and let it control real trades on day one.
So I built shadow mode — also known as dark launching or shadow traffic testing in the broader industry.
If you've worked with deployment strategies, you've probably seen blue-green deployments (two identical environments, traffic switches atomically), canary releases (route a small percentage of traffic to the new version), and feature flags (toggle behavior at runtime). Shadow mode sits in a different spot on this spectrum. Unlike canary, where the new code serves real traffic for a subset of users, shadow mode has zero production impact. The new algorithm processes real inputs but its outputs go nowhere — they're logged, not acted upon. The old algorithm remains fully in control.
The reason I chose shadow mode over a canary approach: this isn't a system where "5% of users get the new version" is safe. In a web app, if the canary serves a bad page to 5% of users, you roll back and those users refresh. Here, if the new allocation is wrong for even one user, they could end up with an unhedged options position. The cost of a single bad decision isn't a degraded experience — it's real financial loss. Shadow mode was the only deployment pattern where I could validate against real production data with zero risk.
The idea is simple: run the old algorithm and the new algorithm simultaneously. The old one controls real trades. The new one runs silently, logs every decision it would have made, and writes the output to a separate store. Every morning, I'd compare the two outputs manually.
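The comparison loop is simple enough to sketch. This is a minimal illustration under assumed names (`shadow_compare`, algorithms as plain callables) — in the real system the shadow output went to a separate store and the diff happened the next morning:

```python
def shadow_compare(old_algo, new_algo, users):
    """Run both algorithms over identical inputs. Only the old
    algorithm's output controls real trades; the new one's output
    is recorded so the two can be diffed, never executed."""
    live, shadow, divergences = {}, {}, []
    for user in users:
        live[user] = old_algo(user)      # drives production
        shadow[user] = new_algo(user)    # logged only
        if live[user] != shadow[user]:
            divergences.append(user)
    return live, shadow, divergences
```

Every divergence is a question to answer: is the new algorithm wrong, or is it correctly fixing something the old one got wrong? Both happened during our shadow run.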
We ran shadow mode for 2–3 weeks. And it caught things I wouldn't have found in testing.
The timing edge case: A user's balance was technically sufficient at the time the algorithm ran. But they had positions expiring later that same day. Depending on whether those positions settled in profit or loss, the balance could swing either way. The new algorithm was saying "eligible" based on a snapshot, but by the time trades actually executed, the numbers had shifted. Shadow mode surfaced this because I could compare "what the algorithm decided at 8 AM" against "what actually happened by 3:30 PM."
The rounding problem: Small-balance users near the eligibility threshold were flickering between eligible and ineligible across runs due to floating-point rounding in the capital calculation. Not a bug you'd catch in a unit test with clean numbers.
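The failure class is easy to reproduce in miniature. The numbers below are toy values, not our real thresholds, but they show how a threshold check can flip under float arithmetic while exact decimal arithmetic stays stable:

```python
from decimal import Decimal

# Float arithmetic: draining a 0.3 balance with three 0.1 debits does
# not land exactly on zero, so a `remaining >= 0` eligibility check flips.
remaining = 0.3 - 0.1 - 0.1 - 0.1    # a tiny negative number, not 0.0
eligible_float = remaining >= 0      # False: user wrongly marked ineligible

# Decimal keeps money arithmetic exact, so the check is stable across runs.
remaining_exact = Decimal("0.3") - Decimal("0.1") * 3
eligible_decimal = remaining_exact >= 0
```

The fix in our case was the standard one for money: do the capital math in exact decimal (or integer paise) rather than binary floats, so users near the threshold get the same answer on every run.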
Both of these would have caused real problems — either users trading when they shouldn't, or users being incorrectly blocked. Shadow mode was the safety net that let me find them without anyone's money on the line.
After 2–3 weeks, I gradually migrated users — effectively moving from shadow mode to a staged rollout, shifting user cohorts from the old algorithm to the new one. This is the standard canary progression, but only after shadow mode had already validated correctness.
The Estimation Disaster
Here's the part I'm less proud of.
This project was part of a larger rewrite — let's call it the Bundle Service v3. Original estimate: 1 month. One week for the tech spec, one week for development, one week for QA, one week for release.
It took 2.5 months.
Three things I didn't anticipate:
1. Cross-team blast radius
The allocation engine didn't exist in isolation. It touched the onboarding service, the PnL generation pipeline, the analytics system — roughly 5 dependent services owned by different people. I estimated as if I was only changing my service. I wasn't.
2. Legacy code tax
The existing system had deprecated data models, dead code paths, and flows that weren't idempotent. This was the real blocker for shadow mode. Idempotency — the property that running an operation multiple times produces the same result as running it once — is a fundamental requirement in distributed systems (you'll find it in any discussion of exactly-once semantics, Kafka message processing, or payment systems). In our case, the old allocation flow had side effects: it mutated shared state as it ran, so running it twice could produce different results or double-count allocations. You can't safely run old and new algorithms in parallel if the old one has side effects you don't fully understand. I had to make the existing flows idempotent first — adding idempotency keys, making operations safely re-runnable, ensuring that re-processing a user didn't create duplicate allocations — before I could even begin the shadow mode comparison. Cleaning this up took weeks I hadn't budgeted for.
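An idempotency-key guard is conceptually tiny; the work was retrofitting it onto flows that mutated shared state. A minimal sketch, with names of my own invention — in production the key lives behind a database unique constraint, not an in-memory set:

```python
class AllocationStore:
    """Illustrative idempotency guard: re-running the daily job for the
    same user and date must not double-count an allocation."""

    def __init__(self):
        self.seen_keys = set()   # stands in for a DB unique constraint
        self.ledger = []

    def apply_allocation(self, user_id: str, run_date: str, allocation: dict) -> bool:
        key = f"{user_id}:{run_date}"
        if key in self.seen_keys:
            # Already applied — a retry becomes a safe no-op.
            return False
        self.ledger.append((key, allocation))
        self.seen_keys.add(key)
        return True
```

Once every write path was keyed like this, running the old and new algorithms side by side stopped being dangerous: re-processing a user could no longer change the outcome.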
3. Shadow mode wasn't in the original plan
When I sat down and really thought about deploying to ₹300Cr+ in live money, I realized I couldn't skip the parallel validation phase. It was the right call — but it added significant time, and I hadn't scoped it upfront because I hadn't fully internalized the risk when I gave the initial estimate.
The 2.5x overrun was my estimation failure. I own that.
What I Changed After This
I stopped estimating before I understood the full blast radius. For any cross-service change now, the first step is mapping every impacted service and talking to its owner before committing to a timeline. This is where the tech spec practice came from — I started writing detailed design documents for every feature, getting alignment in review before writing a line of code.
I added a "legacy tax" to every estimate. In a fast-moving startup, there's always more legacy than you think. If a feature touches old code, I add buffer. Always.
I plan the rollout before the code. Shadow mode should have been in the original estimate. For anything touching money or critical paths, the deployment strategy is now part of the spec, estimated alongside development — not added as an afterthought.
I escalate early with data. When I realized at the one-week mark that the scope was bigger than expected, I went to my manager with specifics: here are all the services impacted, here are the touch points, here's why shadow mode is non-negotiable for a system handling this much money, and here's the revised timeline. Because I showed up with data instead of vague "it's taking longer," the extended timeline was understood — not a surprise.
The Patterns Behind the Design
Looking back, this project was a crash course in applying well-known engineering patterns to a domain where the cost of getting them wrong is measured in money, not error rates.
Constraint Satisfaction / 0/1 Knapsack — Pass 1 eligibility check. Framing the problem correctly — this is feasibility checking, not optimization. All-or-nothing means you reject early instead of trying to find the "best partial fit."
Round-Robin Scheduling — Pass 2 capital distribution. Borrowed from OS process scheduling. Temporal ordering (execution windows) maps directly to deadline-based priority scheduling.
Shadow / Dark Launch — Pre-production validation. Zero-risk validation against real data. Chosen over canary because even one bad allocation has unbounded financial cost.
Staged Rollout / Canary — Post-shadow migration. After shadow mode validated correctness, users were migrated in cohorts — the standard canary progression.
Idempotency — Legacy code cleanup. Required before shadow mode could work. Operations needed to be safely re-runnable — same principle behind idempotency keys in payment APIs.
None of these patterns are novel. They're textbook. But knowing when to apply which pattern, and understanding why shadow mode was necessary here but a canary might suffice elsewhere — that's the engineering judgment part that doesn't show up in a design patterns book.
The Outcome
The allocation engine shipped. Multi-portfolio investing went live across 2,500+ accounts. No user money was impacted during the rollout. Shadow mode caught real edge cases that would have caused problems. Churn came down.
And I learned that the scariest part of shipping critical software isn't the algorithm. It's the moment you realize your estimate was wrong, your scope is blowing up, and people's money is on the other side of your deploy button.
The algorithm was the easy part. Earning the right to trust it — that was the real engineering.
I'm a Senior Backend Engineer with 6+ years of experience building distributed systems in Go and Python. I write about the messy reality of backend engineering at mrinal.dev.