It looked like a bot swarm. It was one bad query plan.

At 9 PM the site went sideways: CPU 89%, SOS_SCHEDULER_YIELD topping the SQL wait stats, slow-request count climbing. Forty minutes in I was mentally drafting a blog post about a second bot swarm from a new operator, already assembling evidence to confirm a story I had not yet questioned.

There was no swarm. One cached query plan in IPFilterLog had been running a clustered scan on every request for hours, and I almost did not look for it because I was busy assembling the wrong narrative.

The first story was already pulling

Watch-Site caught it within a minute, and the headline counters matched a swarm: no blocking, no memory pressure, no disk wait, just CPU contention under load. Three days earlier I had spent a morning blocking five /19 ranges from a German proxy reseller that produced exactly this signature, and the agent session was still warm from that incident.

I opened a Claude Code session, dropped in the new incident bundle, and asked for a read. The response landed on traffic-driven load pattern, possibly a new swarm, recommend checking W3SVC2 and W3SVC9 logs across all three IIS sites. Right approach, wrong starting assumption. The prior session was still in context, so Claude had seen the proxy reseller incident, the same wait signature, the same CPU-to-firewall-rule pipeline, and it reasoned forward from that pattern before checking whether the pattern actually fit. I would have made the same call without a deliberate gate forcing the distribution check first.

I pulled the W3SVC2 and W3SVC9 logs for the past hour and grouped source IPs by /16 netblock. The distribution was normal: no block dominating the count, no roster-enumeration URL pattern, no UA homogeneity. Just regular traffic at elevated volume, the kind of spike you get when something slow becomes popular.

Not a swarm. The question shifted. Instead of who is attacking us, it became why is the database burning CPU on normal traffic.

What IPFilterLog was doing

Every request on the site runs a bot-filter check against IPFilterLog: a single SELECT to see whether the source IP is blocked or flagged. Simple lookup, costs nothing noticeable at normal traffic volumes. Scale it to several hundred requests per second and the plan it uses starts to matter.

I pulled sys.dm_exec_cached_plans joined against sys.dm_exec_query_stats and sorted by total worker time. One statement appeared twice in the plan cache under two different plan handles, same text, two very different execution profiles. The fast plan averaged a fraction of a millisecond with a handful of logical reads per execution. The slow plan averaged orders of magnitude more, thousands of logical reads, a clustered index scan. The slow plan was winning the majority of executions.

This is parameter sniffing. When SQL Server first compiled the statement, it built a plan optimized for the specific parameter value in use at compile time. That value had high cardinality in IPFilterLog, meaning lots of matching rows, so the optimizer chose a clustered index scan: efficient for large result sets, catastrophic when applied to IPs that match nothing at all. The plan got cached. Every subsequent IP lookup, including first-time IPs with zero filter history, was running a full clustered scan instead of a seek that would terminate in microseconds.

The box was not under attack. It was paying the cost of the wrong plan on every single request at scale.

The one-line fix

The immediate fix for parameter sniffing, when you can’t afford to wait for natural plan eviction, is OPTION (RECOMPILE). Force a fresh compilation on each execution using the actual runtime parameters. More CPU per individual execution, but a correct plan instead of a catastrophic one.

I opened SSMS, found the IPFilterLog lookup in the data access layer, added OPTION (RECOMPILE), pushed the change. Hotfix 3.350.16, tagged on master, cherry-picked to develop.

The post-deploy sample showed roughly double the request rate at half the CPU. The slow-request counter for the trailing 60 seconds came back to baseline within a minute.

The narrative gravity problem

Here is the part that bothers me. I knew the checklist for “is this a swarm” before I started the investigation. I had written it down explicitly after the German proxy reseller incident three days earlier: asset mix, UA homogeneity, Referer presence, traffic distribution. Those are the signals. CPU load and wait type are not.

The checklist was in agent memory. It was in a feedback file I wrote myself. I did not run it until forty minutes in, when I caught myself drafting the wrong blog post and asked Claude to check distribution instead of IPs.

Claude made the same error for the same reason. The agent is not immune to narrative gravity when the prior session is still in context and the surface signals match. Neither am I, even with the explicit countermeasure already written down and sitting in memory. The failure mode worth naming is not “I made a mistake.” It is “I had a narrative that fit the evidence well enough that I stopped generating alternatives, and the checklist that exists specifically to prevent this did not fire on its own.”

The checklist needs a trigger, not just a presence. Memory that does not surface itself at the moment of decision is decoration.

The open question I am carrying

OPTION (RECOMPILE) on a statement running thousands of times per minute means thousands of recompiles per minute. On modern hardware the per-compile overhead is small but not zero, and it is not the right long-term answer. The correct fix is probably a plan guide or a targeted filtered index that lets the optimizer choose seek vs. scan based on actual row distribution rather than compile-time assumptions. I have not built that yet. HQ-90 carries the followup; the recompile hint is a one-line hold so the incident does not recur while I figure out what the right index shape is.

I am labeling it as a temporary fix on a list of temporary fixes, which is fine. The site is running. The right answer does not need to ship today for the current fix to be correct.

What does need to ship is the trigger problem. A checklist that lives in memory but only fires when I remember to invoke it has the same shape as the IPFilterLog plan that picked itself once and then ran for hours: correct on paper, wrong in production, invisible until somebody pulls the cache and looks.

It looked like a bot swarm. It was one bad query plan.

The first story was already pulling

What IPFilterLog was doing

The one-line fix

The narrative gravity problem

The open question I am carrying

Read next

Five firewall rules beat the proxy swarm, but the real bug was my blind spot

The exit survey that overwrote a live mailer

A 25-second admin screen usually means you trusted the wrong query

Follow the build