Triage
July 23, 2024

Code Reviews Are Not Effective At Finding Bugs

by Mark Greene


Intro

One of the biggest time sinks in software development is code review and approval. As such, it is fair to examine the practice and challenge long-held beliefs about its merits. Unfortunately, review and approval have become conflated, and PR approval is a primary driver of the delays we typically see at this step.

Microsoft examined their own teams and found two eye-opening outcomes. First, the yield on finding bugs in reviews was low: only about 15% of code review comments related to defects, despite bug finding being the top motivation for review. Second, the social aspects of code review, including team hierarchy and reviewer experience, can significantly influence the outcome of a review.

Building upon Microsoft’s analysis, Shepherdly is releasing similar insights, but through the lens of objective risk as measured by its risk score. At a high level, our data confirms that code review is not as effective at finding bugs as we would like it to be, despite additional effort from engineers. This is not to say that code reviews should cease. Instead, this post advocates that code review be decoupled from approval, with approval governed by an objective risk measure.

Evolving code review in this direction would allow teams to bifurcate their approval policy: low-risk changes would not need to sit and wait for an ‘LGTM’ review, while the riskiest PRs would receive concentrated resilience and mitigation effort.

Engineers get less busywork, such as reviewing inconsequential PRs, and less context switching. Managers get observability into risk trade-offs and the precision to dial in the risk-and-velocity balance that makes sense for their teams.

About the data

Shepherdly analyzed ~75k PRs across 145 private and public repos, with contributions from 6,355 authors. Some of these public repos are prominent and are covered in more detailed analysis here.

The risk score measures, on a scale of 0-100 (0 being least likely), the likelihood of a change introducing a bug. While the score is normalized across a diverse set of repositories, a model was constructed for each team in order to capture the project-specific nuances that lead to, and prevent, bugs.

Scores ranging from 0-39 are deemed low risk, 40-59 moderate risk, and 60-100 high risk.
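To make the banding concrete, here is a minimal sketch in Python of the bucketing rule described above; the risk_band helper is purely illustrative and not part of Shepherdly’s API:

    def risk_band(score: int) -> str:
        """Map a 0-100 risk score to the bands used in this post."""
        if not 0 <= score <= 100:
            raise ValueError("score must be between 0 and 100")
        if score <= 39:
            return "low"       # 0-39: low risk
        if score <= 59:
            return "moderate"  # 40-59: moderate risk
        return "high"          # 60-100: high risk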

Most PRs are low-risk, still slow to review

54% of PRs are scored between 0-39, an indicator of low probability of introducing a bug.

Even so, this cohort of low-risk PRs has a cycle time of nearly 2 days.

Furthermore, this cohort receives the least amount of review activity. 

The chart above lends credibility to the idea that developers sense the innate risk of a change and merely go through the motions of a one-size-fits-all policy, providing some token interaction before approving the change.

While low-risk, this cohort still produces bugs, of course. However, it is responsible for only 17% of all bugs created.

Risky changes get more attention, still responsible for nearly half the bugs

The riskiest changes (scores of 70+) do in fact receive more review scrutiny, attracting 4.5x more comments than the low-risk cohort.

As you might expect, cycle time is 2x longer at this end of the risk spectrum, which is a good thing. However, despite this concentration of effort and time, this cohort is responsible for a disproportionate share of bugs: 44% of all bugs created.

Rethinking code review

The goal is not to abandon code reviews but to enhance them with smarter practices that reflect the true risk and complexity of modern software development. 

The proposal is to decouple review from approval, but to do so using an objective risk measure based on the predictors relevant to your project.

Building on Microsoft’s findings, this post argues there is little to be gained, from a bug perspective, by holding up low-risk changes for days just to get a rubber stamp. Code review should still happen here, just independently of approval.
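As a hypothetical sketch of what a risk-gated approval policy could look like, reusing the score bands above; get_risk_score, or whatever exposes the score to your CI or merge queue, is an assumed stand-in rather than a real API:

    LOW_RISK_CEILING = 39   # 0-39: merge without waiting for approval
    HIGH_RISK_FLOOR = 60    # 60+: extra scrutiny and mitigation work

    def approvals_required(risk_score: int) -> int:
        """Return how many blocking approvals a PR needs under this policy."""
        if risk_score <= LOW_RISK_CEILING:
            return 0   # review happens asynchronously; the merge is not blocked
        if risk_score < HIGH_RISK_FLOOR:
            return 1   # moderate risk: one approving review
        return 2       # high risk: multiple approvals plus mitigation work

The exact thresholds and approval counts are illustrative; the point is that the blocking step is driven by measured risk rather than applied uniformly.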

For high-risk changes, this post recommends expanding the focus to resiliency. We cannot stop all bugs, but we can dramatically constrain the blast radius. Observability, feature flags, and phased rollouts are the mitigations of choice.
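For illustration only, a minimal sketch of a staged rollout gate for such a change; the flag and error-rate plumbing is assumed to come from your own feature-flag and observability tooling:

    ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of traffic per stage
    ERROR_BUDGET = 0.002                      # abort if the error rate exceeds this

    def next_stage(current_fraction: float, observed_error_rate: float) -> float:
        """Advance a phased rollout one stage, or kill it if the error budget is blown."""
        if observed_error_rate > ERROR_BUDGET:
            return 0.0  # kill switch: disable the change entirely
        for stage in ROLLOUT_STAGES:
            if stage > current_fraction:
                return stage
        return current_fraction  # already fully rolled out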

An objective risk measure provides statistical justification for engineers to speed up or slow down in the areas that produce the best outcomes for their organization.
