Intro
One of the biggest time sinks in software development is code review and approval. As such, it’s fair to examine it and challenge long-held beliefs about its merits. Unfortunately, review and approval have been conflated, and PR approval is a primary driver of the delays we typically see at this step.
Microsoft examined their own teams and found two eye-opening outcomes. First, the yield on finding bugs was low: only about 15% of code review comments related to defects, despite bug-finding being the top motivation for review. Second, the social aspects of code review, including team hierarchy and reviewer experience, can significantly influence the outcome of the review.
Building upon Microsoft’s analysis, Shepherdly is releasing similar insights, but through the lens of objective risk as measured by its risk score. At a high level, our data confirms that code review is not as effective at finding bugs as we’d like it to be, despite additional effort from engineers. This is not to say that code reviews should cease. Instead, this post advocates decoupling code review from approval, with approval governed by an objective risk measure.
Evolving code review to this state would allow teams to bifurcate their approval policy: low-risk changes would not need to sit and wait for an ‘LGTM’ review, while the riskiest PRs would receive concentrated resilience and mitigation effort.
Engineers get less busy work, such as reviewing inconsequential PRs, and less context switching. Managers get observability into risk trade-offs and the precision to dial in the risk and velocity balance that makes sense for their teams.
About the data
Shepherdly analyzed ~75k PRs across 145 private and public repos, with contributions from 6,355 authors. Some of these public repos are prominent and are covered in more detailed analysis here.
The risk score measures the likelihood of a change introducing a bug on a scale of 0-100 (0 being the least likely). While the score is normalized across a diverse set of repositories, a model was constructed for each team in order to capture the project-specific nuances that lead to and prevent bugs.
Scores ranging from 0-39 are deemed low risk, 40-59 moderate risk, and 60-100 high risk.
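As a minimal sketch, those bands translate directly into a threshold mapping. The function below is purely illustrative and not Shepherdly’s implementation:

```python
def risk_band(score: int) -> str:
    """Map a 0-100 risk score to the bands used in this analysis."""
    if not 0 <= score <= 100:
        raise ValueError("risk score must be between 0 and 100")
    if score <= 39:
        return "low"       # 0-39: low probability of introducing a bug
    if score <= 59:
        return "moderate"  # 40-59
    return "high"          # 60-100
```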
Most PRs are low-risk, still slow to review
54% of PRs score between 0-39, indicating a low probability of introducing a bug.
Despite the low risk, this cohort’s cycle time is nearly 2 days.
Furthermore, this cohort receives the least amount of review activity.
The chart above lends credibility to the idea that developers are aware of the innate risk and simply go through the motions of a one-size-fits-all policy, providing some token interaction before approving the change.
While low-risk, this cohort still produces bugs, of course. However, it is responsible for only 17% of all bugs created.
Risky changes get more attention, still responsible for nearly half the bugs
The riskiest changes (Scores of 70+) do in fact receive more review scrutiny, resulting in 4.5x more comments than the low-risk cohort.
As you may expect, cycle time is 2x longer at this end of the risk spectrum, which is a good thing. However, despite this concentration of effort and time, this cohort is responsible for a disproportionate share of bugs: 44%.
Rethinking code review
The goal is not to abandon code reviews but to enhance them with smarter practices that reflect the true risk and complexity of modern software development.
The proposal is to decouple review from approval, and to do so with an objective risk measure based on the predictors relevant to your project.
Building on Microsoft’s findings, there is little to be gained from a bug-prevention perspective by holding up low-risk changes for days just to get a rubber stamp. Code review should still happen here, just independent of approval.
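To make the decoupling concrete, here is a minimal sketch of what a bifurcated merge policy could look like. The thresholds mirror the bands above; the function name and return shape are assumptions for illustration, not Shepherdly’s API or any particular CI system:

```python
LOW_RISK_THRESHOLD = 40   # assumption: start of the moderate band above
HIGH_RISK_THRESHOLD = 60  # assumption: start of the high band above

def merge_requirements(risk_score: int) -> dict:
    """Illustrative policy: decide what a PR needs before merging.

    Low-risk PRs merge without waiting on approval (review still happens,
    just asynchronously); moderate-risk PRs need an approval; high-risk
    PRs need approval plus evidence of mitigations.
    """
    if risk_score < LOW_RISK_THRESHOLD:
        return {"requires_approval": False, "requires_mitigations": False}
    if risk_score < HIGH_RISK_THRESHOLD:
        return {"requires_approval": True, "requires_mitigations": False}
    return {"requires_approval": True, "requires_mitigations": True}
```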
For high-risk changes, this post recommends expanding the focus to resiliency. We can’t stop all bugs, but we can dramatically constrain the blast radius. Observability, feature flags, and phased rollouts are the mitigations of choice.
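As a hedged example of constraining blast radius, the sketch below shows one common phased-rollout pattern: deterministically bucket users by hashing, expose the change to a small percentage first, and widen it as confidence grows. The hashing scheme and names are assumptions, not tied to any specific feature-flag product:

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a phased rollout.

    Hashing user_id together with the feature name gives each user a
    stable bucket in [0, 100), so raising rollout_percent only ever
    adds users to the rollout.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Example: expose a risky change to 5% of users initially.
use_new_path = in_rollout("user-123", "new-billing-path", rollout_percent=5)
```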
An objective risk measure provides statistical justification for engineers to speed up or slow down in the areas that maximize the best outcomes for their organization.