In large software projects, measuring developer productivity and impact is hard. Traditionally, companies use metrics like “lines of code written” or “number of commits” to gauge a developer’s output. But any engineer knows those can be very misleading: 100 lines of well-thought-out code that fixes a core bug is far more valuable than 1000 lines of trivial changes. Also, these simple metrics can unfairly favor certain kinds of work (or workers) and create bad incentives. Recognizing these issues, our team set out to design a better metric, something that is fair, context-aware, and actually correlates with meaningful impact. This led to the Fair Developer Score (FDS).
What is the Fair Developer Score?
FDS is a composite metric that evaluates developers on two main dimensions:
- Effort – how much did the developer contribute, in terms of actual code change and complexity? This isn’t just lines of code; it considers things like how many files were involved, how central those files are to the architecture, how novel the changes were, and the developer’s ownership of the code (did they write most of the code being changed?).
- Importance – how significant was the contribution to the project or product? For example, did the change affect a crucial component of the system? Was it a large-scale change or a minor fix? Did it address a high-priority issue or a new feature? These factors raise the importance of the work done.
The Fair Developer Score for a person is essentially Effort × Importance, aggregated across all their contributions (we focus on code commits). The idea is that a truly high score comes from doing significant work that has significant impact - capturing the notion of “effort aligned with organizational value.”
How It Works (Under the Hood)
We analyzed the problem by looking at commit histories from version control (like Git). The first challenge was to group commits into meaningful units of work. Developers might commit code in many small chunks that actually belong to one logical task or “build”. We applied an algorithm called Torque Clustering to automatically cluster commits that are related (it looks at the time gap between commits, the files changed, and the author). Each cluster of commits is called a “build” - which approximates a feature, bug fix, or task the developer worked on.
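We used Torque Clustering for this step; the sketch below is not that algorithm, just a greedy stand-in that conveys the grouping idea using author, time gap, and file overlap. The `Commit` type and the `MAX_GAP` threshold are illustrative assumptions, not part of the real pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Commit:
    sha: str
    author: str
    timestamp: datetime
    files: set[str]

# Hypothetical threshold; the real pipeline uses Torque Clustering,
# which does not rely on a manually chosen gap.
MAX_GAP = timedelta(hours=12)

def group_into_builds(commits: list[Commit]) -> list[list[Commit]]:
    """Greedy stand-in for commit clustering: same author, close in time,
    or touching overlapping files -> same build."""
    builds: list[list[Commit]] = []
    for c in sorted(commits, key=lambda c: c.timestamp):
        placed = False
        for build in builds:
            last = build[-1]
            same_author = last.author == c.author
            close_in_time = c.timestamp - last.timestamp <= MAX_GAP
            shares_files = bool(c.files & last.files)
            if same_author and (close_in_time or shares_files):
                build.append(c)
                placed = True
                break
        if not placed:
            builds.append([c])
    return builds
```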
For each build, we calculate:
- An Effort score for each developer involved in that build. If two developers worked on it, each gets a score proportional to what they contributed. Effort considers the following (a simplified scoring sketch for both dimensions follows this list):
- Code scale: how large the change was (lines of code, number of edits - log-scaled so it’s not just raw lines).
- Reach: how broad the change was (files and directories touched; making a change in 10 files is bigger reach than in 1 file).
- Centrality: using a PageRank on the project’s file-dependency graph to see how central the changed files are - changing a core library file counts more than a peripheral script.
- Ownership: if you wrote the file originally or have been a major contributor, your edits weigh a bit more (deep knowledge of the code counts for more than a drive-by tweak).
- Novelty: creating a brand new module or file gets credit for new functionality.
- Speed: not heavily weighted, but finishing a build faster (in tight commit sequence) can indicate focus and efficiency.
- An Importance score for the build itself. This doesn’t depend on who did the work, but on how valuable the build is to the project. We look at:
- Scope: total lines of code changed (a proxy for scale of change) and the distribution of changes (many files vs one concentrated area).
- Architectural impact: a centrality measure similar to the one used for Effort - changes to critical parts of the system raise importance.
- Complexity: did this build touch many different components (which can introduce complexity)?
- Task priority: when issue-tracker links or commit messages are available, we detect whether the build was a hotfix, a major feature, or a routine update. (We trained a simple classifier to flag high-priority vs. low-priority work from commit text.)
- Release proximity: work done right before a major release or deadline might be more impactful. For example, finishing a feature in time for a big release is crucial, whereas a similar change far from any release might be less urgent.
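To make the two dimensions concrete, here is a simplified sketch of how per-build Effort and Importance could be scored. The weights, the input schema, and the keyword-based priority flag are illustrative placeholders rather than the exact model; the centrality term is a PageRank over the file-dependency graph, as described above.

```python
import math
import networkx as nx

# Illustrative weights only; the production model's weights differ.
EFFORT_W = dict(scale=0.3, reach=0.2, centrality=0.2,
                ownership=0.15, novelty=0.1, speed=0.05)
IMPORTANCE_W = dict(scope=0.3, arch=0.3, complexity=0.15,
                    priority=0.15, release=0.1)

def file_centrality(dep_graph: nx.DiGraph) -> dict:
    """PageRank over the file-dependency graph: core files score higher."""
    return nx.pagerank(dep_graph)

def effort_score(dev_changes: dict, centrality: dict) -> float:
    """dev_changes: one developer's share of a build (hypothetical schema)."""
    scale = math.log1p(dev_changes["lines_changed"])        # log-scaled churn
    reach = len(dev_changes["files"]) + len(dev_changes["dirs"])
    cent = sum(centrality.get(f, 0.0) for f in dev_changes["files"])
    ownership = dev_changes["prior_ownership_share"]        # 0..1
    novelty = 1.0 if dev_changes["new_files"] else 0.0
    speed = 1.0 / (1.0 + dev_changes["duration_hours"])     # lightly weighted
    feats = dict(scale=scale, reach=reach, centrality=cent,
                 ownership=ownership, novelty=novelty, speed=speed)
    return sum(EFFORT_W[k] * v for k, v in feats.items())

# Crude keyword stand-in for the trained priority classifier mentioned above.
HOTFIX_WORDS = ("hotfix", "urgent", "cve", "regression")

def importance_score(build: dict, centrality: dict) -> float:
    """build: one clustered unit of work (hypothetical schema)."""
    scope = math.log1p(build["total_lines_changed"])
    arch = sum(centrality.get(f, 0.0) for f in build["files"])
    complexity = len(build["components_touched"])
    priority = 1.0 if any(w in build["message"].lower() for w in HOTFIX_WORDS) else 0.5
    release = 1.0 if build["days_to_next_release"] <= 14 else 0.5
    feats = dict(scope=scope, arch=arch, complexity=complexity,
                 priority=priority, release=release)
    return sum(IMPORTANCE_W[k] * v for k, v in feats.items())
```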
Finally, for each developer, we aggregate their Effort × Importance for all the builds they participated in during a time window (say a quarter). That gives their Fair Developer Score for that period.
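In code, that aggregation is a straightforward sum of Effort × Importance over a developer’s builds in the window; a minimal sketch, with field names that are ours:

```python
from collections import defaultdict

def fair_developer_score(builds: list, window) -> dict:
    """builds: records carrying per-developer effort and a build importance.
    window: (start, end) datetimes bounding the reporting period."""
    start, end = window
    fds = defaultdict(float)
    for b in builds:
        if not (start <= b["completed_at"] <= end):
            continue
        for dev, effort in b["effort_by_dev"].items():
            fds[dev] += effort * b["importance"]   # Effort x Importance, summed
    return dict(fds)
```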
Validation: Linux Kernel Case Study
We first validated FDS on the Linux kernel, one of the largest and longest-running open-source projects. Over a 974-day window, we computed FDS for 339 contributors.
We identified the top decile by FDS and compared them to a same-sized group of top commit-count developers. To make this comparison fair, we performed one-to-one matching using the Hungarian algorithm on total churn, total files changed, and the number of unique builds. Both groups had essentially the same volume exposure.
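As a sketch of that matching step, SciPy’s `linear_sum_assignment` provides the Hungarian-style optimal assignment; the features are the three named above, while the joint standardization and Euclidean cost below are our assumptions about a reasonable setup.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_groups(fds_feats: np.ndarray, cc_feats: np.ndarray):
    """Each row is one developer's (total_churn, files_changed, unique_builds).
    Returns index pairs (fds_i, cc_j) minimizing total feature distance."""
    # Standardize features jointly so no single scale dominates the cost.
    both = np.vstack([fds_feats, cc_feats])
    z = (both - both.mean(axis=0)) / (both.std(axis=0) + 1e-9)
    zf, zc = z[: len(fds_feats)], z[len(fds_feats):]
    # Pairwise Euclidean distances form the assignment cost matrix.
    cost = np.linalg.norm(zf[:, None, :] - zc[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))
```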
Results:
- FDS-ranked developers showed higher Average Importance and higher Average Effort than their commit-count peers
- For the same amount of raw work, FDS surfaces people doing more impactful and substantive work
- We observed a lower short-interval rework rate (fewer cases where a developer revisits the same directory within 48 hours) for FDS-ranked developers, consistent with more sustainable, less churn-prone contributions
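For concreteness, here is a minimal sketch of how that short-interval rework rate can be computed; the 48-hour window comes from the description above, while the event schema is assumed.

```python
from datetime import timedelta
from os.path import dirname

def short_interval_rework_rate(commits, window=timedelta(hours=48)):
    """commits: one developer's list of (timestamp, files), sorted by time.
    Counts commits that revisit a directory the same developer touched
    within the preceding `window`, divided by total commits."""
    revisits, total = 0, 0
    recent = []  # (timestamp, set_of_dirs) still inside the window
    for ts, files in commits:
        dirs = {dirname(f) for f in files}
        recent = [(t, d) for t, d in recent if ts - t <= window]
        if any(dirs & d for _, d in recent):
            revisits += 1
        recent.append((ts, dirs))
        total += 1
    return revisits / total if total else 0.0
```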
Cross-Repository Evaluation
To test generalizability, we applied FDS to four additional major projects: Kubernetes, TensorFlow, Apache Kafka, and PostgreSQL.
Effort Signal: Robust Across Projects
In all five repositories, top-FDS developers had higher Effort than volume-matched commit-count peers. Effort advantages were statistically significant in 4 of 5 projects. The Effort model is relatively universal.
Effort’s resilience stems from how it weights structural reach and ownership. Touching core scheduler files in Kubernetes, for example, has the same “centrality + reach” footprint as editing Kafka’s replication manager. That structural symmetry made Effort portable, and it means FDS can be rolled out without weeks of reconfiguration.
Importance Signal: Context-Sensitive
The Importance model showed more variation across projects, which actually validates its design. Different projects have different definitions of what constitutes "important" work, and our results reflect that:
- Significant positive lifts in Linux, Kubernetes, and Kafka, where FDS-ranked developers work on systematically higher-importance builds
- TensorFlow: Average Importance was slightly lower for FDS-ranked developers, likely reflecting heavy use of automation, generated code, and large mechanical changes that FDS's Importance model downweights
- PostgreSQL: Small, non-significant differences (smaller contributor set)
This pattern is exactly what we'd expect: the Effort model is relatively universal, while the Importance model should be tuned to each project's workflow, release cadence, and automation profile.
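Purely as an illustration of what such tuning could look like (these profiles and values are our assumptions, not measured configurations), the Importance weights could be carried in a per-project profile:

```python
# Hypothetical per-project Importance configuration; all values are illustrative.
IMPORTANCE_PROFILES = {
    "linux":      dict(scope=0.3, arch=0.35, complexity=0.15, priority=0.15, release=0.05),
    "kubernetes": dict(scope=0.25, arch=0.3, complexity=0.15, priority=0.15, release=0.15),
    # Projects with heavy automation might also downweight mechanical changes.
    "tensorflow": dict(scope=0.15, arch=0.3, complexity=0.15, priority=0.2, release=0.2,
                       generated_code_penalty=0.5),
}

def importance_weights(project: str) -> dict:
    # Fall back to a neutral profile for projects that have not been tuned yet.
    default = dict(scope=0.3, arch=0.3, complexity=0.15, priority=0.15, release=0.1)
    return IMPORTANCE_PROFILES.get(project, default)
```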
Why Fair Developer Score?
The goal was to bring fairness and context into productivity metrics:
- Fairness: Developers who take on tough, important work should get credit, even if that doesn’t translate to raw line counts. Meanwhile, those doing lots of trivial updates shouldn’t artificially appear to be the top just by volume. FDS helps mitigate biases like favoring quantity over quality.
- Holistic view: It encourages a culture where both effort (hard work, complex coding tasks) and impact (doing what the team/project really needs) are valued. This can guide better behavior: e.g., refactoring a critical module might be more valued than adding a superficial feature, even if the latter adds more lines of code.
- Actionable insights: Managers or leads could use FDS to identify unsung heroes (someone with slightly fewer commits but very high importance work) or to notice if someone is doing a lot of work that doesn’t translate into impact (maybe they’re stuck in toil tasks, which could flag a process problem).
Challenges & What’s Next
No metric is perfect. We acknowledge several limitations:
- FDS currently looks only at Git commits. It doesn’t directly account for code review feedback, design work, mentoring, or other “invisible” contributions. Those are super important too! Future iterations might integrate data from code reviews or design docs to round out the picture.
- There’s a risk with any metric: if used improperly, people might try to “game” it. For example, if someone knows the formula, they might try to bundle commits in certain ways. We propose FDS as a helpful diagnostic tool, not as a strict KPI to reward/punish developers blindly. It should always be used alongside human judgment.
- We want to test FDS in more environments: we’ve done open-source, but what about a closed-source corporate repo? Or a smaller team project? The thresholds for what’s “important” might differ. Part of future work is making the model adaptive to different contexts (maybe via some tuning parameters or training on historical project data).
Conclusion
Fair Developer Score is our attempt to move beyond naive metrics and capture a more meaningful picture of engineering productivity. By focusing on what was done and why it matters, not just how much, we aim to recognize the developers who truly drive a project forward. Our initial results are promising; FDS aligns with intuitive assessments of impact and filters out noise. We hope this framework can spark conversations in both industry and academia about better metrics for software engineering work. Ultimately, the goal is to help teams celebrate the right kind of contributions and guide improvement in a positive way.
Want the full presentation? The complete ASE 2025 slide deck is embedded below.