Statistical Misuse of Ordinal Scales: The Mathematical and Ethical Flaws of Averaging Planning Poker Metrics
Introduction
In Agile software development, Planning Poker story points are widely used to estimate the size and complexity of work items. Story points sit on an ordinal scale: a ranking in which the relative order of items matters, but the exact differences between them do not. Despite this, it’s common practice to calculate averages, run regressions, and otherwise apply standard arithmetic to such data. This statistical misuse isn’t just a technical mistake; it has real-world consequences for decision-making and can amount to ethical misrepresentation. In this blog post, we examine the nature of ordinal data, why treating it as interval data is problematic, and the ethical implications for teams and organizations. We also provide guidance to help avoid these pitfalls, and we close with a question for readers to reflect on their own experiences.
Understanding Ordinal Scales in Agile Contexts
What Is an Ordinal Scale?
An ordinal scale ranks items or outcomes according to some criterion without specifying the degree of difference between them. Restaurant ratings (poor, fair, good, excellent) and pain scales (mild, moderate, severe) are both ordinal. In Agile, Planning Poker uses a sequence of numbers (often Fibonacci: 1, 2, 3, 5, 8, 13, etc.) to estimate effort, but the numeric gaps between these cards do not represent consistent or measurable amounts of effort.
Why Do Teams Use Ordinal Scales?
Ordinal scales like Planning Poker sequences are practical for group estimation, helping to drive consensus and discussion. They acknowledge the uncertainty and subjectivity inherent in software estimation, allowing teams to quickly rank work items from smallest to largest without worrying about precise measurement.
Statistical Misuse: Averages and Regressions on Ordinal Data
The Mathematics of Ordinal Data
Ordinal data tells us only the order of items, not the magnitude of the differences between them. For example, the difference in effort between a 2-point and a 3-point story is not necessarily one third of the difference between a 5-point and an 8-point story, even though the card values suggest it. Treating the numeric gaps as real distances (like marks on a ruler) assumes a property that ordinal data does not have.
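One way to see why the gaps matter is to relabel the cards in a way that keeps their order and then recompute the statistics. The sketch below is illustrative only: the two sprints, the `rank_label` mapping, and the `relabel` helper are made-up assumptions, not data from a real team. If only the order carries information, both labelings are equally legitimate, yet the comparison of means flips while the comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical data: stories from two sprints, estimated with a Fibonacci deck.
sprint_a = [1, 1, 13]
sprint_b = [3, 5, 5]

# An order-preserving relabeling of the same deck: each card is replaced
# by its position in the deck. The ranking of stories is unchanged.
rank_label = {1: 1, 2: 2, 3: 3, 5: 4, 8: 5, 13: 6}


def relabel(stories):
    """Map each card value to its position in the deck."""
    return [rank_label[s] for s in stories]


# Means under the original labels and under the relabeled scale.
fib_mean_a, fib_mean_b = mean(sprint_a), mean(sprint_b)
rank_mean_a, rank_mean_b = mean(relabel(sprint_a)), mean(relabel(sprint_b))

# Medians under both labelings.
fib_median_a, fib_median_b = median(sprint_a), median(sprint_b)
rank_median_a, rank_median_b = median(relabel(sprint_a)), median(relabel(sprint_b))

print("Mean says A is larger (Fibonacci labels):", fib_mean_a > fib_mean_b)
print("Mean says A is larger (rank labels):     ", rank_mean_a > rank_mean_b)
print("Median says A is larger (Fibonacci):     ", fib_median_a > fib_median_b)
print("Median says A is larger (rank labels):   ", rank_median_a > rank_median_b)
```

With these made-up numbers, the mean ranks sprint A above sprint B under one labeling and below it under the other, while the median gives the same answer both times.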
The Flaws of Mathematical Averages
Even so, many teams and organizations calculate the average story point value for a sprint, or the average velocity across sprints. They may even run regressions to forecast future delivery. However, calculating averages or performing other arithmetic on ordinal data is mathematically unsound because:
- The intervals between points are not consistent or meaningful.
- The results can be misleading, producing averages that do not correspond to any real scenario (e.g., an average story size of 4.2 points).
- It gives a false sense of precision and objectivity.
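The "4.2 points" example can be reproduced with a handful of estimates. This is a minimal sketch with made-up numbers; `DECK` and `estimates` are assumptions chosen for illustration. It shows that the arithmetic mean lands on a value that no card in the deck represents.

```python
from statistics import mean

# Assumed deck and hypothetical estimates for five stories.
DECK = [1, 2, 3, 5, 8, 13]
estimates = [2, 3, 3, 5, 8]

average = mean(estimates)   # (2 + 3 + 3 + 5 + 8) / 5
on_deck = average in DECK   # is the result a value anyone could have voted?

print(f"mean = {average}, corresponds to a card: {on_deck}")
```

No story in this sprint was estimated at the resulting value, and no story could have been, so the decimal places suggest a precision that the estimation process never had.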
Some organizations take it further, applying regression analysis or more complex statistical models to ordinal data. These methods assume interval or ratio-level data, where arithmetic operations are valid. Applied to ordinal metrics, they produce results that are spurious at best and that drive misguided decisions at worst.
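The same sensitivity to labeling shows up in correlation, the building block of linear regression. The sketch below uses hypothetical story points and hypothetical delivery times, and it swaps in Spearman's rank correlation as one example of a measure that depends only on order; that choice is an illustration, not a method prescribed by this post. The helpers `pearson`, `average_ranks`, and `spearman` are written out by hand so the example needs only the standard library.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation: uses the numeric distances between values."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1   # first position of v in sorted order
        count = ordered.count(v)       # number of ties for v
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: story points and days actually taken.
points = [1, 2, 3, 5, 8, 13]
days = [1, 2, 2, 4, 5, 6]

# Order-preserving relabeling of the same estimates (card -> deck position).
relabeled = [1, 2, 3, 4, 5, 6]

print("Pearson, Fibonacci labels:", round(pearson(points, days), 4))
print("Pearson, rank labels:     ", round(pearson(relabeled, days), 4))
print("Spearman, Fibonacci labels:", round(spearman(points, days), 4))
print("Spearman, rank labels:     ", round(spearman(relabeled, days), 4))
```

The Pearson coefficient changes when the cards are relabeled, while the rank-based coefficient stays the same, because it never looks at the gaps between card values.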
Real-World Consequences of Statistical Misuse
Poor Decision-Making
Relying on mathematically flawed averages or projections can lead to poor planning, unrealistic commitments, and, ultimately, failed projects. Teams may be pushed to deliver "average" story sizes that are not grounded in reality, or pressured to meet forecasted velocities that have no statistical validity.
Erosion of Trust
When stakeholders realize that the numbers don’t add up—or worse, when projects fail due to flawed metrics—trust in the estimation process and in leadership breaks down.
Ethical Implications
Misrepresenting ordinal metrics as if they were interval or ratio data is more than just a technical error; it’s an ethical lapse. It can:
- Deceive stakeholders about team performance or project predictability.
- Lead to unfair evaluations of teams or individuals based on invalid data.
- Undermine psychological safety, as teams feel pressured to "hit the numbers."
Best Practices: Using Ordinal Metrics Responsibly
- Recognize the Limits: Treat story points and other ordinal metrics as relative rankings, not precise measurements.
- Avoid Arithmetic Operations: Don’t calculate averages or run regressions on ordinal data. Instead, look at frequency counts, medians, or modes.
- Educate Stakeholders: Ensure that everyone understands what ordinal metrics mean and how they should (and should not) be used.
- Report with Integrity: Be transparent about the limitations of your data and the methods used to analyze it.
- Focus on Conversation: Use ordinal metrics to drive discussion and consensus, not to produce misleading statistics.
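The order-respecting summaries mentioned in the list above (frequency counts, medians, and modes) are all available in Python's standard library. The sketch below uses a made-up list of estimates; `estimates` is an assumption for illustration. It uses `statistics.median_low`, which always returns a value that actually occurs in the data rather than interpolating between two cards.

```python
from collections import Counter
from statistics import median_low, mode

# Hypothetical estimates for the stories in one sprint.
estimates = [2, 3, 3, 5, 8, 3, 5, 13, 1, 3]

# Frequency count: how many stories received each card.
counts = Counter(estimates)

# Median (low variant) and mode: both are values from the deck itself.
typical_median = median_low(estimates)
typical_mode = mode(estimates)

for card in sorted(counts):
    print(f"{card:>2} points: {counts[card]} stories")
print("median card:", typical_median)
print("most common card:", typical_mode)
```

A report built from these figures says things like "most stories were 3s, with one 13", which describes the distribution without implying that the card values can be added or divided.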
Conclusion
Ordinal metrics like Planning Poker story points have value when used as intended: to facilitate team discussion and consensus. But applying standard mathematical operations to these numbers is both mathematically invalid and ethically questionable. By respecting the true nature of ordinal data and reporting it with integrity, teams and organizations can avoid misleading themselves and their stakeholders, making better decisions and building greater trust.
Question for Readers:
Have you encountered situations where averages or advanced analytics were applied to ordinal metrics like story points or Planning Poker estimates? How did it affect planning, transparency, or trust in your teams?
Share your experiences and insights below.



