Dr. Wim Westera is a physicist and educational technologist. In his role as Head of Educational Implementation at the Educational Technology Expertise Center of the Open University of the Netherlands, he combines service, educational media development and innovation. He leads a group of 70 educational designers, media specialists and IT developers. He is also a master athlete and a racing cyclist.
The author argues that the current IAAF decathlon scoring tables display
unacceptable bias as they favour some events over others. Performances in the
sprints are rewarded disproportionately compared with those in the throwing events and the
1500m. Moreover, the system is intrinsically unstable and tends to increase the
differences between disciplines over the course of time. This paper investigates
alternative scoring methods. It elaborates a well-grounded procedure to express
the performance scales of the events in a normalised form in order to allow
comparisons. Three alternative scoring models are developed as candidates for
replacing the existing model. These are based on 1) a power law description, 2)
a parabolic description and 3) an exponential description, respectively. The
proposed methods are uniform over the events and support self-stabilisation. They
combine practical evidence and sound principles. Calibration to the current
model is performed with existing data in order to enable a smooth transition
from current practice. Overall effects are limited, if not negligible. Under
each of the proposed models two of the current all time top 100 performers would
improve their ranking substantially, and all three models indicate that the current
number two in the ranking, Thomas Dvorak (CZE), should actually be the world record holder.
It turns out that there are quite significant differences between the
disciplines. The athletes seem to profit disproportionately from the long jump,
the 110m hurdles and the 100m, while in contrast the 1500m, javelin, discus
throw and shot put are highly unfavorable. Apparently, top decathletes tended
to specialize in sprinting, which indeed may be regarded a common denominator of
the long jump, the 110m hurdles and the 100m. Throwing capabilities and
endurance, however, seem to be far less profitable and may even interfere with
One might be tempted to infer from this pattern that performances in the throwing events and 1500m are lagging behind and thus leave more room for further improvements than the sprint-based events, but this conclusion is not tenable. First, this would reflect an embarrassing disregard of the fact that the top decathletes go to the limits of each discipline in any possible way. It would be naive to assume that substantial improvements were possible, even if radical changes in training were to be applied. It is not the athletes who should be blamed for the apparently sub-optimal performance, but the scoring method itself. In principle, the top 100 average score should be equally distributed over the events. Indeed, decathletes who have achieved scores that rank in the all time top 100 are the only candidates to set the empirical standards for genuine all-round performances. Any anomalies in the performance pattern of Figure 1 should thus be ascribed to imperfections of the scoring method.
Secondly, the self-corrective nature of the performance pattern is refuted by
the numeric gradient in the scoring tables: a 1% increase of the long jump
performance yields 19 extra points, whereas the same increase in discus throw
yields only 9 points and javelin and shot put will bring only 10 points.
Improving the 100m performance 1% would produce 24 points! This pattern implies
a positive feedback loop for sprinting-based performances at the expense of
throwing skills and endurance. Therefore, the different scores in Figure 1
cannot be regarded as a temporary or coincidental deviation from equilibrium; on
the contrary, the pattern seems to be highly unstable and will probably show
increasing differences between disciplines in the course of time. Current
decathletes are excellent sprinters. Apparently, this is a self-establishing
fact, because further specialization in the sprints pays off. As such a tendency
conflicts with the premise that the decathlon champion should be the best
all-round athlete rather than a solid sprinter, modifications of the scoring
method are inescapable.
The current scoring method
Even though we have observed some problems with the current scoring method, we want to emphasise that the method as such is quite sophisticated. It uses objective, unambiguous, quantifiable performance data (i.e. time and distance) and avoids the subjective assessments of jurors (aesthetics, expression) that cause so many problems in the rating of gymnastics, figure skating or dressage. It also avoids complicated and probably unfair multistage accounting systems, like the system of rally points, games and sets in tennis. Such systems are used for historical rather than logical reasons. Furthermore, the scoring tables are progressive in kind, as will be explained below. These are far better than the linear systems that are still being used elsewhere, for instance in the combined events of speed skating.
The current scoring tables were adopted in 1984 after extensive discussions, negotiations and compromises. The process took into account an abundant amount of empirical evidence. Basically, the current scoring method for each discipline is covered by a mathematical expression of the type:
$$S(P) = A \cdot (P - B)^{C} \qquad (1)$$
● P is the performance (e.g. the achieved distance in the long jump).
● S is the score (the number of ascribed points).
● A, B & C are event-dependent parameters that define the nature of the scoring table.
For running events (P-B) should be replaced with (B-P) because of the descending
nature of performance with time. Note that the performance assessment method
comprises two stages: first the performance P is measured (in units of time or distance), next
the performances are converted to a score S in order to allow addition.
Clearly, it is this second stage of assessment that is problematic.
Figure 2 shows the scoring curve for the long jump, according to equation (1). It uses the following values: A=0.14354, B=220 cm, C=1.40, while P is expressed in cm.
Such scoring curves have the following characteristics. The parameter B
defines a threshold value (2.20m) below which no score is assigned. This is
substantiated by the assumption that any athlete can reach such a
distance without any effort whatsoever and therefore will not receive any points
for performances below B. Above this threshold value the performances are rated
through a slightly progressive curve, the nature of which is mainly determined
by parameter C. The underlying idea of this nonlinearity is that an improvement at low performance levels is much easier than an improvement at
high performance levels. Indeed, improving the long jump from 8.00m to 8.20m is
far more impressive than the same improvement from 4.00m to 4.20m as the scoring
table assigns 51 extra points against 33 extra points. The overall scaling of
the curve is determined by a parameter A. Thus, the current decathlon scoring
method comprises a set of 10 power laws that is specified
by 30 calibration parameters (A, B and C for each of the 10 events).
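As a quick numerical check of Equation (1), the long jump curve can be evaluated directly. The following is a minimal sketch that uses only the long jump parameters quoted above (A=0.14354, B=220 cm, C=1.40); the function name and the rounding down to whole points are assumptions made here for illustration.

```python
import math

# Long jump parameters of Equation (1), as quoted in the text (P in cm).
A, B, C = 0.14354, 220.0, 1.40


def long_jump_score(distance_cm: float) -> int:
    """Score S(P) = A * (P - B)**C; zero at or below the threshold B."""
    if distance_cm <= B:
        return 0
    # Assumption: scores are rounded down to whole points.
    return math.floor(A * (distance_cm - B) ** C)


# The same 20 cm improvement at a high and at a low performance level.
gain_high = long_jump_score(820) - long_jump_score(800)
gain_low = long_jump_score(420) - long_jump_score(400)
print("8.00m -> 8.20m:", gain_high, "extra points")
print("4.00m -> 4.20m:", gain_low, "extra points")
```

If the sketch is faithful to the tables, the two gains should agree with the 51 and 33 extra points mentioned above, which illustrates the progressive character of the curve.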
Over time, several inadequacies seem to have crept into the scoring method. Moreover, the theoretical foundation of the formula is weak. The progressive form is assumed to be associated with the kinetic energy that an athlete has to develop during the event, irrespective of whether a run, jump or throw is involved. This would suggest that performance is proportional to squared speed (v²), which would indirectly suggest a progressive form with power 2.0. In practice, however, it was necessary to apply power functions with exponents (parameter C) well below 2.0, with some variations over the disciplines (e.g. javelin C=1.08; long jump C=1.40; 100m C=1.81). Note that the progression of the curves is partly determined by the threshold values B (e.g. javelin B=7.0m; long jump B=2.20m; 100m B=18.0s). The current tables are pragmatic in kind rather than based on solid explanation. Consequently, some arbitrariness is involved. Indeed, it is difficult to explain why the long jump scoring table should start at 2.20m rather than at 2.40m, 1.80m or even at 0.00m. An additional weakness of the current system is its inability to correct itself over time: as indicated before, the current scoring system is intrinsically unstable in that differences between disciplines tend to increase rather than fade away. These observations amplify our call for a revision.
Premises for justified rating
Before we elaborate alternative scoring methods, we will list the basic requirements they should meet. The envisioned methods should:
Allow a fair comparison between events;
Be uniform over all events (this follows from the starting point of the decathlon);
Use objective standards (distance and time measurements);
Be grounded in empirical evidence from the decathlon (practical significance);
Be based on sound principles and logic (consistent, transparent and substantiated);
Be stable over time and thus possess self-stabilising characteristics;
Allow smooth transitions from the existing model (acceptability).
Naturally, the method must be credible and acceptable in that it should not
degrade obvious top athletes to middle-of-the-road performers. This holds
even when it comprises the paradox that we reject the current method but still
demand the new system to yield more or less similar outcomes.
Next, we will explore new scoring models in two stages. First, we will develop and discuss a procedure to express the diverse performance scales in a normalised form in order to allow comparisons. Second, we will develop alternative expressions for the scoring function S(P).
Normalization of performance scales
Any scoring method for combined events is doomed to compare apples and oranges, as it combines and reckons with different processes, different variables and different types of performances. In order to enable a comparison of one type of performance with another type of performance we need a way to transform each performance scale into a normalised form. As will turn out below, such normalization of decathlon performances can be achieved much more easily than in the case of apples and oranges.
In the current system, throwing and jumping performances are expressed in a straightforward way by the achieved distance: larger distances correspond with better performances. In running events, however, performances are expressed in the length of time needed. Consequently, running performances and their quantification are inversely related, rather than linearly: the less time needed, the higher the score. In order to achieve a sensible normalization procedure we first have to align time measurement and distance measurement. Let the performance in a certain event be quantified by a performance variable P. To be consistent in terminology, high performances should correspond with large values of P. Figure 3 displays such a performance axis.
In accordance with the current system, we may define a threshold performance P0
that would correspond with the performance below which no score is assigned
(S=0). In the current system P0 is given by parameter B in equation (1). The
value P=0 would correspond with the ultimate inactivity. Naturally, such a
performance scale easily matches the distance scale of throwing events and
jumping events. For running events, the performances should no longer be
expressed in units of time, but rather in units of speed or, likewise, in units
of reciprocal time. If so, the value P=0 would correspond with the ultimate
inactivity: indeed, it would take forever.
With such alignment of the throwing-jumping events and running events in mind, the definition of a normalised performance scale can be formalized by a linear transformation of the performance variable P. We would need two calibration values, P0 and P1, to define the normalised performance PN(P) of a performance P in a particular event:
$$P_N(P) = \frac{P - P_0}{P_1 - P_0} \qquad (2)$$
P1 represents the high-end calibration value of the performance scale, whereas
the performance threshold P0 is the low-end calibration value.
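The normalisation can be sketched in a few lines of code. This is only an illustration under stated assumptions: the linear form (P - P0)/(P1 - P0) is the one implied by the two calibration values, the helper names are hypothetical, and running times are converted to average speed (distance divided by time) as one way of working in reciprocal time. The example values (7.66m and 10.76s as high-end references, and a threshold at 0.340 of the reference) are quoted elsewhere in the text.

```python
def to_performance(value: float, is_running: bool, distance_m: float = 1.0) -> float:
    """Map a raw result onto a scale where a larger P means a better performance.

    Distances (in m) are used as they are; running times (in s) are turned
    into average speed, i.e. distance multiplied by reciprocal time.
    """
    return distance_m / value if is_running else value


def normalise(p: float, p0: float, p1: float) -> float:
    """Linear transformation with PN = 0 at the threshold P0 and PN = 1 at P1."""
    return (p - p0) / (p1 - p0)


# Long jump: reference P1 = 7.66 m, threshold assumed at 0.340 * P1.
p1_jump = to_performance(7.66, is_running=False)
p0_jump = 0.340 * p1_jump
pn_jump = normalise(8.00, p0_jump, p1_jump)

# 100m: reference 10.76 s expressed as speed, threshold again at 0.340 * P1.
p1_sprint = to_performance(10.76, is_running=True, distance_m=100.0)
p0_sprint = 0.340 * p1_sprint
pn_sprint = normalise(to_performance(10.50, True, 100.0), p0_sprint, p1_sprint)

print(f"long jump 8.00m -> PN = {pn_jump:.3f}")
print(f"100m in 10.50s  -> PN = {pn_sprint:.3f}")
```

Both example performances are better than the reference values, so their normalised performances come out slightly above 1, which shows that the scale has no upper limit at 1.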
As for the high-end calibration value P1, we would need a stable reference value that represents high performances. While maximum performance is indefinite, per se, we propose to equate P1 with the average of the all time top 100 performances that have been used before in Figure 1. This choice may seem somewhat arbitrary, but as it is being used for the relative alignment of the performance scales of the various events, it is not critical. We might have chosen the top 50 average as a reference as well, or even the world record data. This would indeed produce different transformations (cf. Equation (2)), but it would still preserve the idea of normalization. Actually, what matters is that the data is representative. By using the all time top 100 average, existing peaks and exceptions are dimmed by the statistics. The current averages of the performances in the all time top 100 decathlons are listed in Table 1.
So, when we choose the values of P1 to correspond with the average
performances listed in Table 1, we conform to the idea that athletes who
achieve all time top 100 decathlon scores have the same normalised performance
(i.e. PN(P1) = 1) for each event. Consequently,
this means that 10.76s for the 100m is the same performance as a long jump of
7.66m and so on. In fact, starting from the principle of all-roundness, this is
the only sensible decision. It also means that the associated scores S(P1) (cf.
Figure 1) should be the same for each event. Note that this (arbitrary)
normalization of the performance P does not mean that PN has an upper limit of 1;
indeed, PN may become larger than 1 if P > P1, that is, when performances
exceed the top 100 average (which may occur regularly).
For the low-end calibration value P0, the official threshold parameters B, as defined in the current scoring method (cf. Equation (1)), may seem interesting candidates. However, in contrast with the values of P1 (the all time top 100 averages), which represent exemplary, real and reliable data, the current values of B are quite problematic, because they are the result of accumulated modifications loaded with historical bias and lack logical foundation. The origins of the existing values B are unclear and their fairness is questionable. Therefore, the indiscriminate import of these existing threshold values, which for their part may be an important cause of the unbalance in the current scoring method, is not acceptable. This becomes manifest when we list the current IAAF threshold values B relative to the high-end performances P1 (cf. the third column in Table 2).
It appears that the relative positions of the current IAAF threshold values B are very different for the different disciplines. Relative positions spread over a factor of 7, ranging from 0.085 for discus throw to 0.598 for the 100m. Theoretically a different threshold for each event might be plausible, because, indeed, each discipline requires different techniques, different muscles and different procedures. Yet, the current thresholds seem to display quite a degree of arbitrariness and break through the uniformity of the disciplines without any foundation. Our proposition here is that in the absence of any reasoning about the physical parameters that would substantiate the necessity of different thresholds, a uniform approach over the disciplines is indicated. Indeed, if uniformity over all disciplines is our starting point, P0 should be at the same position for each event. This means that we want P0/P1 to be a constant. A first approximation of P0/P1 would be the average of B/P1, which yields a ratio of 0.340. Using this ratio produces a uniform estimate for the threshold values for each discipline (cf. Table 2, fourth column). Note the substantial differences between our uniform threshold values P0 (fourth column) and the current IAAF thresholds B (second column), especially in the running and throwing events.
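To make the threshold comparison concrete, the sketch below computes the relative position B/P1 and the uniform threshold P0 = 0.340 · P1 for the two events whose values are quoted in the text (long jump and 100m). The dictionary layout and function names are hypothetical. For the 100m the ratio is taken on the reciprocal-time scale, which appears to be how the quoted ratio of 0.598 arises; this reading is an assumption.

```python
UNIFORM_RATIO = 0.340  # average of B/P1 over the events, as quoted in the text

# IAAF threshold B and top 100 average P1, both as quoted in the text.
EVENTS = {
    "long jump": {"B": 2.20, "P1": 7.66, "running": False},  # metres
    "100m": {"B": 18.0, "P1": 10.76, "running": True},       # seconds
}


def relative_threshold(b: float, p1: float, running: bool) -> float:
    """Position of B relative to P1 on a scale where larger means better."""
    # On the reciprocal-time scale (1/B) / (1/P1) equals P1 / B.
    return p1 / b if running else b / p1


def uniform_threshold(p1: float, running: bool) -> float:
    """Uniform threshold P0 expressed in the raw unit of the event."""
    # For running events a lower speed threshold means a longer time.
    return p1 / UNIFORM_RATIO if running else UNIFORM_RATIO * p1


for name, ev in EVENTS.items():
    ratio = relative_threshold(ev["B"], ev["P1"], ev["running"])
    p0 = uniform_threshold(ev["P1"], ev["running"])
    print(f"{name}: B/P1 = {ratio:.3f}, uniform P0 = {p0:.2f}")
```

Under these assumptions the uniform long jump threshold lies somewhat above the current 2.20m, while the uniform 100m threshold corresponds to a much slower time than the current 18.0s.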
Current scoring method: comparison of events
The performance normalization procedure described above allows us to display the current scoring curves (cf. Equation (1)) at normalised performance PN (cf. Equation (2)). Figure 4 displays the results for 5 of the events. Similar curves result for the other events.
From Figure 4 we conclude that the unbalance of the scoring is not restricted
to high end performances as was inferred from Figure 1, but that it is present
at all performance levels. Note that the curves not only have different scoring
levels, but also very different curvatures and associated gradients. These
different gradients imply that equal (normalised) performance improvements are
rated differently in each discipline. The calculations confirm our
preliminary conclusion that these differences cause the scoring system to be
intrinsically unstable. We remark that the calculations indicate that throwing
events (shot, javelin and discus) have very similar curves, which differ only up
to 4%. Such resemblance might be expected with events that technically have many
points in common. Similarly, running events seem to display a common pattern
too: a steep rise at high performances. Yet, the differences in running scores
are much greater, as is the case for the jumping events.
Note that the curves in Figure 4 only represent an intermediate stage of our analysis, because the normalisation affects only the horizontal scale, while the vertical scale is kept unchanged. As a consequence, one may notice some inconsistency, since the horizontal scale uses the uniform threshold values P0, according to Table 2, whereas the vertical scale still uses the current IAAF thresholds B, according to Equation (1). In the next section we will elaborate alternative methods to redefine the vertical scale.
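The divergence of the current curves on the normalised scale can be illustrated for two events. The sketch below is a rough illustration, not a reproduction of Figure 4: it assumes the linear normalisation with P0 = 0.340 · P1, uses the long jump parameters and the 100m values B=18.0s and C=1.81 quoted in the text, and takes the 100m scaling A=25.4347 from the published IAAF tables, a value that does not appear in this text. All function names are hypothetical.

```python
UNIFORM_RATIO = 0.340  # uniform P0/P1 ratio from the text


def iaaf_long_jump(distance_m: float) -> float:
    """Current long jump score, Equation (1), with P in cm."""
    return 0.14354 * max(distance_m * 100.0 - 220.0, 0.0) ** 1.40


def iaaf_100m(time_s: float) -> float:
    """Current 100m score, Equation (1) with (B - P) for running events."""
    # Assumption: A = 25.4347 comes from the published tables, not this text.
    return 25.4347 * max(18.0 - time_s, 0.0) ** 1.81


def raw_from_normalised(pn: float, p1: float) -> float:
    """Invert the normalisation: P = P0 + PN * (P1 - P0), with P0 = 0.340 * P1."""
    p0 = UNIFORM_RATIO * p1
    return p0 + pn * (p1 - p0)


rows = []
for pn in (0.6, 0.8, 1.0):
    jump_m = raw_from_normalised(pn, 7.66)             # distance scale
    speed = raw_from_normalised(pn, 100.0 / 10.76)     # speed scale in m/s
    time_s = 100.0 / speed
    rows.append((pn, iaaf_long_jump(jump_m), iaaf_100m(time_s)))
    print(f"PN={pn:.1f}: long jump {rows[-1][1]:.0f} pts, 100m {rows[-1][2]:.0f} pts")
```

At equal normalised performance the two events receive clearly different scores, and the gap changes with the performance level, which is the kind of divergence the text describes.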
Towards alternative scoring methods
So far, the divergence of the scoring curves in Figure 4 is an embarrassing confirmation of the inappropriateness of the current scoring method. If the normalisation procedure according to Equation (2) is accepted to be valid, the scoring curves of the various events should coincide rather than diverge. In accordance with the principles of the decathlon, the scoring should be uniform over all disciplines. This means that we have to redefine the scoring formula S(P) of Equation (1) in a uniform way. As a further constraint, we refer to the calibration values P0 and P1 that we have used to transform performance values P into a normalised form. For the threshold value P0 it follows that:
$$S(P_0) = 0 \qquad (5)$$
It turns out that the average all time top 100 decathlete has a score of 8639 points. Because the scoring curve S is assumed to be uniform over all events, it follows that for each event:
$$S(P_1) = 863.9 \qquad (6)$$
Such empirical calibration ensures that the total scores of the all time top 100 decathletes stay in the same range as the current scores, in accordance with our premise.
Naturally, when we want to rewrite the scoring function S as a function SN of the normalised performances PN, according to Equations (2), (3) and (4), we obtain:
$$S_N(0) = S(P_0) = 0 \qquad (7)$$
$$S_N(1) = S(P_1) = 863.9 \qquad (8)$$
While uniformity over all disciplines is assumed for SN, we have to find and substantiate a progressive curve with two fixed points, given by Equations (7) and (8). Below we will present three alternative approaches, the results of which are presented in Figure 5. The three models will be explained below.
Model I: Power law
In accordance with the current scoring method, we assume that SN can be described by a power law:
$$S_N(P_N) = A \cdot (P_N)^{C} \qquad (9)$$
From Equations (2) and (9) we find that the regular scoring function S(P) can be written as:
$$S(P) = A \cdot \left(\frac{P - P_0}{P_1 - P_0}\right)^{C} \qquad (10)$$
Note that this power law approach significantly differs from the current IAAF power law in that performance in the running events is expressed in units of reciprocal time, rather than in units of negative time (cf. Equation (1)). Also, it follows from the uniformity of SN that A and C are constant over the events, in contrast with the current IAAF scoring method which demands different values of A and C for each discipline.
The constraint in Equation (6) gives:
$$A = 863.9 \qquad (11)$$
The only remaining unknown in Equation (10) is the power C. Naturally, the value of C determines the progressive form of the scoring curve, so it follows that C>1. A simple estimate of C can be obtained by conforming to the 10 IAAF power parameters C that are used in the current scoring method. When we equate C with the average of these current powers we find:
$$C = 1.479 \qquad (12)$$
The resulting score curve is displayed in Figure 5. When we compare the suggested power law curve of Figure 5 with the scoring curves in Figure 4, it turns out that the new curve has an intermediate position. Coincidentally, the new curve almost coincides with the high jump curve in Figure 4. Relevant data for this suggested power law curve are summarised in Table 3.
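The power law model can be sketched as follows. The code first averages the ten exponents C of the current tables and then applies the uniform curve with A=863.9. Note the assumptions: only three of the ten exponents appear in this text (javelin, long jump, 100m), the others are taken from the published IAAF tables; clamping negative normalised performances to zero and the function names are choices made here for illustration; the long jump calibration uses P1=7.66m and P0 = 0.340 · P1.

```python
# Exponents C of the current tables. Assumption: values other than javelin,
# long jump and 100m come from the published IAAF tables, not from this text.
CURRENT_EXPONENTS = {
    "100m": 1.81, "long jump": 1.40, "shot put": 1.05, "high jump": 1.42,
    "400m": 1.81, "110m hurdles": 1.92, "discus": 1.10, "pole vault": 1.35,
    "javelin": 1.08, "1500m": 1.85,
}

A = 863.9  # score assigned to the top 100 average in each event
C = sum(CURRENT_EXPONENTS.values()) / len(CURRENT_EXPONENTS)


def power_law_score(p: float, p0: float, p1: float) -> float:
    """Uniform power law score A * PN**C with PN = (P - P0) / (P1 - P0)."""
    pn = max((p - p0) / (p1 - p0), 0.0)  # assumption: no score below P0
    return A * pn ** C


p1_jump = 7.66
p0_jump = 0.340 * p1_jump
print(f"average exponent C = {C:.3f}")
print(f"long jump 7.66m -> {power_law_score(7.66, p0_jump, p1_jump):.1f} points")
print(f"long jump 8.00m -> {power_law_score(8.00, p0_jump, p1_jump):.1f} points")
```

The same function would serve every event once its P0 and P1 are known, with running performances expressed in reciprocal time or speed before they are passed in.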
Model II: Parabolic
It was mentioned above that the progressive form of the scoring curve may be associated with the role of the kinetic energy that is developed by the athlete. Along this line of thought the resulting scoring curve should be parabolic, because the performance P is always expressed in units of distance or units of (reciprocal) time. This argumentation, however, is not very specific, and it omits the effects of the different techniques and constraints of the disciplines. Yet, there is another reasoning that underpins the likelihood of a parabolic scoring curve. To find a solid basis for the progressive behaviour we should return to the basic idea that progression reflects that the gradient of the scoring curve increases with performance. In mathematical terms we state the premise that the extra score dSN(PN) that follows a performance improvement dPN is proportional with the performance PN:
$$dS_N(P_N) \sim P_N \cdot dP_N \qquad (13)$$
Indeed, achieving a performance increment dPN at a high performance level PN produces more points dSN than the same increment at a lower level. Integrating Equation (13) gives a parabolic dependence:
$$S_N(P_N) = A \cdot (P_N)^{2} \qquad (14)$$
Note that this parabolic curve is a special case of the power law of Equation (9), i.e. C=2. The scaling constant A is given by Equation (11).
As can be seen from Figure 5, the parabolic curve (C=2) is slightly more progressive than the power law curve, which uses only a power of C=1.479. Differences between the two scoring curves are up to a few percent (0 to 40 points) in the high performance area (PN>0.9) and rise up to 100 points at low performances (PN=0.55). Relevant data for this suggested parabolic curve are summarised in Table 3.
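The size of the gap between the two curves can be checked with a short calculation. This is a minimal sketch using only the constants given in the text (A=863.9, C=1.479 for the power law and C=2 for the parabola); the function names are hypothetical.

```python
A = 863.9        # common scaling constant, Equation (11)
C_POWER = 1.479  # exponent of the power law model


def power_law(pn: float) -> float:
    """Model I: S_N = A * PN**1.479."""
    return A * pn ** C_POWER


def parabolic(pn: float) -> float:
    """Model II: S_N = A * PN**2."""
    return A * pn ** 2


def gap(pn: float) -> float:
    """Points by which the power law exceeds the parabola at a given PN."""
    return power_law(pn) - parabolic(pn)


for pn in (0.55, 0.90, 1.00):
    print(f"PN={pn:.2f}: power {power_law(pn):6.1f}, "
          f"parabolic {parabolic(pn):6.1f}, gap {gap(pn):5.1f}")
```

Both curves pass through the same two calibration points, so the gap vanishes at PN=0 and PN=1 and is largest in between.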
Model III: Exponential
The progression of the scoring curve can also be approached with statistics. Indeed, progression may be assumed to reflect the reduced chance of success at increased performances. To define progression statistically we state that the extra score dSN(PN) that follows a performance improvement dPN is inversely proportional with the occurrence or frequency f(PN) of performance PN in the population of decathletes:
$$dS_N(P_N) \sim \frac{1}{f(P_N)} \cdot dP_N \qquad (15)$$
While the frequency f(PN) may be assumed to descend monotonically (indeed, fewer and fewer athletes will be able to achieve better performances), a performance improvement dPN is rewarded more greatly at high performance levels.
The next question would be: what evidence is available about the frequency f(PN)? The standard approach to sort out f(PN) would be to take a random sample to represent the population of all decathletes. This, however, introduces two severe conceptual problems. First, while gathering results from decathlon competitions, national and international ranking lists and so on, we would be taking biased samples that only represent the local top 10 or top 50 participants and disregard large groups of modal athletes who make up the majority of the decathlon population. Secondly, the combination of data in different performance intervals, e.g. the combination of international data and sets of regional data, is not straightforward but should be linked with the relative occurrences in the performance intervals. Obviously, combining the results of the World Championships with the data of some unimportant event would not produce a representative sample for the decathlon population. This problem is circular in kind and thus irresolvable: to derive the performance distribution f(PN) from combined results in various intervals we would need to know the relative occurrences, which are given by f(PN) itself. Therefore, empirical occurrence data will not be of any help here.
A second approach would be to suggest a theoretical probability distribution, by investigating the conditions of the probability process. Although various well-known distribution functions like the Poisson distribution or the normal distribution may be interesting candidates, we would still need good empirical estimates of the distribution's mean and variance. Also, we have to consider that only (part of) the descending tail of such distribution is of relevance, because only the descending tail reflects increasing failure; in contrast, the ascending tail at low performances represents the fact that most athletes easily exceed these low performances.
These observations indicate severe difficulties in a straightforward and successful application of a statistical analysis. However, we have some indications that the performance distribution function might be approximated by the negative exponential distribution:
$$f(P_N) \sim e^{-\lambda P_N} \qquad (16)$$
where λ is a constant.
This choice is underpinned by the following arguments:
The exponential distribution is often associated with the survival of species in biology or similar processes that account for failures and drop-outs, for instance the reliability of technical components. The process of survival has many things in common with sports events. Consider, for example, the high jump and pole vault, where the requested performance of athletes is incremented in steps, until eventually all competitors have dropped out. Theoretically all decathlon events can be mapped on to this approach and thus match a regular survival pattern.
As will be demonstrated below, the premise of Equation (16) provides a monotonic progressive scoring curve and thus fits our objectives.
Clearly, the probability function f in Equation (16) can be regarded as the solution of the following differential equation:
$$df(P_N) \sim f(P_N) \cdot dP_N \qquad (17)$$
This equation establishes the sensible premise that a performance increment dPN causes a frequency change df(PN) that is linearly proportional with f(PN).
The exponential distribution is simple in its form, it has only one parameter (λ) and it can be integrated analytically.
When we combine Equations (15) and (16) and integrate and make use of Equations
(7) and (8), we obtain the following progressive expression:
$$S_N(P_N) = A \cdot \frac{e^{\lambda P_N} - 1}{e^{\lambda} - 1} \qquad (18)$$
Again A is given by Equation (11). To decide on the value of λ we set the pragmatic requirement that the exponential curve has an intermediate position between the power curve and the parabolic curve. By minimising the total squared differences between the curves on the interval [0,1] we find λ=1.602. The resulting exponential curve is shown in Figure 5. Although our fitting procedure implies an intermediate curve, the exponential relationship creates a relatively strong progression at high-end performances (PN>1). Differences with the power law curve are up to 15 per cent in the midrange (up to 60 points). Note that the inverse value of λ represents the expected value of the performance PN. This would indicate an average performance of <PN>=1/λ=0.62422. Relevant data for this suggested exponential curve are summarised in Table 3.
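The fitting step can be sketched numerically. The text does not spell out the exact objective, so the code below makes explicit assumptions: the "total squared differences" are taken as the sum of squared differences between the exponential curve and both reference curves (power law and parabola) on an evenly spaced grid over [0,1], and the minimum is located with a simple ternary search. Function names, grid size and search bounds are illustrative choices.

```python
import math

A = 863.9        # scaling constant, Equation (11)
C_POWER = 1.479  # exponent of the power law model
GRID = [i / 200 for i in range(201)]  # evenly spaced PN values on [0, 1]


def power_law(pn: float) -> float:
    return A * pn ** C_POWER


def parabolic(pn: float) -> float:
    return A * pn ** 2


def exponential(pn: float, lam: float) -> float:
    """Model III: S_N = A * (exp(lam * PN) - 1) / (exp(lam) - 1)."""
    return A * (math.exp(lam * pn) - 1.0) / (math.exp(lam) - 1.0)


def squared_distance(lam: float) -> float:
    """Assumed objective: squared differences to both reference curves."""
    return sum(
        (exponential(p, lam) - power_law(p)) ** 2
        + (exponential(p, lam) - parabolic(p)) ** 2
        for p in GRID
    )


def fit_lambda(lo: float = 0.5, hi: float = 3.0, tol: float = 1e-6) -> float:
    """Ternary search, assuming the objective has a single minimum in [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if squared_distance(m1) < squared_distance(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


lam = fit_lambda()
print(f"fitted lambda = {lam:.3f}, 1/lambda = {1.0 / lam:.3f}")
```

Under these assumptions the fitted value should land close to the λ=1.602 reported above; small deviations would be expected from the choice of grid and objective.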
All three suggested models meet the requirements for a justified rating for which we have expressed a need. The normalisation procedure allows a fair comparison between events. The proposed scoring methods are uniform over the events and support self-stabilisation. They combine practical evidence and sound principles. Various calibrations to the existing model would allow smooth transitions from the current method. As a consequence, overall effects are limited if not negligible.
In the all time top 100 ranking the average change is 10 positions for each of the models, which corresponds with a relative improvement (or degradation) of 30%. The biggest leap is observed for the number 59 athlete in the current ranking (Mike Smith (CAN)), who may be assumed to be greatly underrated and put at a disadvantage by the current system because of relatively poor sprinting (100m in 11.23; 110m hurdles in 14.77). Both the parabolic method and the power method allocate Smith a rank of 8th; the exponential yields a rank of 4th. Likewise number 21 in the IAAF ranking (Uwe Freimuth (GER): 11.03, 14.66) enters the all time top 10: 6th (parabolic), 7th (power) or 5th (exponential).
From this we see that the alternative models seem to counteract the sprint bias of the current model. Remarkably, all three models indicate a new world record holder, or rather a reinstatement of the old record holder, as Thomas Dvorak's (CZE) 1999 performance in Prague outstrips Roman Sebrle's (CZE) subsequent mark from 2001 in Gotzis, which is unanimously ascribed to 2nd. Dvorak's record would read 9232 points using the power law, 9469 with the parabolic or 9777 for the exponential curve. Note that these scores greatly exceed Sebrle's currently recognised mark of 9026, especially in the case of parabolic and exponential scoring due to the relatively high progression of the curves at world level performances. The medallists at the 2005 World Championships in Athletics would remain unchanged under the three alternative methods, although Bryan Clay's (USA) winning margin would be even more pronounced, due to the same effect.
In this paper we have shown that the current decathlon scoring method suffers from severe bias and produces unfair outcomes. It would need a revision to become more balanced and stable. We have demonstrated that it is possible to devise alternative scoring methods that are uniform, balanced and substantiated and that avoid the negative effects of the current method. On several occasions we have chosen to estimate or calibrate data by falling back on existing habits or data (e.g. performance thresholds P0, power C) in order to connect to existing practice. One may wonder about the exact value of the power parameter C, or one may question the necessity to define thresholds P0 at all. Indeed, other choices are possible and arguable, possibly with different outcomes and consequences, but the quintessence of this paper is to present a proof of concept of appropriate alternatives.
The presented models not only have greater plausibility, they also are much simpler and need fewer parameters. Instead of 30 parameters in the current model, the power law method uses only 22 (magnitude A, power C and 10 times P0 and P1), as does the exponential model (magnitude A, rate λ and 10 times P0 and P1); the parabolic method uses 21 (magnitude A and 10 times P0 and P1). This reduction is an improvement as, according to "Ockham's Law of Parsimony" or "Principle of Economy" (called "Ockham's razor"), one should make no more assumptions than needed to explain ascertained facts. It supposes that the same principle of simplicity prevails in the physical cosmos, since the laws of nature are governed by the tendency towards minimum energy and a minimum number of degrees of freedom.
Such a principle of economy would indeed fairly suit the efforts of decathletes who seek to challenge the limits of performance, equally in all events.
FROM: IAAF/NSA 1-06