Dimension-Corrected Somers’ D for the Item Analysis Settings

A new index of item discrimination power (IDP), dimension-corrected Somers’ D (D2), is proposed. Somers’ D is one of the superior alternatives to item–total (Rit) and item–rest (Rir) correlation in reflecting the real IDP of items with scales 0/1 and 0/1/2, that is, with up to three categories. D also reaches the extreme values +1 and −1 correctly, while Rit and Rir cannot reach these ultimate values in real-life testing settings. However, when the item has four or more categories, Somers’ D underestimates IDP more than the Pearson correlation does. A simple correction to Somers’ D in the polytomous case seems to be effective in item analysis settings. In the simulation with real-life items, D2 showed very few cases of obvious underestimation and practically no cases of obvious overestimation. With certain restrictions discussed in the article, D2 seems to be a good alternative to these classic estimators not only with dichotomous items but also with polytomous ones. In general, the magnitudes of the estimates by D2 are higher than those by Rit, Rir, and polychoric correlation, and they seem to be close to those of the bi- and polyserial correlation coefficients without out-of-range values.


Item discrimination and the deterministic pattern
Item discrimination power (IDP), one of the three essential parameters of a test item, is classically defined as the efficiency of a single item to discriminate between lower- and higher-scoring test-takers (see Educational Testing Service [ETS], 2020; Liu, 2008; Lord & Novick, 1968; MacDonald & Paunonen, 2002). Metsämuuronen (2020a) notes that this loose definition is not very practical when assessing the possible under- and overestimation produced by different estimators of IDP in real-life settings. Hence, he discusses an operational definition of IDP related to the concept of deterministic item discrimination. Deterministic item discrimination refers to the pattern in which the score explains perfectly the behavior in the item; we then expect to see perfect explaining power between the two variables. The latent variable is categorized into two ordinal scales, the item g with categories i = 1, 2, …, R and the score X with categories j = 1, 2, …, S, as illustrated in Figure 1. The lowest and highest thresholds of both scales are defined as $-\infty$ and $+\infty$, respectively, and the categories are assumed to be ordered: $g_1 < \dots < g_i < \dots < g_r$ and $x_1 < \dots < x_j < \dots < x_s$.

Figure 1. A latent variable $\theta$ categorized into two ordinal scales and the number of times the observation ($g_i$, $x_j$) is obtained in the sample ($n_{ij}$)
From the traditional viewpoint of correlation coefficients related to item analysis, the observed correlation between the interval-scaled variables g and X is the item-total correlation ($\rho_{gX}$), the observed correlation between the binary g and the ordinal X is the rank-biserial correlation ($\rho_{RB}$), the inferred correlation between the latent $\theta$ and the observed X is the polyserial correlation ($\rho_{\theta X}$), and the inferred correlation between two latent variables is the polychoric correlation ($\rho_{\theta\theta}$). For the last of these, within measurement modelling settings, we would expect to obtain a perfect correlation, and the further $\rho_{\theta\theta}$ is from 1, the more measurement error is included in the measurement instruments, including both the items and the score. We note that the correlational viewpoint on item discrimination is based on the covariance between the item and the score.
From the viewpoint relevant to this article, that of the family of Somers' D, including Kendall's tau-a and tau-b (Kendall, 1938) and Goodman-Kruskal G (Goodman & Kruskal, 1954), item discrimination is approached from the probability viewpoint. Somers' D estimates the probability (π) that a randomly chosen pair of test-takers has the same order in both the item and the score (see Van der Ark & Van Aert, 2015). The probability of the same order is denoted by P and that of the reverse order by Q; the probabilities of tied pairs related to rows and columns are $\pi_{T_R}$ and $\pi_{T_C}$, respectively. The latent δ relates the probabilities P and Q to the maximal possible number of pairs in the same direction (including the tied pairs). Hence, for the direction relevant to this article, the latent δ conditioned so that the column factor explains the row factor is defined in Eq. (2).
International Journal of Educational Methodology

Somers' D in the practical item analysis settings
Somers' D approximates the latent δ. The computational forms of Somers' D are usually expressed by using the concepts of concordance and discordance between the values of g and X. By using the concepts of P and Q, the specific coefficient relevant to item analysis, D of g conditional on X, that is, Somers' $D(g|X)$†, has the simplified form

$$D(g|X) = \frac{P - Q}{\frac{1}{2}\left(N^2 - \sum_{i} n_{gi}^2\right)} \qquad (3)$$

where $n_{gi}$ is the number of cases in category g = i of item g and N is the total number of cases (Metsämuuronen, 2017b; Siegel & Castellan, 1988; note that, in the literature related to Somers' D, this is notated as $D(X|g)$). The sampling variance is discussed in Section "Asymptotic sampling variance and standard error". The statistical properties of Somers' D have been discussed, for example, by Agresti (2010), Newson (2002, 2006), and Siegel and Castellan (1988), and practical procedures, for example, by Metsämuuronen (2017b).
Because of Eq. (3), Somers' $D(g|X)$ tells the proportion of logically ordered test-takers in the item after the cases are ordered by the score. This fits well with the definition of IDP by Metsämuuronen (2020a). As does the correlation coefficient, $D(g|X)$ varies between −1 and +1. In item analysis settings, the value $D(g|X) = 1$ indicates the positive deterministic pattern: after being ordered by the score, all the test-takers in the higher-ranked subsample(s) j in the item are (correctly) ranked higher than those in the lower subsample(s) i. The value $D(g|X) = -1$ indicates the ultimately pathological situation in which all the cases in the lower subsample(s) i are ranked higher than those in the higher subsample(s) j. The value $D(g|X) = 0$ refers to a situation in which the number of correctly ordered ("concordant") test-takers equals the number of incorrectly ordered ("discordant") test-takers and, hence, the item cannot discriminate the test-takers from each other at all. Basically, the interpretation of the magnitude of the estimates by $D(g|X)$ is the same as that of $\rho_{gX}$, with the note that, in real-life datasets, $\rho_{gX}$ cannot reach a perfect +1 or −1 while $D(g|X)$ can.
† It is good to note the seemingly confusing notation related to Somers' D pointed out by Metsämuuronen (2020a). In the traditional notation of conditions, the direction of the condition $(g|X)$ usually means "g in condition of X", that is, "g is dependent on X", that is, "g dependent". However, within the notation related to Somers' D, $D(X|g)$ is called "g dependent" (see Metsämuuronen, 2017b; Newson, 2002, 2006; Siegel & Castellan, 1988). In this article, the specific notation $D(g|X)$ refers to "g dependent" which, in the outputs of some generally known software packages such as IBM SPSS as well as R libraries, would be called "score dependent". See the practical notes on this notation in relation to the estimates in Metsämuuronen (2020a).
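The interpretation of the values −1, 0, and +1 can be illustrated with a minimal sketch. It assumes that the coefficient is computed as the difference between concordant and discordant pairs divided by the number of pairs that are not tied on the item (the "score dependent" direction described above); the function name and the small datasets are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def somers_d_item_given_score(g, x):
    """Somers' D in the direction used in the text (assumed form):
    (concordant - discordant) / (pairs not tied on the item g)."""
    concordant = discordant = untied_on_item = 0
    for (g1, x1), (g2, x2) in combinations(zip(g, x), 2):
        if g1 == g2:
            continue  # pair tied on the item: excluded from the denominator
        untied_on_item += 1
        product = (g1 - g2) * (x1 - x2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0: tied on the score only; counted in the denominator
    if untied_on_item == 0:
        raise ValueError("the item has only one observed category")
    return (concordant - discordant) / untied_on_item


# Hypothetical data: a binary item and a score without ties.
score = [1, 2, 3, 4, 5, 6]
print(somers_d_item_given_score([0, 0, 0, 1, 1, 1], score))  # deterministic pattern
print(somers_d_item_given_score([1, 1, 1, 0, 0, 0], score))  # pathological pattern
print(somers_d_item_given_score([0, 1, 0, 1], [1, 2, 3, 4]))  # partly discordant
```

Counting pairs directly is quadratic in the number of test-takers, which is acceptable for classroom-sized data; for large samples a cross-table-based count would be preferable.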
By using a comparison with real-life items, Metsämuuronen (2020a) showed that Somers' $D(g|X)$ (D henceforward) would be a good alternative to the generally used classical estimators of IDP. This is specifically true with binary items in relation to $\rho_{gX}$ and $\rho_{gP}$ as well as the family of bi- and polyserial correlations ($\rho_{BS}$, $\rho_{PS}$) and the polychoric correlation coefficient ($\rho_{PC}$) (Pearson, 1900, 1913). In comparison with $\rho_{PC}$, D relates to the known composite of the items and the score, and this information can be used in further analysis, while $\rho_{PC}$ refers to unknown, unreachable, and hypothetical composites that are difficult to use in the analysis. In comparison with some other directional coefficients such as Goodman-Kruskal lambda and tau (Goodman & Kruskal, 1954) or Pearson's eta coefficient ($\eta$) (Pearson, 1903, 1905), D can detect the ultimate discrimination in the item, while lambda, tau, and eta can detect the ultimate discrimination in the score (Metsämuuronen, 2020a).
Although D seems to be a "superior alternative" to $\rho_{gX}$ and $\rho_{gP}$ in the binary case, in the comparison by Metsämuuronen (2020a), D appeared to face a major practical challenge with polytomous items. Although D reaches the ultimate values of IDP accurately, the estimates underestimate the IDP in an obvious manner when the number of categories in the marginal distribution of the item exceeds three and when the discrimination is not perfect or near perfect (Metsämuuronen, 2020a; see also Goktas & Isci; Newson, 2002). This is elaborated in what follows.

Underestimation in D in the empirical datasets
Metsämuuronen (2020a) noted the obvious patterns of underestimation in D with real-world datasets. The underestimation is strictly related to the number of categories in the item's scale, that is, to the degrees of freedom of the marginal distribution of the item (df(g) = r − 1). When the number of marginal categories in the item exceeds three (df(g) > 2), $\rho_{gX}$ appears to be superior to D in reflecting IDP (Figure 2).

Figure 2. Underestimation in D in relation to $\rho_{gX}$ (R) as a function of df(g) and 1/df(g)
The right-hand graph in Figure 2 illustrates a practical peculiarity embedded in $\rho_{gX}$, as well as in all estimators of IDP in item analysis settings: the fewer items there are in the test and the more categories there are in the items, the closer the estimate is to a perfect 1. The phenomenon is obvious when we recall that, in measurement modeling settings, the latent variable θ is common to both the item and the score (see Figure 1), and that the association between item g and score X is determined mechanically because the score is a compound of the items. The latter was the reason why Henrysson (1963) suggested his procedure (Rir); $\rho_{gX}$ is characterized as "spuriously" inflated (e.g., Cureton, 1966, p. 93; Howard & Forehand, 1962, p. 731; Wolf, 1967, p. 21). Consider a "test" with only one item: the correlation between the item and the "score" formed by this item would, obviously, be perfect. Correspondingly, the more items the test score comprises, the further the Pearson correlation between a single item and the score tends to deviate from 1, even if the score explained the behavior in the item perfectly. Obviously, this phenomenon of approaching the value 1 does not make sense outside measurement modeling settings but, in what follows, it plays a significant role in deriving the dimension correction to Somers' D.
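The mechanical connection between an item and the score that contains it can be illustrated with a small sketch. The response matrix below is hypothetical, and the helper names are not from the article; the code only contrasts the item-total correlation (Rit) with the item-rest correlation (Rir) and shows the one-item "test" in which Rit is trivially perfect.

```python
from math import sqrt


def pearson(a, b):
    """Plain Pearson product-moment correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def item_total_and_item_rest(responses, item_index):
    """Return (Rit, Rir) for one item of a test-takers x items matrix."""
    item = [row[item_index] for row in responses]
    total = [sum(row) for row in responses]        # score includes the item
    rest = [t - i for t, i in zip(total, item)]    # score without the item
    rit = pearson(item, total)
    # With a one-item test the rest score is constant, so Rir is undefined.
    rir = pearson(item, rest) if len(set(rest)) > 1 else float("nan")
    return rit, rir


# Hypothetical responses: 6 test-takers, 3 binary items.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]]
print(item_total_and_item_rest(responses, 0))

# A "test" with a single item: the item is the score.
print(item_total_and_item_rest([[0], [1], [1], [0], [1]], 0))
```

In the three-item example Rit exceeds Rir because the item contributes to its own total; the gap narrows as more items are added to the score.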

Underestimation in Somers' D with polytomous items from the theoretical viewpoint
Although D underestimates IDP in an obvious manner, the interpretation of the matter is somewhat challenging because Pearson's product-moment correlation (PMC) and D convey different information about the relation of the item and the score, as discussed above. While $\rho_{gX}$ indicates covariation between the item and the score, D indicates the probability that two randomly chosen test-takers have the same order in both the item and the score, or the proportion of logically ordered test-takers in the item after they are ordered by the score. In any case, underestimation in D in relation to $\rho_{gX}$ is expected because of Greiner's relation (Greiner, 1909), which connects Kendall's tau-a, Somers' D, and the Pearson correlation, as discussed by Kendall (1949) and Newson (2002). Assuming two variables X and Y with continuous scales (implying no ties) sampled from a bivariate normal distribution, Kendall's tau-a, and hence Somers' D, is related to the Pearson correlation by

$$\rho_{XY} = \sin\left(\frac{\pi}{2}\,\tau_a\right) \qquad (4)$$

so that the values of tau-a of 0, ±1/3, ±1/2, and ±1 correspond to the values of $\rho_{XY}$ of 0, ±1/2, ±1/√2, and ±1, respectively (see Figure 3). Then, in the case of two normally distributed continuous variables, except for the extreme values ±1 and 0, the magnitude of $\rho_{XY}$ is greater than that of D. Consequently, because of Eq. (4), and because $\rho_{gX}$ always underestimates association, the estimates by D are expected to underestimate IDP more than $\rho_{gX}$ when the estimate by D differs from 0 and ±1 and the number of marginal categories in the item is high.
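Greiner's relation for continuous, bivariate normal variables is commonly written as ρ = sin(πτ/2). The following sketch, with hypothetical function names, tabulates a few pairs of values and shows that the magnitude of the Pearson correlation is at least that of tau-a everywhere between the extremes.

```python
from math import asin, pi, sin


def pearson_from_tau(tau):
    """Greiner's relation: Pearson correlation implied by Kendall's tau-a."""
    return sin(pi * tau / 2)


def tau_from_pearson(rho):
    """Inverse of Greiner's relation."""
    return 2 / pi * asin(rho)


for tau in (0.0, 1 / 3, 0.5, 0.8, 1.0):
    rho = pearson_from_tau(tau)
    print(f"tau-a = {tau:.3f}  rho = {rho:.3f}  difference = {rho - tau:.3f}")
```

The relation holds only under the stated assumptions (no ties, bivariate normality); with categorized items it serves as an explanation of the direction of the discrepancy rather than as an exact conversion.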

Figure 3. Relation of Pearson correlation (RXY) and Somers' D with continuous variables X and Y
Because of the obvious disadvantage of D with polytomous items, that is, underestimating IDP even more than $\rho_{gX}$ does, Metsämuuronen (2020a) suggests that a "dimension-corrected Somers' D" could be worth deriving. While D is a "superior alternative" to $\rho_{gX}$ and $\rho_{gP}$ in binary datasets, a "dimension-corrected D" could be a "superior alternative" in the polytomous cases. As far as is known, such a correction has not been proposed yet. The aim of the article is to derive a dimension-corrected version of D for measurement modelling settings to reduce the obvious underestimation.

Research questions
This article derives a dimension-corrected version of D for item analysis settings. After the derivation, the following questions are asked:
1) What are the general characteristics of the new coefficient in comparison with $\rho_{gX}$?
2) What are the sampling variance and standard error of the new coefficient?
3) To what extent does the new coefficient produce obvious underestimates?
4) To what extent does the new coefficient produce obvious overestimates?

Research design
The study starts by deriving a "dimension-corrected D". This is done by modelling the error in D in 1,296 datasets with different numbers of test-takers (N), test lengths (k), difficulty levels (p), reliabilities, and degrees of freedom in the item, df(g) = r − 1, and in the score, df(X) = s − 1. The datasets and items are presented in the next section.
After the derivation of the new coefficient, the asymptotic sampling variance and standard error are derived, and a numerical example of the use of the coefficient is given together with a comparison with the relevant benchmark coefficients.
The general characteristics of the new coefficient, including its behavior in extreme datasets, its limits, and the potential over- and underestimation, are then studied.
Finally, the advantages, limitations, and possible ways to utilize the coefficient are discussed, and suggestions for further studies are given.

Datasets used in the derivation
The dimension correction to D is derived by using 13,392 real-world items from 1,296 datasets and the knowledge of the pattern of underestimation related to df(g) illustrated in Figure 2. Table 1 shows the essential characteristics of the tests in the derivation. Notably, the comparatively high reliabilities of the tests with difficult and extremely difficult items (0.901-0.956) reflect the fact that the artificial datasets appeared to produce notably higher item-total correlations than the real-world datasets. This matter and its effects are discussed in Section "Main limitations of the new coefficient and the process used in derivation".
These 1,296 tests produced 13,392 items with varying item characteristics (Table 2). Notably, due to the process of forming the datasets (see Appendix), the items with small degrees of freedom in the item scale (df(g) < 4) are counted in thousands, while the items with high degrees of freedom (df(g) > 10) are counted in tens.

Data analysis
The data manipulation was done in the IBM SPSS 25 environment. The data mining tool Decision Tree Analysis (DTA) and the related CHAID algorithm (Kass, 1980; IBM, 2011) were used in seeking the cut-offs of the variables that explained the obvious underestimation for the dimension-corrected D. Manual calculations were done by using standard spreadsheet software. The dataset of polychoric correlation coefficients comprises 5,354 items from 518 tests, formed by balancing the items from the real-world and artificial datasets.

Principles underlying the modelling of the dimension-corrected D
Based on our knowledge of the characteristics of D and $\rho_{gX}$, underlying the process of deriving the correction elements, four main notes (N) were made and four consecutive principles (P) were followed:
N1. D gives a credible estimate of IDP when df(g) = 1 (Metsämuuronen, 2020a).
P1. D should be corrected only when df(g) > 1.
P2. The estimate by the dimension-corrected D should be higher than that by $\rho_{gX}$ to overcome the obvious underestimation of IDP in $\rho_{gX}$.
N3. The higher df(g) is, the more D tends to underestimate IDP (Metsämuuronen, 2020a; Newson, 2002).
P3. The higher df(g) is, the greater the correction should be. However, with deterministic patterns, the corrected estimate should reach the perfect value 1.
Because there were no theoretical reasons or empirical evidence to assume that D would under- or overestimate IDP when df(g) = 1 (P1), the initial linear model of the expected non-underestimating value for D is based on the assumption that there is no need to correct the estimates in the dichotomous case. Both the assumption of linearity of the non-underestimation and the assumption that the estimate by D is true when df(g) = 1 are questionable and can be debated. All in all, we do not know whether the non-underestimation should be linear or curvilinear in nature. In the deterministically discriminating dichotomous dataset with an evenly distributed score, the underestimation is elliptic in nature (see Eq. 26 in "Potential overestimation in D2" below). From Greiner's relation (Eq. 4) we know that, in some cases, it is a trigonometric function. Here the linear option is selected because of its simplicity.

Modelling the dimension-corrected D
The dimension-corrected Somers' D, later called D2, is based on modeling the underestimation in 13,392 empirical values of D. Figure 4 illustrates the starting point of the modeling (cf. Figure 2). The dataset suggests that a cubic model fits reasonably well (Figure 4). However, the model is somewhat misleading because the polynomial curve should go through the points (1/df(g) = 0, D = 1) and (1/df(g) = 1, D = 0.6284). The first point indicates that, with indefinitely many categories in an item with maximal discrimination, D should reach the value 1 in the same manner as the other coefficients would. The second point refers to the expected level when df(g) = 1.

Figure 4. The original model of Somers' D and initial models D20 and D21
The correction in D is based on combining the corrected third-degree model of the observed average levels of D against 1/df(g) (D20, Eq. 5) and a linear model of the expected levels at varying 1/df(g) (D21, Eq. 6). With x = 1/df(g), the third-degree model fitted to the original data in Figure 4 is y = −1.0179x³ + 2.0096x² − 1.3169x + 0.9542. The corrected third-degree model D20, passing through the points (1/df(g) = 0, D = 1) and (1/df(g) = 1, D = 0.6284), is

$$D_{20} = -x^3 + 2x^2 - 1.3716x + 1 \qquad (5)$$

where 0.3716 = 1 − 0.6284.
The magnitude of the underestimation is unknown. For modeling purposes, the "correct" level of D (D21) was set to be linear through the points (1/df(g) = 0, D = 1) and (1/df(g) = 1, D = 0.6284) (see Figure 4). This theoretical level of D at each df(g) is

$$D_{21} = -0.3716x + 1 \qquad (6)$$

The average level of discrepancy between the theoretical level and the observed level at each level of df(g) is denoted by DE and gives the initial correction for D, from which the final suggestion for the dimension-corrected Somers' D is formed (Eq. 9). By using light algebra, Eq. (9) can be further modified into Eq. (10), where D refers to Somers' D(g|X) and A is given in Eq. (11). The correction in Eq. (10) is applied to the absolute value of D; that is, we first form the dimension correction for the absolute value of D as in Eq. (10) and then, if the original D is negative, give the negative sign to the outcome. D2 appears to be very promising, and its characteristics are studied in what follows.
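The two models named in the text can be evaluated directly. The sketch below uses the corrected cubic model (D20) and the linear model (D21) with x = 1/df(g); treating the discrepancy as the simple difference D21 − D20 is an assumption made here for illustration, and the code does not reproduce the final correction formula of the article.

```python
def d20(x):
    """Corrected third-degree model of the observed level of D, x = 1/df(g)."""
    return -x**3 + 2 * x**2 - 1.3716 * x + 1


def d21(x):
    """Linear model of the expected ("correct") level of D, x = 1/df(g)."""
    return -0.3716 * x + 1


def discrepancy(df_g):
    """Assumed discrepancy between the expected and the observed level."""
    x = 1 / df_g
    return d21(x) - d20(x)


for df_g in (1, 2, 3, 5, 10, 15):
    x = 1 / df_g
    print(f"df(g) = {df_g:2d}  D20 = {d20(x):.4f}  D21 = {d21(x):.4f}  "
          f"difference = {discrepancy(df_g):.4f}")
```

Under this assumption the difference simplifies algebraically to x(1 − x)², which is zero both in the binary case (x = 1) and in the limit of infinitely many categories (x → 0).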

Asymptotic sampling variance and standard error of D2
Because the statistical properties of Somers' D are well documented (e.g., Agresti, 2010; Newson, 2002, 2006, 2008; Siegel & Castellan, 1988), the behavior of D2 is known in the case of df(g) = 1. In the dichotomous case, the asymptotic sampling variance of D2 can be approximated, both in general and under the hypothesis of independent variables, by formulae (13) to (16), where $n_{ij}$ is the number of cases in cell (i, j) and $n_i$ is the number of test-takers in row category i. Note that formulae (13) to (16) use P and Q at twice the "usual" magnitude seen in Eq. (3). These calculations are somewhat laborious to do manually. Somers (1980) offers a short-cut method, found also in Siegel and Castellan (1988). Notably, the simplified approximation of the sampling variance depends only on the dimensions of the variables. Hence, for all combinations of response patterns with identical dimensions in the cross-tabulation, the sampling variance and the related sampling error are identical.
To derive the corresponding sampling variance for the case of df(g) > 1, we recall the simplified form of D2 that follows from Eqs. (10), (11), and (12). Then, by using the basic laws of variance, we obtain the sampling variance of D2, where A is as in Eq. (11); corresponding forms follow under the hypothesis of independent variables and when using the simplified short-cut by Somers (1980). Notably, the element $(1 - A)^2 < 1$ always and, hence, the sampling variance and standard error of the estimates by D2 are always smaller than those by Somers' D. When testing the null hypothesis, the test statistic is formed as the ratio of the estimate to its standard error. This value is approximately normally distributed with mean 0 and standard deviation 1 when the null hypothesis is true.

A numerical example of D2
As a numerical example of calculating D2, assume a simple polytomous dataset with N = 25 cases as in Table 3, adapted from Cox (1974, p. 177) and Drasgow (1986, p. 70). Let us assume that the dataset concerns an item g and the score X.

Table 3. A hypothetical dataset (Cox, 1974; Drasgow, 1986). Used by permission of the Biometric Society.
X: 69, 72, 77, 78, 80, 81, 85, 86, 87, 88, 92, 93, 96, 99, 101, 103, 104, 108, 112

In the first phase, Somers' D is calculated. For this, a cross-table is formed (Table 4). For the manual calculation of Somers' D, the sums of concordant pairs (P) and discordant pairs (Q) are formed (see Siegel & Castellan, 1988; Metsämuuronen, 2017b). For these, the cell frequencies are denoted by $n_{ij}$. For the concordant pairs, we calculate how many cases there are in the cells below and to the right of the cell $n_{ij}$. By using Eq. (3), the estimate of the association by Somers' D ("score dependent"), that is, D ("g in condition of X")‡, is obtained.

... although without the obvious overestimation (see Figures 5 and 6). Third, the higher df(g) is, the greater the correction in D2. Fourth, D2 does not correct D when item discrimination is deterministic and D = 1. Of the 13,392 items in the simulation, none showed a value that was out of range regarding the limits of correlation.

Figure 5. Average estimates of selected indices of IDP by varying df(g)
‡ Again, it is worth noting the specific wording when it comes to textbooks and outputs related to Somers' D. All the generally known textbooks and software packages use the term "score dependent" for this formula. However, it tells us how well the item discriminates the test-takers after they are ordered by the score, that is, the order in the item depends on the order in the score.
Overall, when it comes to correcting the underestimation in D, D2 behaves logically at all levels of df(g) used in the simulation. On average, D2 underestimates the IDP remarkably less than Somers' D, and notably less than $\rho_{gX}$ and $\rho_{PC}$, as was the motivation for the derivation. We may also note that the average magnitudes of the estimates by D2 tend to follow those of $\rho_{PC}$ when df(g) < 3, as was the case in the numerical example with Table 3. Notably, however, in the simulation dataset, $\rho_{PC}$ tends to start to follow the magnitude of $\rho_{gX}$ when df(g) > 6. This indicates that $\rho_{PC}$ tends to start to underestimate the IDP in the same way as $\rho_{gX}$ does with high degrees of freedom in the item. To some extent, the average magnitudes of the estimates by D2 tend to follow those by $\rho_{PS}$, although without the obvious overestimation (see Figure 6). Notably, the variability in the magnitudes of the estimates by D2 is smaller than that by D at each level of df(g) > 1 (Figure 6).

Limits of D2
When df(g) = 1, D2 = D. The limit of degrees of freedom related to D2 reflects the peculiarity of the magnitudes of the estimates of IDP discussed above. It is worth noting that the dimension-corrected coefficient is created for the case in which the degrees of freedom of the two variables are far from each other. In the theoretical extreme case when df(g) = ∞, that is, with continuous items and an infinite number of test-takers with different item scores (forming an infinite number of categories in the item), the correction in Eq. (10) leads us to the triviality that D2 = 1, seemingly regardless of the actual association between the item and the score. However, indefinitely long "parallel tests" approximate the ultimate magnitude of $\rho_{gX}$ = D2 = 1. Hence, within item analysis settings, with indefinitely many categories in the item(s), the score would also contain an indefinite number of categories and, then, D approximates the magnitude of 1. However, Eq. (25) hints that when two continuous variables with different scales are independent of each other, another kind of correction than Eqs. (10) and (12) may be needed. This restriction of D2 is necessary to keep in mind when applying it to items with a continuous scale with an infinite number of categories. However, we may remember that a continuous scale alone does not lead to the triviality of D2 = 1 because, even with continuous values in the scales, the number of categories in the item may be small and then, obviously, df(g) << ∞. This matter is relevant in measurement modelling settings where the items may be weighted by a factor loading. Regardless of the seemingly continuous scale, the actual weighting of, for example, binary items leads to two categories; instead of categories 0 and 1, we may have categories 0 and 0.678, as an example.
Another viewpoint on this restriction of using D2 is that the contemporary procedures related to item analysis usually involve non-continuous scales in the item. Hence, the condition of df(g) = ∞ is a highly theoretical option and does not relate to the real-life item analysis settings as we face them today.

Obvious underestimation in D2
A simple criterion for obvious underestimation in the estimates by D2 is whether the magnitudes of the estimates are lower than those by $\rho_{gX}$. Knowing that the estimate by $\rho_{gX}$ is practically always an underestimate of IDP, lower values would strictly be indicative of even more underestimation in D2. Of the 7,131 items in the simulation with df(g) = 1, the original D (= D2) included 12 cases (0.1%) where $\rho_{gX}$ > D. All these cases came from the artificial datasets with a relatively high value of $\rho_{gX}$ (see Table 5). The groups and cut-offs were suggested by Decision Tree Analysis (DTA; IBM, 2011). Each factor was analyzed individually by using the CHAID algorithm (Kass, 1980) without further restrictions.
When df(g) > 1, we find 36 additional estimates (0.3%) by D2 where the magnitude of the estimate by $\rho_{gX}$ is higher than that by D2. All these obvious underestimates by D2 (0.4% of the estimates) come from the artificial dataset with an artificial combination of high item discrimination and low item difficulty. As a benchmark, with the original Somers' D when df(g) > 1, as many as 62% of the estimates in the simulation datasets are obviously underestimated. Hence, the number of clearly underestimated estimates by D2 seems relatively low. Some of the characteristics of the obvious underestimates are collected in Table 5. It seems that the probability of obtaining obvious underestimation in real-life datasets is very low when using D2.

Potential overestimation in D2
If the magnitude of an estimate by D2 were higher than 1, it would be an obvious overestimate. In the simulation, none of the items showed this behavior. Otherwise, possible overestimation is not easy to evaluate in strict terms when using real-world datasets. One potential criterion for overestimation in these cases is the theoretical, maximally discriminating Guttman-patterned dataset (Guttman, 1950). In the Guttman pattern, with df(g) = 1, D gives the maximal estimate 1, while the estimates by $\rho_{gX}$ are always smaller than 1; the maximal $\rho_{gX}$ is reached when p = 0.5.
Assuming a score without ties, the highest value of the item-total correlation approximates $\rho_{gX}^{\max} \approx 0.866$ (see Metsämuuronen, 2016) and, hence, the lowest point of the difference is 1 − 0.866 = 0.134. This boundary is illustrated in Figure 7.
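The limit of approximately 0.866 can be checked numerically. The sketch below builds a hypothetical Guttman-patterned binary item on an evenly distributed score without ties (the ranks 1, …, N, an assumption made for illustration) and computes the Pearson correlation; by construction, every test-taker with the item score 1 has a higher total score than every test-taker with the item score 0, so Somers' D of the item equals 1.

```python
from math import sqrt


def pearson(a, b):
    """Plain Pearson product-moment correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def guttman_item(n_cases, p):
    """Binary item in which the top share p of an untied score scores 1."""
    n_correct = round(n_cases * p)
    score = list(range(1, n_cases + 1))  # evenly distributed, no ties
    item = [1 if s > n_cases - n_correct else 0 for s in score]
    return item, score


for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    item, score = guttman_item(1000, p)
    print(f"p = {p:.1f}  item-total correlation = {pearson(item, score):.3f}")
```

The correlation peaks at p = 0.5 and falls symmetrically toward both extremes, which is the shape that the elliptic boundary in the text describes.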

Figure 7. Guttman-pattern as a limit for the possible overestimation
In the binary case, the Guttman boundary follows an ellipse with the parameters $x_0 = 0.5$, $y_0 = 0$, $a = 0.5$, and $b = \rho_{gX}^{\max} \approx 0.866$:

$$\frac{(p - 0.5)^2}{0.5^2} + \frac{\rho_{gX}^2}{0.866^2} = 1 \qquad (26)$$

where p is the item difficulty and 0.866 refers to the limit of the maximum value of the Pearson correlation in the deterministic pattern in the dataset. From Eq. (26) we solve $\rho_{gX}$:

$$\rho_{gX} = 0.866\sqrt{1 - \frac{(p - 0.5)^2}{0.5^2}}$$

and, then, in Guttman-patterned items,

$$D - \rho_{gX} = 1 - 0.866\sqrt{1 - \frac{(p - 0.5)^2}{0.5^2}}$$

This model is used as a rough tool to evaluate the possible overestimation in D2 (Figure 8). In the real-world datasets in the simulation, 18 out of 13,392 estimates by Somers' D (0.13%) exceeded this limit and, in the artificial datasets, 33 (0.25% of all items). In all these cases, the magnitude of the overestimation is nominal (near zero units of correlation).
Notably, in comparison with the original D, D2 produced only one additional estimate, of non-significant magnitude, that exceeded the boundary of the Guttman pattern.
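The elliptic boundary can be turned into a small screening helper. The sketch below follows the ellipse parameters stated in the text (center 0.5, half-axes 0.5 and 0.866); the function names and the way an estimate is flagged are hypothetical and only illustrate how such a rough check could be organized.

```python
from math import sqrt

RHO_MAX = 0.866  # limit of the item-total correlation at p = 0.5 (from the text)


def guttman_rho_limit(p):
    """Item-total correlation on the elliptic Guttman boundary at difficulty p."""
    if not 0 <= p <= 1:
        raise ValueError("item difficulty p must lie between 0 and 1")
    return RHO_MAX * sqrt(max(0.0, 1 - ((p - 0.5) / 0.5) ** 2))


def guttman_gap(p):
    """Largest difference between D = 1 and the item-total correlation at p."""
    return 1 - guttman_rho_limit(p)


def exceeds_guttman_boundary(estimate, rho_gx, p):
    """Flag an estimate whose distance from rho_gX is larger than the boundary."""
    return estimate - rho_gx > guttman_gap(p)


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p = {p:.2f}  limit = {guttman_rho_limit(p):.3f}  gap = {guttman_gap(p):.3f}")
```

Because the boundary is derived for binary items with an untied score, a flag raised by such a helper is a reason to inspect the item rather than proof of overestimation.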

Conclusions
A dataset of 13,392 real-life items with varying characteristics was used to model the underestimation in D and to derive the "dimension-corrected Somers' D" for measurement modelling settings. In its general form, the new coefficient is given by Eq. (10), where D is Somers' D(g|X) ("item in condition of score", or "score dependent" in the standard outputs of software packages) and df(g) is the number of marginal categories in the item minus 1. Within the normal range of non-pathological item discrimination, that is, with a positive association between the item and the score, D2 equals Somers' D in two cases: when df(g) = 1, that is, in binary datasets, and when D = 1, that is, with deterministic item discrimination. As do all the classical estimators of IDP, D2 approaches the value D2 = 1 when the number of categories in the item scale approximates the scale of the score. Additionally, in the highly theoretical case of an infinite number of categories in the item (and, consequently, in the score), D2 approximates D2 = 1 seemingly regardless of the actual value of Somers' D. Under this condition, however, Somers' D itself (as well as all estimators of IDP, because of the mechanical connection between the items and the score) also approximates 1.
In the datasets in the simulation, D2 showed very few cases of obvious underestimation and overestimation. The correction is simple but seems to be effective. With certain restrictions discussed in the section "Main limitations of the new coefficient and the process used in derivation", D2 seems to be superior to the other indices in the comparison not only in binary cases but also in cases where the degrees of freedom increase up to 15 categories; more categories were not used in the simulation.
Overall, D2 corrects the underestimation in D effectively and, hence, in most cases, the magnitude of the estimates is expected to be nearer to the real IDP than those by $\rho_{gX}$. The number of obvious cases of underestimation by D2 is reduced remarkably in comparison with the original Somers' D, from 62% to 0.3% of the estimates with df(g) > 1. In most of these obvious underestimations, the magnitude was close to zero units of correlation. The number of estimates with a possible overestimation did not increase when the boundary of the deterministically discriminating Guttman pattern was kept as the criterion. The possible overestimation in the dimension-corrected D may need more study, though. Other limitations of the new coefficient are discussed in the section "Main limitations of the new coefficient and the process used in derivation".

Some advantages of D2
Combining the advantages of Somers' D from Metsämuuronen (2020a) and Newson (2002) as well as the empirical findings in this article, the dimension-corrected Somers' D could be proposed as one of the "superior alternatives" to the classical estimators of IDP. These kinds of datasets, where the order of the test-takers in the item is the same as in the score, are more frequent among the small datasets relevant in, for example, classroom testing settings. In these patterns, unlike the other estimators, D2 = 1 always, irrespective of the number of cases, the degrees of freedom of the item and the score, the number of tied values, the difficulty levels of the items, or the number of items on the test.
9. D2 is reasonably easy to calculate even manually in practical test settings such as classroom testing, while the calculation of $\rho_{PC}$ requires specific software packages and complex procedures.
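The behavior in the deterministic pattern can be illustrated with a minimal sketch for the uncorrected Somers' D only; the dimension-correction itself is not reproduced here. The sketch assumes that the "score dependent" direction means that pairs tied on the item are excluded from the denominator. The function names and the six-person dataset are hypothetical and serve only as an illustration.

```python
from itertools import combinations
from math import sqrt


def somers_d_item_given_score(item, score):
    """Somers' D with pairs tied on the item excluded from the denominator.

    Assumes at least two different item values (otherwise division by zero).
    """
    concordant = discordant = untied_on_item = 0
    for (g1, x1), (g2, x2) in combinations(zip(item, score), 2):
        if g1 == g2:
            continue  # pair tied on the item: not counted at all
        untied_on_item += 1
        product = (g1 - g2) * (x1 - x2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0: tied on the score only; stays in the denominator
    return (concordant - discordant) / untied_on_item


def pearson_r(u, v):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))


# Hypothetical classroom-sized data: a binary item ordered exactly like the score.
item = [0, 0, 0, 1, 1, 1]
score = [1, 2, 3, 4, 5, 6]

d = somers_d_item_given_score(item, score)
r = pearson_r(item, score)
print(d, round(r, 3))
```

In this pattern every pair that differs on the item is concordant, so D reaches 1, while the Pearson correlation for the same data stays below 1 because the item has fewer categories than the score.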

Main limitations of the new coefficient and the process used in derivation
One obvious challenge in generalizing the new coefficient is that D2 was developed for item analysis settings. In these settings, df(g) << df(X) always holds, and the items and the score are mechanically connected. Notably, the dimension-correction automatically leads the coefficient to approximate the perfect value D2 = 1 (or, in the ultimate pathological case, D2 = -1) when the item is continuous and the sample size is large. Because of this, the applicability of D2 may be reduced outside the measurement modeling settings. Hence, it is not wise to use D2 as a general coefficient without further studies and possible amendments. The coefficient is also suitable for negative values of D, although these are pathological cases in item analysis settings.
Second, during the process, the benchmark of the possible underestimation was Pearson's product-moment correlation coefficient, while some other coefficient might have been more appropriate. In any case, the correction seems to bring us nearer the true IDP also in comparison with other indices. More studies are needed in this respect. Specifically, from this viewpoint, an interesting benchmark would be a coefficient called r-polyreg correlation, that is, an r-polyserial estimated by regression correlation (Lewis et al., 2003, cited by Livingston & Dorans, 2004). This coefficient, developed to overcome the challenge of obvious overestimation in $\rho_{BS}$ and $\rho_{PS}$, can be used with binary or polytomously scored items, and it produces estimates that do not exceed 1, nor does it rely on bivariate normality assumptions (Moses, 2017).
Third, the correction elements in D2 are based on simulation with empirical items that embed the limitations of the original datasets to a certain extent. We do not know how much the estimates depend on the original dataset. However, we note that there are no numerical sub-coefficients in the correction factors in Eqs. (10) and (11). Hence, to some extent, the new coefficient is free from the original dataset and the correction is more general than is the case when it includes specific numerical coefficient(s) strictly dependent on the underlying dataset. Seeing that the values arrived at are based on 13,392 items with varied characteristics and a strong base in the real world, the estimates are likely to be quite stable in relation to real-life settings of testing, although wider simulations may give more insights in the matter.
Cross-validating the model by using datasets from the same basic population and the same test items would not challenge the models profoundly. Specifically, a simulation where the degrees of freedom of the item are higher than seven would enrich our knowledge of the coefficient; the dataset used in the simulation in this article contained few items of this kind. Also, simulations regarding the possible over- and underestimation of association in general would be beneficial.
The fourth limitation is that the dimension-correction is modeled for Somers' D(g|X) and not for Somers' D(X|g) or for the symmetric D. Hence, the correction cannot necessarily be generalized, though it may carry general elements for df(X) >> df(Y) or df(Y) >> df(X). From the measurement-modeling viewpoint, however, the direction D(g|X) ("item in condition of score") is more relevant than D(X|g) ("score in condition of item"). In any case, it would be valuable to study whether the correction elements developed in this study are valid also in the symmetric case and in the case of D(X|g). It may turn out that when the degrees of freedom of the variables are nearer each other, we need the degrees of freedom of both variables in the correction; now, only df(g) appeared to be a significant factor in the correction.
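How the three versions of the uncorrected Somers' D differ can be sketched as follows. The sketch assumes that D(g|X) excludes the pairs tied on the item from the denominator (the "score dependent" convention), that D(X|g) excludes the pairs tied on the score, and that the symmetric version uses the mean of the two denominators. The function name and the eight-person dataset are hypothetical, and no dimension-correction is applied.

```python
from itertools import combinations


def somers_d_variants(item, score):
    """Return both directional versions and the symmetric version of Somers' D."""
    p_minus_q = untied_item = untied_score = 0
    for (g1, x1), (g2, x2) in combinations(zip(item, score), 2):
        untied_item += g1 != g2    # pair not tied on the item
        untied_score += x1 != x2   # pair not tied on the score
        product = (g1 - g2) * (x1 - x2)
        p_minus_q += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return {
        "D(g|X)": p_minus_q / untied_item,
        "D(X|g)": p_minus_q / untied_score,
        "symmetric": 2 * p_minus_q / (untied_item + untied_score),
    }


# Hypothetical four-category item (df(g) = 3) against a wider score scale.
item = [0, 0, 1, 1, 2, 2, 3, 3]
score = [2, 5, 4, 7, 6, 9, 8, 11]

result = somers_d_variants(item, score)
for name, value in result.items():
    print(name, round(value, 3))
```

Because the item has far fewer categories than the score, many more pairs are tied on the item than on the score, so the three versions share the same numerator but differ in the denominator.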

Some suggestions for further studies
One natural direction for further studies is to study the new coefficient itself. First, larger simulations would confirm the characteristics of the new coefficient. While there is a need for a simulation with degrees of freedom higher than seven to see how much the small number of estimates affected the correction elements in this range, simulations are also needed to confirm or alter the coefficient in case the degrees of freedom of the two variables are close to each other.
Second, because D2 is a new index of correlation related to item discrimination, it would be valuable to compare its characteristics with those of some other new, well-behaving coefficients, such as the r-polyreg correlation.
Third, because D2 is also a new coefficient of association, its properties may be valuable to study from that viewpoint as well. We may also ask: does the coefficient carry the essential characteristics of Somers' D at all, or should it be taken as a totally new coefficient based on Somers' D?
A fourth direction for future research is to study the new coefficient in relation to other relevant aspects of measurement modeling. The new coefficient may have relevance when estimating the "dimension-corrected reliability" of the test score, for example. Item-total correlation, which always underestimates the connection of the score and the item, is embedded in all widely used estimators of reliability because, in the classical forms of reliability, the element $\sigma_X^2$ can be expressed by using $\rho_{gX}$ as $\sigma_X^2 = \left(\sum_{i=1}^{k} \sigma_{g_i} \rho_{g_i X}\right)^2$ (Lord & Novick, 1968), where k refers to the number of items, booklets, or partitions of the test items. This matter concerns such classical estimators of reliability as the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910), the Flanagan or Flanagan-Rulon formula (Flanagan, 1937; Rulon, 1939), the family of Guttman's Lambda (Guttman, 1945), as well as the classical formula KR20 by Kuder and Richardson (Kuder & Richardson, 1937) and its generalized version, coefficient alpha (in chronological order, Guttman, 1945; Gulliksen, 1950; Cronbach, 1951). As the magnitude of $\rho_{gX}$ is always lower than it should be, Metsämuuronen (2016) argued that this mechanical underestimation is at least one of the reasons why the classical coefficients tend to underestimate reliability. We may note that item-total correlation is embedded also in the processes of calculating more advanced estimators of reliability based on factor analysis, such as McDonald's Omega (McDonald, 1999) and maximal reliability (e.g. Li, 1997; Raykov, 2004; 2005 onwards), because factor loadings in orthogonal rotation are (Pearson) correlations between the (weighted) items and the (latent) factor. This means that the very essence of factor loading is item-scale correlation. Perhaps D2 could be used instead of Pearson correlation (or some other estimator) in these formulae and procedures.
This may lead us to correct the estimates obtained by the classical estimators such as coefficient alpha and maximal reliability and, hence, we can get nearer the real reliability than we can by using the traditional estimators, or at least this can give us the "dimension-corrected reliability".
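Where the item-total correlation enters the classical reliability formulae can be made concrete with a small numerical sketch. It assumes the identity in the form $\sigma_X = \sum_{i=1}^{k} \sigma_{g_i} \rho_{g_i X}$ with population (divide-by-n) variances and uses coefficient alpha in its usual variance form. The response matrix and the helper names are hypothetical, and substituting D2 for the Pearson correlations is not attempted here.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Population covariance (divides by n)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


# Hypothetical responses: rows are test-takers, columns are k = 4 items.
responses = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [2, 1, 1, 1],
    [2, 2, 1, 1],
]
items = list(zip(*responses))              # one tuple of responses per item
score = [sum(row) for row in responses]    # total score X
k = len(items)

sigma_x = sqrt(cov(score, score))
sigma_g = [sqrt(cov(g, g)) for g in items]
rho_gx = [cov(g, score) / (s * sigma_x) for g, s in zip(items, sigma_g)]

# Identity: sigma_X equals the sum of sigma_g * rho_gX over the k items.
sigma_x_from_items = sum(s * r for s, r in zip(sigma_g, rho_gx))

# Coefficient alpha in its usual variance form.
alpha = k / (k - 1) * (1 - sum(s ** 2 for s in sigma_g) / sigma_x ** 2)

print(round(sigma_x, 6), round(sigma_x_from_items, 6), round(alpha, 3))
```

Since the score variance in the denominator of alpha can be rebuilt entirely from the item standard deviations and the item-total correlations, any systematic underestimation in those correlations is carried over to the reliability estimate.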
Fifth, the directional nature of the coefficient and its possible usefulness within the modern measurement modeling processes may be worth studying. The nondirectional Pearson product-moment correlation coefficient and the family of polychoric correlations are deeply embedded in the procedures of EFA and SEM analyses. A relevant underlying question that arises from the directional D2 and the underlying Somers' D is why, in the first place, we are willing to use nondirectional correlation coefficients in our testing and measurement modeling settings, while the whole philosophy of measurement modeling is based on the idea of directionality: the latent trait, manifested as the score or the measurement scale, determines the observed behavior and not the other way round (e.g. Byrne, 2001; Metsämuuronen, 2017b). In psychometric theory, the overall trait being measured generally drives examinees' responses to, and thus scores/measurement scales on, individual items (see the discussion in Metsämuuronen, 2020a). Then, the family of directional coefficients of correlation seems to be at least a possible, if not advisable, alternative for measurement modeling. The directional, dimension-corrected correlation coefficient D2 could be a relevant option to consider from this point of view.
Overall, Somers' D seems to be a very promising tool within measurement modeling settings because of its natural characteristic of directing the connection of two variables in the same way as in the settings of structural equation modeling. With the dimension-correction, D2 could be an even more useful tool both in item analysis settings and in measurement modeling. It may help us get closer to the real connection between the latent and manifest variables, real item discrimination, and real reliability.