Generalized Discrimination Index

Kelley’s Discrimination Index (DI) is a simple and robust classical non-parametric short-cut for estimating the item discrimination power (IDP) in practical educational settings. Unlike item–total correlation, DI can reach the ultimate values of +1 and −1, and it is robust against outliers. Because of its computational ease, DI is specifically suitable for rough estimation where sophisticated tools for item analysis, such as IRT modelling, are not available, as is usual, for example, in classroom testing. Unlike most other traditional indices of IDP, DI uses only the extreme cases of the ordered dataset in the estimation. One deficiency of DI is that it suits only dichotomous datasets. This article generalizes DI to allow polytomous datasets and flexible cut-offs for selecting the extreme cases. A new algorithm based on the concept of the characteristic vector of the item is introduced to compute the generalized DI (GDI). A new visual method for item analysis, the cut-off curve, is introduced based on a procedure called exhaustive splitting.


Item discrimination power as a phenomenon and the underlying statistical model
In the general sense, item discrimination power (IDP) is a loose term for the characteristic of a test item that reflects how accurately or efficiently the item can discriminate the test-takers with a higher item response from those with a lower item response (see ETS, 2020; Liu, 2008; Lord & Novick, 1968; MacDonald & Paunonen, 2002). In achievement testing with multiple-choice questions resulting in binary items, as an example, we ask how accurately the item can make a difference between those test-takers who gave the correct answer and those who gave an incorrect answer. Within test theory and test construction, item discrimination power is one of the two to five item parameters that characterize the test items (e.g., Lord & Novick, 1968). The two other parameters most commonly used are item difficulty and the pseudo-chance level (or guessing), while the fourth and fifth parameters are rarely in practical use (see the discussion in Balov & Marchenko, 2016; Barton & Lord, 1981; Loken & Rulison, 2010; Metsämuuronen, 2017).
For convenience, we assume that the item g has R ordered categories with g_1 ≤ g_i ≤ g_R and the score X has S ordered categories with x_1 ≤ x_j ≤ x_S, as illustrated in Figure 1. From the correlation viewpoint related to item analysis with polytomous items, the inferred correlation between the two latent variables is the polychoric correlation, the inferred correlation between the latent trait θ and the observed X is the polyserial correlation (ρ_θX), the observed correlation between the interval-scaled variables g and X is the point-polyserial correlation, that is, the traditional item-total correlation (ρ_gX), and the observed correlation between the ordinal-scaled variables g and X is the rank-polyserial correlation (ρ_RP). From the viewpoint of Kelley's discrimination index, the focus of this article, the middle categories in the score are omitted from the analysis.

Selected indices of item discrimination power
The indices of IDP reflect the relationship between an item and a trait of interest (Moses, 2017). During the history of test theory, many indices of IDP have been created and developed (see the comparisons by Cureton, 1966a, 1966b; ETS, 1960; Liu, 2008; Metsämuuronen, 2020; Oosterhof, 1976; Wolf, 1967). Recently, Metsämuuronen (2020), as an example, studied the efficiency of nine frequently discussed classical indices. Though the contemporary theoretical literature on item parameters has mainly concentrated on different aspects of modern test theory, that is, on item response theory (IRT) modeling (e.g., Balov & Marchenko, 2016; Cechova, Neubauer, & Sedlacik, 2014), and the large-scale assessments (e.g., PISA, TIMSS, PIRLS, PIAAC) mainly use IRT modeling in the analysis (e.g., Aslan & Aybek, 2019; Esendemir & Bindak, 2019), many of the classical indices are still in wide use in practical item analysis settings.
Two widely used classical indices of IDP are the item-total correlation (Rit = ρ_gX; based on Pearson, 1896) and the item-rest correlation (Rir; proposed by Henrysson, 1963 and supported by Cureton, 1966b), also known as the "corrected item-total correlation" (e.g., in the outputs of the IBM SPSS and STATA software packages). These are defaults in widely used general statistical software packages such as IBM SPSS (e.g., IBM, 2017), Stata (e.g., Stata Corp., 2018), and SAS (e.g., Yi-Hsin & Li, 2015). Here, we note the mechanical connection of Rit and reliability: because Rit is embedded in coefficient alpha (see Eq. 1), and because alpha is the most widely used estimator of test reliability in practical settings (see the worry about its too wide use by Dunn, Baguley, & Brunsden, 2013; Graham, 2006; Green & Young, 2009; Hogan, Benjamin, & Brezinski, 2000; Trizano-Hermosilla & Alvarado, 2016; Yang & Green, 2011), Rit may be the most widely used of all the indices of IDP, though not always consciously.
Both Rit and Rir are, essentially, Pearson product-moment correlation coefficients: the first between the item and the total score and the latter between the item and the score from which the item of interest is omitted. Both embed challenges in measurement modelling settings. Metsämuuronen (2016; see also 2020) showed that the item-total correlation always underestimates the IDP when the scales of the item and the score differ from each other, and this underestimation may be grave when the difficulty level of the item is extreme. Metsämuuronen (2017) showed that, paradoxically, the "corrected" item-total correlation underestimates IDP even more than the "uncorrected" item-total correlation; this is obvious because the magnitudes of the estimates by Rir are always lower than those by Rit. The relevant question is what could be a better index than Rit and Rir within the classical toolbox. After comparing nine indices (Rit, Rir, bi- and polyserial correlation, polychoric correlation, Goodman-Kruskal Lambda and Tau, Pearson Eta, and Somers' D), Metsämuuronen (2020) suggests Somers' D (Somers, 1962) as one of the "superior alternatives" to Rit and Rir in the binary dataset.
Another kind of possible "superior alternative" to Rit worth studying is Kelley's Discrimination Index (DI; Kelley, 1939), suggested to classroom teachers over the years by, among others, Ebel (1954a, 1954b), Educational Testing Service (ETS, 1960), Wiersma and Jurs (1990), Mehrens and Lehmann (1991), and Metsämuuronen (2017). DI is one of the short-cut methods for practical testing settings because of its simplicity: it can be estimated without sophisticated statistical tools (see Cureton, 1966a). DI and its generalized version GDI are the focus of this article.

Why is item discrimination power important and why is Kelley's DI interesting?
Of the item parameters, discrimination power is an interesting characteristic of the test item because it has a strict connection to test reliability. We remember that Lord and Novick (1968, p. 344) introduced a modification of the alpha coefficient (α) in which the index of IDP (Rit = ρ_gX) is embedded:

α = (k/(k−1)) × (1 − Σσ_g² / (Σσ_g ρ_gX)²)    (1)

where σ_g² refers to the item variances and k is the number of items. This coefficient is algebraically identical with the classical formula of coefficient alpha published in Gulliksen (1950) and Cronbach (1951), based on works by Kuder and Richardson (1937), Flanagan (1937), Rulon (1939), and Guttman (1945). Further, if we take the derivation of Kuder and Richardson seriously, where we assume parallelism of the items (see the critique of using the classical forms by, e.g., Tarkkonen, 1987 and Vehkalahti, 2000), the less used classic estimator of reliability by Kuder and Richardson (1937), KR21, gives us a rough estimate for reliability with minimal factors:

ρ_KR21 ≈ (k/(k−1)) × (1 − 1/(k × Rit²))    (2)

The simplified formula means that the only factor determining the magnitude of the estimate of reliability, except the number of items (k), would be the magnitude of item discrimination (Rit = ρ_gX). From this perspective, Ebel (1967), based on Stanley (1964), provided another kind of estimator of reliability that combines the alpha type of estimator with Kelley's Discrimination Index (DI). The value of Ebel's estimator seems to be, in many cases, lower than that of coefficient alpha. Hence, knowing that coefficient alpha always underestimates the real reliability, Ebel's estimator seems to underestimate reliability even more. Maybe this is the reason why the latter formula is not in practical use.
All in all, to give a rough estimate of the reliability of the test, the only things needed, in addition to the number of items and the item variances, are the estimates of item discrimination. In the practical settings of compiling a test from single items, the more items with high discriminating power we select for the test, the more discriminating the test will be and, contrastingly, items with very low item discrimination power are usually omitted from the final compilation to raise the reliability. Therefore, the indicators of IDP and the estimates they produce are interesting from the general viewpoint of reflecting the accuracy of the whole test.
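Because σ_X = Σ_g σ_g ρ_gX holds as an algebraic identity for any dataset, Rit is mechanically embedded in the score variance that appears in alpha. A minimal numerical sketch of this identity, and of the equivalence of the classical and the Rit-embedded forms of alpha, using made-up binary responses (all names and data here are illustrative, not from the article):

```python
import statistics as st

def pearson(x, y):
    # Plain product-moment correlation from sums of cross-products.
    mx, my = st.fmean(x), st.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

# Three binary items for six test-takers (each row is an item).
items = [[0, 0, 1, 1, 1, 1],
         [0, 1, 0, 1, 1, 1],
         [0, 0, 0, 0, 1, 1]]
X = [sum(col) for col in zip(*items)]     # total score per test-taker

sd = st.pstdev                            # population standard deviation
sigma_X = sd(X)
embedded = sum(sd(g) * pearson(g, X) for g in items)   # sum of sigma_g * Rit_g
assert abs(sigma_X - embedded) < 1e-9     # sigma_X = sum sigma_g * rho_gX

k = len(items)
item_var = sum(sd(g) ** 2 for g in items)
alpha_classic = k / (k - 1) * (1 - item_var / sigma_X ** 2)
alpha_eq1 = k / (k - 1) * (1 - item_var / embedded ** 2)   # Eq. (1) form
assert abs(alpha_classic - alpha_eq1) < 1e-9
```

The identity follows from σ_X² = Σ_g Cov(g, X) = Σ_g σ_g σ_X ρ_gX, so raising the item-total correlations mechanically raises the estimate of alpha.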
DI has some positive characteristics over Rit and Rir, which make it an interesting coefficient to study further: (1) It can detect deterministic patterns and reaches the value DI = ±1 correctly, while Rit and Rir cannot reach the ultimate values and, hence, always underestimate the association between the item and the score. Hence, (2) DI does not obviously and mechanically underestimate the association between the item and the score, as Rit and Rir do when the scales of the item and the score are not equal.
(3) DI is easy to calculate in the practical settings related to item analysis, and (4) unlike the coefficients based on the Pearson product-moment correlation coefficient (Rit and Rir), being based on the order of the test-takers by the score, it is robust against changes in the dataset and outlier values. Hence, while DI may be studied as a "superior alternative" to Rit in the binary case, generalized DI could be used in the polytomous cases.
The main limitation of DI is that it is restricted to binary datasets. Also, the traditional way of using DI limits its use to fixed cut-offs of the extreme values in the analysis. This article generalizes DI to polytomous datasets with flexible cut-offs.

Research questions
This article discusses the possibilities of Kelley's DI as a useful tool for item analysis in practical educational testing settings. The first part of the article asks how stable the estimates by DI are in comparison with those by Rit. This is illustrated by using a simple example with deterministically discriminating Guttman-patterned items (Guttman, 1950). The example also illustrates the basic difference between the estimates by Rit and DI: while Rit cannot reach the ultimate value Rit = ±1 in real-life testing settings, DI can reach the value DI = ±1.
The latter part of the article derives the generalized version of DI that can be used with binary and polytomous items and with multiple cut-offs. This part answers the following research questions: (1) What are the characteristics of the generalized DI (GDI)? (2) How can DI and GDI be calculated by using new computational algorithms? (3) How can GDI be used in practical educational testing settings? (4) How could a new graphical method of visualizing item discrimination power be used in practical testing settings for locating the latent difficulty level and non-logical test behavior, as well as for assessing the stability of the estimate of item discrimination power?

Methodology
The treatment in the article is mainly theoretical and conceptual. Hence, specific methodological tools are not in use.
Within the course of the study, a new kind of methodology is developed for practical testing settings, to be utilized in analyzing both dichotomous and polytomous items.
The course of the study starts by introducing the original Kelley's DI. In this section, the peculiarity of DI of not using all the test-takers in the analysis is discussed, together with the rationale for using different cut-offs. Here, DI is also compared with Rit by way of example to illustrate the extreme values of DI and how small changes in the dataset do not change the value of DI while they always change the value of Rit.
Generalized DI is derived in the next main section. This requires a new operationalization of the traditional notation related to DI. Some numerical examples of using GDI are discussed, and new computational algorithms are provided to calculate DI and GDI.
Finally, a new visual method related to GDI, the cut-off curve, is introduced for illustrating item discrimination power and item difficulty as a further elaboration of GDI. The new method is based on an exhaustive splitting procedure (PES) of the dataset. Some practical hints are given to practical users on how to use the tool.

Kelley's DI
Though DI is quite an old innovation, originally created for the validation of items (chronologically, Long & Sandiford, 1935; Kelley, 1939; Johnston, 1951), it is still in use in practical item analysis, especially in educational settings (see some examples in the Appendix). In some rare works, DI has been connected to Rasch and IRT modelling (e.g., Bazaldua, Lee, Keller, & Fellers, 2017; Kelley, Ebel, & Linacre, 2002; Tristan, 1998) and Bayesian inference (e.g., Batanero, 2007). However, in general, DI is not widely handled in the contemporary theoretical writings. Nevertheless, DI may be in semi-wide practical use because it has been specifically suggested for teachers by leading authors over the years, and because it is very easy to use in environments where sophisticated software packages for item analysis are not in use, as discussed above.
In comparison with other indices of IDP, the calculation of DI embeds a peculiarity: it uses only the extreme cases in the estimation. Though the rationale of not using all the cases in the estimation is not obvious, by using DI we ask the same essential question as with the other indices: how well the test item can discriminate between the lower- and higher-performing test-takers. Because it is difficult to discriminate the medium-range cases from each other and, hence, they may confuse the possible interpretations, the logic behind DI of comparing only the extreme cases seems reasonable. Because of the mechanism of selecting the extreme cases for the analysis, different cut-offs for the extreme groups have been widely discussed over the years: in the early phase by, for example, Long and Sandiford (1935), Kelley (1939), and Forlano and Pinter (1941), and later by, for example, Cureton (1966a), D'Agostino and Cureton (1975), Ebel (1967), Feldt (1963), Ross and Lumsden (1964), and Ross and Weitzman (1964).
Kelley's DI is traditionally calculated by using the following procedure. Assume a test with N test-takers ordered by the score (X). The test-takers are divided into two groups consisting, traditionally, of only the highest and lowest 25% (e.g., D'Agostino & Cureton, 1975; Mehrens & Lehman, 1995; Metsämuuronen, 2017) or 27% (e.g., Ebel, 1967; Feldt, 1963; Kelley, 1939; Ross & Weitzman, 1964) of the test-takers. These cut-offs are denoted by the upper fourth (U), consisting of the highest-scoring test-takers, and the lower fourth (L), consisting of the lowest-scoring test-takers. By using this notation, assuming a binary item, DI can be expressed as follows (e.g., Metsämuuronen, 2017, p. 125):

DI = (R_U − R_L) / (½T) = p_U − p_L = 2(p − p_L)    (3)

where R_U and R_L refer to the numbers of correct answers in the upper and the lower fourth of the ordered dataset, and T refers to the total number of observations in the two parts together. Consequently, p_U and p_L refer to the proportions of correct answers in the upper and the lower part of the reduced dataset, and p is the proportion of correct answers in the reduced dataset.
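The procedure above can be sketched in a few lines of code; the function name, the stable-sort tie handling, and the rounding of the group size are illustrative choices, not prescribed by the article:

```python
# Traditional DI for a binary item; 'cut' is the proportion of test-takers
# in each extreme group (27% or 25% in the traditional usage).
def kelley_di(item, score, cut=0.27):
    order = sorted(range(len(score)), key=lambda i: score[i])  # low to high
    a = max(1, round(cut * len(score)))           # size of each extreme group
    p_L = sum(item[i] for i in order[:a]) / a     # proportion correct in L
    p_U = sum(item[i] for i in order[-a:]) / a    # proportion correct in U
    return p_U - p_L                              # DI = p_U - p_L

# A Guttman-patterned item of 12 test-takers: DI detects the pattern.
print(kelley_di([0] * 4 + [1] * 8, list(range(12)), cut=0.25))  # -> 1.0
```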

Possible cut-offs and a more general notation
The discussion of different cut-offs (see the literature above) is relevant from the viewpoint of the generalized discrimination index introduced in the latter part of the article. The reason for the 25% or 27% cut-offs is that, in a normal distribution, the 27% cut-off maximizes the difference in the population and, hence, is considered statistically better than the 25% cut-off (Kelley, 1939; Wiersma & Jurs, 1990). In the original derivation, Kelley assumed a normal distribution of the score and 50% of passes for the entire item (i.e., p = 0.50). If the difficulty level of the items differed from p = 0.50, Kelley wrote: "The proportions undoubtedly would not be twenty-seven per cent from the extremes" (p. 70). Forlano and Pinter (1941), after studying the cut-offs of the upper and lower 50%, 33%, 27%, 16%, and 7%, concluded that no method occupies the first rank. However, they preferred 27% because it is a simple and rapid, rough-and-ready method. Also, Feldt (1963), after showing that 27% yielded the most precise estimate of the tetrachoric coefficient only when the population correlation was close to zero, suggested no change to the traditional 27% because it yields a highly efficient estimate. All in all, during the history, many different cut-offs have been discussed. Hence, Metsämuuronen (2017) proposed a general notation of DI:

DI_i = 2(p_i − p_Li)

where p_i refers to the proportion of correct answers in a specific cut-off i, and p_Li refers to the proportion of correct answers in the lower (L) part of the cut-off i. This kind of notation seems relevant in procedures where many, if not all, possible cut-offs are used in the analysis.
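The equivalence of the general notation 2(p_i − p_Li) with the plain difference p_U − p_L in a symmetric cut-off can be checked numerically (the values below are illustrative, not from the article):

```python
# A symmetric cut-off with 5 test-takers in each extreme group.
upper = [1, 1, 1, 0, 1]
lower = [0, 1, 0, 0, 0]
p_U = sum(upper) / len(upper)                          # 0.8
p_L = sum(lower) / len(lower)                          # 0.2
p = sum(upper + lower) / (len(upper) + len(lower))     # 0.5
assert abs(2 * (p - p_L) - (p_U - p_L)) < 1e-12        # both equal 0.6
```

The identity holds because, with equally sized groups, p is exactly the midpoint of p_U and p_L.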

Stableness of the estimates by DI in comparison with those by Rit
Because it is a robust statistic based on the order of the test-takers, DI seems to produce quite stable estimates of IDP. This is discussed through a theoretical example related to so-called Guttman-patterned items (Guttman, 1950; Linacre & Wright, 1996) and a minor stochastic error, illustrated in Tables 1a and 1b. The ultimate Guttman pattern is a theoretical structure of a dataset where the deterministically discriminating test items form a triangle type of dataset with different difficulty levels, with a string of 0s followed by a string of 1s when the cases are ordered consecutively by their total score. Here, the items are called Guttman-patterned when the response pattern is formed of a string of 0s followed by a string of 1s when the cases are ordered consecutively by their latent trait, even though the dataset would not be triangle-formed.
Assume a dataset of a hypothetical test with n = 15 test-takers and k = 4 binary items of a deterministically discriminating nature (Guttman pattern), as in Table 1a. Typical of this theoretical form is that the items discriminate the (hypothetical) test-takers with a higher score from those with a lower score in a deterministic manner. Then, the score explains the behavior in the item in a deterministic manner, and we would expect to see perfect explaining power (η² = 1) and, consequently, perfect correlation (ρ = 1). From Table 1a we note that the item-total correlations (Rit = 0.81–0.94) are reasonably high. The latter indicates high item discrimination, though the estimates do not reach the perfect 1; the algebraic reason for this underestimation is formalized in Metsämuuronen (2016). The corresponding values of DI vary from 0.5 to 1. It is worth noting that in two out of four items, DI detects the deterministic pattern in the items. Hence, DI can reach the ultimate value of +1 (as well as −1) correctly, while Rit always underestimates the IDP in real-life testing settings.
Let us assume that two of the test-takers were marked incorrectly (or that they, unexpectedly given their ability level, gave a correct and an incorrect answer) in item 1, as in Table 1b. Although the difficulty level of the items (proportion of correct answers, p) did not change in the process, the magnitude of the estimates by Rit decreased in all items (by up to 0.22 units of correlation), even though there were no changes in items 2, 3, or 4. In contrast, with DI, though the magnitude of the estimate for item 1 was reduced from 1.00 to 0.75 (0.25 units of correlation), the magnitudes of the estimates of items 2, 3, and 4 did not change because the order of the test-takers did not change in the process.
The stable character of DI is caused by three factors. First, because the middle-range observations are not used in the calculation of DI, changes in those observations do not change the estimate of IDP. Second, because the correct and incorrect responses can be in any order within the cut-off, DI is more robust to changes in the item structure than Rit. Third, because DI uses the score only to order the test-takers, changes in the actual score do not necessarily affect the value of DI of the remaining items if the order of the test-takers does not change radically.
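These points can be illustrated with a small sketch in the spirit of Tables 1a–1b (hypothetical Guttman-patterned data, not the article's exact values): perturbing mid-range responses of one item leaves the DI of another item untouched, because neither extreme group changes.

```python
# Traditional DI with symmetric extreme groups; names are illustrative.
def di(item, score, cut=0.25):
    order = sorted(range(len(score)), key=lambda i: score[i])
    a = max(1, round(cut * len(score)))
    return (sum(item[i] for i in order[-a:]) - sum(item[i] for i in order[:a])) / a

# Two Guttman-patterned binary items, 15 test-takers ordered by ability.
item1 = [0] * 4 + [1] * 11
item2 = [0] * 7 + [1] * 8
score = [a + b for a, b in zip(item1, item2)]
di2_before = di(item2, score)

# Mark two mid-range test-takers wrong in item 1; the extremes are unaffected.
item1_err = item1[:]
item1_err[4] = item1_err[5] = 0
score_err = [a + b for a, b in zip(item1_err, item2)]
di2_after = di(item2, score_err)
assert di2_before == di2_after   # DI of the untouched item 2 is stable
```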
Though DI produces stable estimates and can reach the ultimate values correctly, it has two main deficiencies. First, when using the traditional cut-offs of 27% or 25%, IDP may be underestimated radically when the item difficulty is extreme (p < 0.20 or p > 0.80; see also Tristan, 1998), as seen with item 4 (DI = 0.50 while Rit = 0.812). With items of extreme difficulty level, it would be wise either to use another index of IDP or to use another cut-off than the traditional 27% or 25%. All in all, though DI can detect the deterministic patterns, it is good to note that there are better options than DI for detecting deterministic patterns. One of these is the non-parametric and directed coefficient of correlation Somers' D (Somers, 1962; Metsämuuronen, 2020). In the case of Table 1a, Somers' D would detect the deterministic pattern for all the items. Second, maybe more crucially, DI is developed only for binary items. The next section generalizes DI to allow polytomous responses and several cut-offs.

Generalized DI
Two things are worth pointing out from the previous discussion concerning DI: first, the classical form of DI can be used only with dichotomous items and, second, the cut-offs for DI are not deterministically fixed. Hence, a generalized DI allowing polytomous responses and several cut-offs is discussed and derived in this section. Here, the suggestion is based on the original DI. We may note that Brennan (1972) introduced another kind of generalized upper-lower item discrimination index based on Kelley's DI. Brennan's B is generalized in the sense that the cut-off need not be symmetric. However, Brennan's B is still restricted to dichotomous items and uses a fixed cut-off. Harris and Wilcox (1980) showed that Brennan's B is algebraically equal to Peirce's Theta discussed by Goodman and Kruskal (1959).

General notation for Generalized DI
To formalize the generalized DI, a slightly modified notation and a radically different operationalization of the symbols are suggested. In a general case, in a specific cut-off a, DI can be written as follows:

DI_a = 2(p_a − p_La) = (R_Ua − R_La) / (½T_a)    (4)

(cf. Metsämuuronen, 2017, p. 125, which uses the symbol i instead of a), where the subscript a refers to the symmetric cut-off used in the estimation. Assume that we have 20 test-takers and we use the 25% cut-off in the calculation of GDI. Then, a may refer to (1) the percentage of the cut-off, such as a = 25%; (2) the number of cases in the specific cut-off, such as a = 5; or (3) the actual rank order of the cut-off, such as a = the 5th case, where 5 refers to the rank of the test-taker in the ordered dataset that benchmarks the 25% cut-off.
At this point, it is good to raise a potential challenge in the calculation of GDI (as well as DI). Forming the order for DI and GDI is based on the idea that the test-takers are ranked in a uniquely unambiguous manner, that is, each test-taker has a separate rank order. However, this is not possible when there are ties in the score because, within the tied cases, we do not know the actual order of the test-takers without some relevant rationale. To solve this challenge, an option relevant within achievement testing is suggested to be considered. To acquire an unambiguous order for the cases, the test-takers are double-ordered: first by the score and, second, by the items, or by other relevant information such as the time used in the task. In the latter ordering, the more difficult items (or less time used in the task) are given more weight. The rationale behind the suggestion is that the same score, except in the case of an identical profile of answers, is an outcome of a compilation of different items. We may think that the test-taker who was able to solve more demanding tasks (or used less time in a task) showed slightly higher achievement than the test-taker with the same score who solved less demanding tasks (or used more time in a task). This rationale is embedded in the estimation of IRT modeling; however, its routines are not fully used even in those settings. Another option to solve the challenge, a practical one though less accurate, is to trust randomization in the ordering: all cases are given their own unambiguous rank order based on the score, but the tied cases are in a random order. A third option is to include all the test-takers with the same score in the same bin (either L or U) and to change the cut-off dynamically. This requires, however, a wide scale in the score to make it possible to keep the symmetry in the number of test-takers in the cut-offs.
Finally, one option is to develop a new coefficient based on non-symmetric cut-offs with polytomous items (cf. Brennan, 1972).
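The double-ordering option described above can be sketched as follows; weighting harder items by 1 − p_j is one illustrative choice of "more weight", not a rule fixed by the text, and the function name is hypothetical:

```python
# Double-ordering for tie-breaking: primary key is the score, secondary key
# weights correct answers on harder items more heavily.
def double_order(items, score):
    n, k = len(score), len(items)
    p = [sum(g) / n for g in items]            # item difficulty (p-value)
    weight = [1 - pj for pj in p]              # harder item -> larger weight
    tiebreak = [sum(weight[j] * items[j][i] for j in range(k)) for i in range(n)]
    return sorted(range(n), key=lambda i: (score[i], tiebreak[i]))

# Test-takers 0 and 2 solved only the easy item, test-taker 1 only the hard
# one; with equal scores, test-taker 1 is ranked above them.
print(double_order([[1, 0, 1, 1], [0, 1, 0, 1]], [1, 1, 1, 2]))  # -> [0, 2, 1, 3]
```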

New operationalization of the concepts related to GDI
In order to generalize DI to polytomous scales, new operationalizations of the concepts R_U, R_L, and T are needed. Some new symbols are also used. In what follows, the observed values (O_i) of the test-takers i in the item in the ordered dataset are of interest.

 METSÄMUURONEN / Generalized Discrimination Index
The first main note to make is that, in the general notation for DI in Eq. (4), R_Ua and R_La are re-operationalized as the sums of the observed values of the test-takers in the upper and the lower part of the cut-off a:

R_Ua = Σ_{i∈Ua} O_i    (5)

and

R_La = Σ_{i∈La} O_i    (6)

In the dichotomous cases, these sums equal the number of 1s in the upper and lower halves of the ordered and reduced data. For the general case, however, this re-operationalization is essential. Second, in the traditional formula for DI (see Eq. 3 and the related discussion), T refers to the total number of test-takers in the reduced data in the specific cut-off a, and therefore ½T_a in Eq. (4) refers to the number of test-takers in half of the cut-off a. However, in the general case, T does not refer to the number of cases but to the maximum possible sum minus the minimum possible sum of the observed values of the test-takers in the specific cut-off a:

T_a = max[Σ_{i∈a} O_i] − min[Σ_{i∈a} O_i]    (7)

where max[.] refers to the maximum possible value and min[.] refers to the minimum possible value. This definition is obvious when generalizing DI to polytomous items where the minimum value of the item scale is something other than zero, such as in a Likert scale anchored to the values 1 to 5. In the general case, the maximum possible value is the same for all test-takers, and it is the maximum value in the item g:

max(O_i) = max(g)    (8)

In parallel, the minimum possible value is the same for all individual test-takers, and it is the minimum value in the item:

min(O_i) = min(g)    (9)

Because of (8) and (9),

max[Σ_{i∈Ua} O_i] = a × max(g)    (10)

where a refers to the number of cases in half of the reduced dataset. In parallel,

min[Σ_{i∈La} O_i] = a × min(g)    (11)

Thus, because of (7), (10), and (11),

½T_a = a × (max(g) − min(g))    (12)

where a refers to the number of observations in the half of the specific cut-off a and max(g) − min(g) is the range of the values in the scale of the item. Because of (4), (5), (6), and (12), GDI in a specific cut-off a is

GDI_a = (Σ_{i∈Ua} O_i − Σ_{i∈La} O_i) / (a × C)    (13)

where the constant C = max(g) − min(g) is the range of the values in the scale of the item.
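Eq. (13) translates directly into code; the function name and the stable-sort tie handling below are illustrative choices:

```python
# Direct implementation of GDI at a symmetric cut-off a.
def gdi(item, score, a):
    """item: observed values O_i; score: total score; a: cases in each tail."""
    order = sorted(range(len(score)), key=lambda i: score[i])
    C = max(item) - min(item)                  # range of the item scale
    upper = sum(item[i] for i in order[-a:])   # sum of O_i in U_a
    lower = sum(item[i] for i in order[:a])    # sum of O_i in L_a
    return (upper - lower) / (a * C)

# A 1-5 Likert-type item that follows the score deterministically.
print(gdi([1, 1, 3, 5, 5], [2, 4, 6, 8, 10], a=2))  # -> 1.0
```

Note that with a binary item, C = 1 and the formula collapses to the traditional DI.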

Numerical example of calculating GDI with polytomous items
As a numerical example of calculating GDI, assume a polytomous dataset with N = 25 cases as in Table 2a. The dataset is from Cox (1974, p. 177) and Drasgow (1986, p. 70), originally without a connection to item analysis. However, let us assume that the dataset relates to an item g and the score X. As benchmarks, some other estimators are referred to here in relation to Table 2a. The estimates of the observed association between the item and the score, based on the mechanics of Pearson's product-moment correlation, are Rit = 0.185 by the item-total correlation coefficient and, after correction for the inflation, Rir = 0.139 by the item-rest correlation coefficient; the inferred association is estimated by the polyserial correlation coefficient (ρ_PC). From Table 2a, the estimates by GDI, both 0.167 and 0.143, are in the same range as those by Rit, Rir, and ρ_PC. In any case, the discrimination power of the item is low: it cannot efficiently differentiate between the higher and lower scoring test-takers. Traditionally, this item would be considered one of those to be omitted from the final compilation of the items.

An alternative way of computing GDI
It may be an obvious fact, though worth formalizing, that in each cut-off a+1 after the previous cut-off a, the next value of GDI is determined by the values of the next pair of individual test-takers in the upper half (O_U(a+1)) and the lower half (O_L(a+1)) of the ordered dataset. Because of (5) and (6),

Σ_{i∈U(a+1)} O_i = Σ_{i∈Ua} O_i + O_U(a+1)    (17)

and

Σ_{i∈L(a+1)} O_i = Σ_{i∈La} O_i + O_L(a+1)    (18)

Then, a new concept, the characteristic vector of the item, D, is introduced. When the values of the i-th pair of test-takers are subtracted, the difference is symbolized by D_i:

D_i = O_Ui − O_Li    (19)

The vector D consists of these differences in all cut-offs i. The total number of these cut-offs is ½N; if N is odd, the median case is omitted from the analysis and, then, obviously, the number of cut-offs is ½(N−1). For simplicity, let us assume that D has ½N elements:

D = (D_1, D_2, ..., D_i, ..., D_{½N})    (20)

The later computational form of GDI uses the sum of the elements in D from the extreme cut-off (i = 1) to the particularly interesting cut-off i = a:

Σ_{i=1}^{a} D_i = Σ_{i∈Ua} O_i − Σ_{i∈La} O_i    (21)

Because of (17), (18), (19), and (20), Eq. (21) can be used in forming an alternative way of computing GDI at any cut-off a. Namely, according to (13), (21), and (14), the value of GDI in the specific cut-off a is

GDI_a = (Σ_{i=1}^{a} D_i) / (a × C)    (22)

where a is the number of test-takers in the half of the cut-off and C is a constant referring to the range in the scale of the item.
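A minimal numerical check, using made-up polytomous data on a 1–5 scale (not the article's Table 2a), that the cumulative sum of the characteristic vector D reproduces the direct form of GDI:

```python
# Observed values O_i, test-takers already ordered by the score.
item = [2, 1, 3, 2, 4, 5, 3, 5]
N = len(item)
C = max(item) - min(item)                  # range of the scale: 4

# Characteristic vector D: pairwise differences from the extremes inward.
D = [item[N - 1 - i] - item[i] for i in range(N // 2)]

a = 3
gdi_direct = (sum(item[-a:]) - sum(item[:a])) / (a * C)   # direct form
gdi_via_D = sum(D[:a]) / (a * C)                          # via vector D
assert gdi_direct == gdi_via_D
```

The advantage of the D-based form is that, once D is computed, the GDI of every symmetric cut-off is available from a single running sum.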
Obviously, the estimates are the same as in Eqs. (15) and (16). Again, we have two options for the estimate of item discrimination: 0.167, related to the 24% cut-off, or 0.143, related to the 28% cut-off. Either way, the discrimination power of the item is low: it cannot differentiate between the higher and lower scoring test-takers. The interpretation of the value of GDI is the same as that of the traditional DI; the same benchmarks for low, mediocre, or high item discrimination can be used with both indices.

Further elaboration of GDI
Some further elaborations of GDI are discussed in what follows. These include the procedure of exhaustive splitting (PES) and a new way of illustrating the behavior of the item, called the cut-off curve (COC).

Exhaustive Splitting Procedure
A new tool for the further elaboration of GDI is the procedure for exhaustive splitting, already employed in Tables 2b and 3. PES is not a necessity in the actual calculation of GDI, though it may offer a possibility for more effective computation and more refined item analysis. In manual calculation, the formulae can be used without exhaustive splitting. In PES, instead of using only one fixed cut-off (25% or 27%), all possible cut-offs can be used in the item analysis. PES is as follows:
1. Take the highest and the lowest observation from the sorted data and calculate GDI.
2. Save the discrimination result from this calculation.
3. Take the two highest and the two lowest observations from the sorted data, calculate GDI as in step 1, and save the result.
4. Repeat the previous steps, increasing the number of observations and gradually building up to ½N = 50% of the observations at both extremes. When there is an odd number of cases, the median case is left outside of the procedure.
A table or graph of the results can be made, and this may be helpful in visualizing the characteristics of the items. In what follows, some relevant graphs are introduced as the discussion turns to the characteristics of the GDI. It is worth noting that PES is not restricted to DI or GDI; Metsämuuronen (2017), for example, used the same idea in illustrating the differences between the underestimation in item-test correlation and item-rest correlation in comparison with DI.
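The steps above can be sketched as a single loop over all symmetric cut-offs (the function name is illustrative):

```python
# PES: GDI at every symmetric cut-off a = 1, ..., N//2.
def pes(item, score):
    order = sorted(range(len(score)), key=lambda i: score[i])
    vals = [item[i] for i in order]          # observed values, low to high
    C = max(item) - min(item)
    return [(sum(vals[-a:]) - sum(vals[:a])) / (a * C)
            for a in range(1, len(vals) // 2 + 1)]   # median dropped if N odd

# A Guttman-patterned binary item: four 0s followed by fourteen 1s (N = 18).
curve = pes([0] * 4 + [1] * 14, list(range(18)))
print([round(v, 2) for v in curve])
# -> [1.0, 1.0, 1.0, 1.0, 0.8, 0.67, 0.57, 0.5, 0.44]
```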

Cut-off curve
Though visualization is not a necessity in understanding the concept of GDI, the approach to GDI hereafter is easier to adopt with the assistance of graphical demonstrations. The concept of the 'cut-off curve' (COC) is therefore introduced. Though the concept is not restricted to dichotomous datasets, for the sake of simplicity, dichotomous items are used as examples. Based on PES, we can form a graphical illustration of the values. This graph is called the cut-off curve (COC), though no actual "curve" exists because the distribution of the estimates is not a continuous one. As a preliminary introduction to graphical item analysis with COC, we assume a Guttman-patterned easy item of N = 18 test-takers ordered by the (unseen) score. From the lowest to the highest test-taker, the string is 000011111│111111111; the middle point is marked by a bar. Using the procedure of exhaustive splitting, there are ½N = 9 possible symmetric extreme cut-offs, as shown in Table 4 and Figure 2.

Figure 2. Cut-off curve for a Guttman-patterned item with N = 18 and 4 zeros as a function of the cut-off a
In COC, the threshold point of a Guttman-patterned item is seen as the point where the ultimate discrimination (GDI = 1) changes dramatically and becomes smaller. In the Guttman-patterned case, this equals the length of the shorter of the extreme strings of 0s and 1s; in Figure 2, a = 4 is the threshold for the curve. It may be worth noting again that even though it is possible to connect the discrete points together, no actual continuous curve exists in Figure 2. However, connecting the points visualizes the concept of the COC as a "curve". Another simple example of COC, with a non-Guttman-type pattern, illustrates the pattern of COC that leads us to the more practical question of how the value of GDI is determined by, and dependent on, the underlying Guttman pattern. Assume a dataset as in Table 5 with five Guttman-patterned items with different difficulty levels (items 1 to 5) and one non-Guttman-patterned item (item 6). The string of the non-Guttman-patterned item is 00100|01111 after ordering the test-takers by the (unseen) score. After the exhaustive splitting, we get 5 cut-offs for each item. In item 6, the two extreme cut-offs show perfect discrimination, after which the magnitude of GDI starts to get lower, though not as regularly as in Figure 2. The COCs are illustrated in Figure 3. The moves of the COC of item 6 are elaborated in what follows.
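The COC values of item 6 can be reproduced with a minimal sketch, again assuming dichotomous items ordered by score and symmetric extreme cut-offs; the helper name is illustrative, not from the article.

```python
def di_at_cutoff(item, a):
    """Difference in correct answers between the top and bottom a
    cases, divided by a (dichotomous 0/1 items, ordered by score)."""
    return (sum(item[-a:]) - sum(item[:a])) / a

# Item 6: the string 00100|01111, from the lowest to the highest scorer
item6 = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
coc = [round(di_at_cutoff(item6, a), 3) for a in range(1, 6)]
print(coc)   # -> [1.0, 1.0, 0.667, 0.75, 0.6]
```

The output shows exactly the behavior described in the text: perfect discrimination at the two extreme cut-offs, after which the values decrease non-monotonically.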

A notable observation from Figure 3 is that the COC of a non-Guttman-patterned (real-world) item strictly follows the underlying grid formed by the (underlying) Guttman-patterned items. Another non-obvious note is that the COC of a single item detects the deviations from the Guttman pattern in the dataset as a shift between the COCs of the underlying Guttman-patterned items in the consecutive cut-offs a and a + 1. At each point a, the COC has only limited options for where to go because, at each cut-off a + 1, the values of GDI are determined by the values at the previous cut-off a. This determination can be visualized by using COC. These matters are formalized in what follows.

Determination of the value of GDI and the moves in COC
From the practical viewpoint related to binary items, when Da+1 = 1, the path in COC moves forward to the next underlying curve of a Guttman-patterned item (Figure 4). If the result is Da+1 = 0, the next step stays on the same underlying curve as the previous point Da (i.e., no change in the underlying curve of a Guttman-patterned item). When the result is the (theoretically pathological) Da+1 = −1, the path leads to the previous underlying curve (i.e., the path goes to the next cut-off but backwards to the curve of the previous Guttman-patterned item).
Of the three possible outcomes of the term Da+1 in Eq. (25), the option −1 reflects a pathological situation where the lower-scoring test-taker shows a higher response in the item than the higher-scoring counterpart. In achievement testing this means that a lower-scoring test-taker gives, by mistake or by lucky guessing, a correct answer while a higher-scoring test-taker gives, by carelessness or sleepiness, an incorrect answer that is unexpected given the ability level (see the verbal descriptions in Linacre & Wright, 1994). The option +1 refers to the expected outcome that the higher-scoring test-taker gives a correct answer while the lower-scoring test-taker gives an incorrect answer. The option 0 arises when the test-takers in both halves give the same value, either a correct or an incorrect one. All in all, in the dichotomous case, the value of GDIa+1 has one of three fixed options determined by these outcomes. In the general case, the negative values represent options reflecting illogical and pathological behavior in the dataset that may lead to negative item discrimination if there are many of them within one item. All in all, the negative values in the characteristic vector D in Eq. (19) are strictly indicative of pathological cases. This matter is elaborated in the next section.
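A minimal sketch of the characteristic vector for a dichotomous item follows. Here element i of D is assumed to be the difference between the i-th observation from the top and the i-th from the bottom of the ordered data, so each element is +1, 0, or −1, and GDI at cut-off a is the mean of the first a elements; this pairing is reconstructed from the surrounding text, not quoted from the article's Eq. (19).

```python
def characteristic_vector(item):
    """Element-wise differences between paired extreme observations
    of an ordered 0/1 item; each element is +1, 0 or -1."""
    n = len(item)
    return [item[n - 1 - i] - item[i] for i in range(n // 2)]

def gdi_from_vector(d, a):
    """GDI at cut-off a as the mean of the first a elements of D."""
    return sum(d[:a]) / a

# Item 6 (00100|01111) contains no pathological pairs ...
print(characteristic_vector([0, 0, 1, 0, 0, 0, 1, 1, 1, 1]))  # [1, 1, 0, 1, 0]
# ... while this hypothetical string contains one (the -1 element).
print(characteristic_vector([0, 0, 1, 1, 0, 1, 0, 1, 1, 1]))  # [1, 1, 0, -1, 1]
```

As a consistency check, the mean of the first three elements of item 6's vector, (1 + 1 + 0)/3 ≈ 0.667, matches the COC value at a = 3.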

Pathological patterns in the visual item analysis with COC
The pathological cases characterized by negative elements (Di = −1) in the characteristic vector D can be detected easily by using PES, and they can be seen in the COCs. The frequency of the elements Di = −1 in the characteristic vector D directly indicates the number of pathological pairs of test-takers in an item, as discussed above. When the number of these negative pairs is higher than that of the positive pairs, the value of GDI turns (pathologically) negative: higher-scoring test-takers appear more likely to give the wrong answer in the item than the lower-scoring test-takers. The examples given above were based on rather small and theoretical datasets; it is easy to illustrate the graphs when the number of cases is small. However, the exhaustive splitting procedure and cut-off curves are not restricted to small datasets. As an example of a larger dataset, a random sample of 200 real-world test-takers of a test of the national assessment of mathematics in Finland (FINEEC, 2018) is used as a basis for the illustration. Figure 5 illustrates this kind of COCs for two items with 100 cut-offs. The underlying Guttman-patterned items are shown as lighter lines. The pathological cases within the process are the rare cases where the COC moves to the previous Guttman-patterned latent curve (cf. Figure 4).

Using PES and COC in the detection of plausible and stable values for GDI
Figure 5 and the underlying exhaustive splitting procedure raise a natural question: how stable and plausible is our point estimate of IDP if we use only a single cut-off? In practical terms, if the estimate of IDP at the point a = 25% were GDI25% = 0.167 and at the point a = 27% GDI27% = 0.143, which of these would be the more credible estimate, and why? Could we find a better or more credible estimate? Or an estimate of variance or standard deviation for the point estimate?
Would a confidence interval of the estimate enrich our decision making in the analysis of item behaviour by using GDI? Some initial ideas are discussed here, though no final conclusion is reached. We recall that the idea in item discrimination is to answer how well the item can discriminate the higher-scoring and lower-scoring test-takers from each other. Traditional DI compares the item behaviour of the test-takers in the highest quartile to that in the lowest quartile and gives an estimate of the general behaviour of the item on this basis. On the other hand, by using PES, we would know all the possible estimates related to the same dataset. How could we utilize this kind of information in assessing the discrimination power of the item? Let us draw the COC based on Tables 2b and 3 (Figure 6). By using PES and COC, we can assess how stable the estimate at 25% or 27% is, or whether some other cut-off shows evidence of a more credible estimate of IDP than what is signalled at the cut-offs of 25% or 27%. Graphical diagnosis of Figure 6 tells us that we would not find notably more credible estimates of the item discrimination by GDI at any cut-off in comparison with the traditional cut-offs. The estimate is very stable between the 12% and 48% cut-offs, and the magnitude remains below 0.20 at all cut-offs.

Figure 6. Stability of the estimates by GDI related to Table 2a

Another example of the use of graphical diagnostics with COC, based on a larger sample (Figure 7), gives a clue that, when using the 27% cut-off with the difficult item (p = 0.225), the estimate for IDP is GDI27% = 0.35, and this seems to be a fairly stable estimate. Just by using the graphical possibilities and intuitive heuristics, we may conclude that the value is quite stable between the 10% and 30% cut-offs, ranging from 0.32 to 0.40.
The other item in Figure 7, the very easy one (p = 0.965), is less discriminating (GDI27% = 0.15) and, more crucially, its value ranges from 0.13 to 0.30 between the 10% and 30% cut-offs, a range two times wider than that of the difficult item. By using the values of the estimates from PES, we could easily compute the average of the estimates and, hence, the variance of the estimates and, consequently, the standard errors and confidence intervals for the estimates. However, this matter is not formalized in this article, nor are any boundaries suggested, though we note that PES offers possibilities to develop such tools for IDP.
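The descriptive statistics alluded to above can be sketched directly from the PES estimates. The sketch below computes the mean and standard deviation of the estimates within a band of cut-offs (here the 10%–30% band used in the discussion of Figure 7); the band boundaries and helper names are illustrative assumptions, not the article's formalization.

```python
import statistics

def di_at_cutoff(item, a):
    # symmetric cut-off of a cases at both extremes (dichotomous item)
    return (sum(item[-a:]) - sum(item[:a])) / a

def pes_band(item, lo=0.10, hi=0.30):
    """GDI estimates for all cut-offs between lo*N and hi*N cases."""
    n = len(item)
    return [di_at_cutoff(item, a)
            for a in range(max(1, int(lo * n)), int(hi * n) + 1)]

# Summary statistics of the PES estimates for item 6 within the band
estimates = pes_band([0, 0, 1, 0, 0, 0, 1, 1, 1, 1])
print(round(statistics.mean(estimates), 3),
      round(statistics.pstdev(estimates), 3))
```

Such summaries could serve as raw material for the standard errors and confidence intervals mentioned in the text, though, as noted, no such procedure is formalized here.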

Results in a nutshell and related discussion
This article has discussed Kelley's DI as a simple nonparametric short-cut method for roughly estimating item discrimination power in practical testing settings. Unlike Pearson correlation, DI can reach the ultimate values ±1 accurately when needed if the item difficulty is of medium level (roughly 0.25 < p < 0.75), and it is a more stable index in comparison with Rit. Although there would be better options from both the underestimation and instability viewpoints, such as Somers' D, which can reach the ultimate value also with extremely easy or difficult items, the advantage of DI is its computational simplicity, which makes it applicable in practical testing settings in schools and other applied areas where sophisticated tools for item analysis are not in use. This article generalizes DI to allow polytomous item scales in the analysis. The generalized DI (GDI) can also be applied more widely, for example, in analyzing attitude scales or graded items in achievement tests.
The generalization required new operationalizations of the traditional elements of DI: RU, RL and T. These allow us not only to use polytomous item scales but also varying cut-offs in the item analysis. Hence, the name "generalized DI", GDI, seems relevant. A new computational method based on the concept of the characteristic vector of the item is initiated for computer-based analysis; the classical way of calculating the value of GDI is still valid for manual calculation. Additionally, a new method of visualizing item analysis results, the cut-off curve (COC), with the related procedure of exhaustive splitting (PES), is initiated in the article. The former can be used in graphical analysis of the item, in detecting the pathological cases in the dataset, as well as in assessing the plausibility of the obtained point estimate of GDI. The latter application is not, however, elaborated or formalized in this article.
All in all, GDI has some positive characteristics: (1) it is easy to calculate, even manually; (2) unlike the coefficients based on the Pearson product-moment correlation coefficient (Rit and Rir), it is robust against small changes in the dataset; and (3) it can detect deterministic patterns and reaches the values GDI = ±1 correctly. Hence, (4) GDI does not necessarily underestimate the association between the item and the score in the obvious and mechanical manner that Rit and Rir do when the scales of the item and the score are not equal. An underlying discussion relevant to both Kelley's DI and GDI is what the fixed cut-off for the index should be, or whether there should be one at all. The standard way of using a fixed cut-off for DI (usually 25% or 27% of the extreme test-takers) can, in most cases, be quite a good approximation of the real item discrimination, even though, in practice, it appears to underestimate item discrimination in items with an extreme difficulty level. On the other hand, PES encourages us to consider whether some cut-off other than the traditional fixed one should be chosen. In the theoretical Guttman-patterned case of a deterministically discriminating data structure, it would be an economical option to choose the cut-off that indicates the threshold point of the item, i.e., the cut-off indicated by the shorter of the extreme strings of 1s or 0s in the ordered data. This also opens a pathway toward the possibility of identifying a unique optimal cut-off in all real-world items, as exists in theoretical Guttman-patterned items.

Limitations of the coefficients and the study and further suggestions
The main deficiency related to both DI and GDI is that they tend to underestimate the item discrimination power if the item difficulty is extreme and the cut-off is selected rigidly. Practical users of DI and GDI should be aware of this ill-behavior with items of extreme difficulty level. However, the coefficients may serve as useful short-cut methods in practical testing settings for evaluating the overall discrimination power of the items, or for deciding whether some item should be omitted from the compilation as a non-discriminative one. Wider simulations of this characteristic of GDI and DI would be beneficial.
We may also note a practical challenge in both DI and GDI related to tied cases, which was not handled in this article: the exhaustive splitting procedure and the idea of the vector D are based on the assumption that the test-takers can be ranked in a uniquely unambiguous manner. However, this is not possible when there are ties in the score because, within the tied cases, we do not know the actual order of the test-takers without some rationale. Some options were discussed within the text: (1) to double-order the test-takers, first by the score and, second, by the items; (2) to give an unambiguous and unique rank order by trusting randomization in the ordering; (3) to include all the test-takers with the same score in a bin and to use dynamic cut-offs; and (4) to develop a new coefficient based on non-symmetric cut-offs with polytomous items (cf. Brennan, 1972).
The procedure and results presented in this article raise several questions and ideas for further study. These include the comparison of traditional classical indices of item discrimination using the exhaustive splitting procedure, the possibility of locating latent threshold points in real-world items by employing Guttman-patterned items within classical test theory, possible models for continuous cut-off functions that would allow GDI to handle continuous variables, and asymmetric versions of the cut-off approach for polytomous variables (cf. Brennan, 1972). Further simulations concerning the statistical properties of GDI would be beneficial; some analyses of DI have been conducted already (see Bazaldua et al., 2017; Tristan, 1998; Kelley et al., 2002). Specifically, the comparison with Somers' D or the Goodman–Kruskal gamma may be of interest because these are all based on the order of the test-takers rather than the covariance between the item and the score.