Categorical Data
Analysis |
The
following statistical indices are output when the categorical data analysis
of Exametrika is used. |
l Threshold l Entropy l Biserial correlation coefficient (polyserial correlation
coefficient) l Tetrachoric correlation coefficient (polychoric
correlation coefficient) |
Threshold |
l
Dichotomous (binary) data First, a standard
normal distribution is assumed underlying the data. In such a assumption,
when the correct response rate for an item is 0.85, the threshold would be
-1.04, as shown in the left-hand side figure. The threshold is obtained when
the area above the threshold is equal to the correct response rate. The item
on the left-hand side can be interpreted as an item with correct responses by
the examinees, whose capability is -1.04 or above within the standard normal
distribution. The figure on the right-hand side, on the other hand,
represents an item with a correct response rate of 10%. In this case, the
threshold becomes 1.28, which is interpreted as an item that cannot have
correct responses unless a person has a high ability level of 1.28 or higher
within the standard normal distribution. |
l
Polytomous (categorical ordered) data If there are K
categories, there are K-1 thresholds. Let us assume that there is an
item with four ordered categories, when selection rates are 0.2, 0.4, 0.3,
and 0.1. In such an instance, the three thresholds are derived by dividing
the area of the standard normal distribution to obtain the selection rates
described above. |
Entropy |
Entropy is an index
like the variance for categorical data (qualitative data). When every
examinee’s choice concentrates on a specific category, entropy
approaches 0. The value of entropy increases, however, as the selections of
examinees become more dispersed (not concentrated). In general, no
information exists for items for which there is a concentration of category
choices, as it would be the same whether or not a question was asked. Very
small items of entropy are often inappropriate as an item. |
Biserial
correlation (polyserial correlation) |
l
Biserial
correlation (dichotomous variable × continuous variable) A correlation
coefficient can be estimated by maximizing the occurrence probability
(likelihood) after assuming a 2-variate standard normal distribution behind
dichotomous categorically ordered data and continuous data. The correlation
coefficient obtained in this manner is called the biserial correlation
coefficient. The diagram above indicates a trend, wherein a slight tendency exists for variables of the X-axis to be higher when the dichotomous data of the Y-axis is 1. A 2-variate standard normal distribution with a moderate correlation fits in the background of the 2-variate data in such instances. Furthermore, the threshold (marked with a red line) comes slightly under the peak of the 2-variate normal distribution, since the number of observations is slightly larger when the dichotomous variable of the Y-axis is 1. The 2-variate normal distribution with a negative correlation fits behind the dichotomous variables since the continuous variables tend to be larger when the variables on the Y-axis are 0 in the diagram above. The biserial correlation coefficient is therefore obtained from the perspective of a 2-variate normal distribution with how size of the correlation fits best with the data. A correlation that is more reasonable than calculating a Pearson correlation by assuming dichotomous data to be continuous is obtained in this way. |
l
Polyserial correlation (polytomous variable × continuous variable) The biserial
correlation extended to polytomous data is referred to as the polyserial
correlation. The Likert scale data often used in psychological questionnaires
and social surveys is, in the strict sense, a categorically ordered data with
inconsistent intervals. Therefore, the Likert data with very few categories,
such as three-point ratings and four-point ratings, are not considered as
continuous data but should be treated as categorically ordered data. |
Tetrachoric
correlation (polychoric correlation) |
l
Tetracoric
correlation (dichotomous variable × dichotomous variable) The scatter
plots shown on the top-left and bottom-left panels are of two dichotomous
variables. The correlation coefficient that maximizes the occurrence
probability (likelihood) by assuming a 2-variate standard normal distribution
behind these two variables is called the tetrachoric correlation. There is more
data when the dichotomous variable of the X-axis is 0 in the diagram above,
thus the threshold (red vertical line) along the X-axis is positioned in the
positive direction from the peak of the 2-variate normal distribution.
Furthermore, the threshold of the Y-axis (red horizontal line) is slightly to
the negative side of the peak, as there is more data when the dichotomous
variable of the Y-axis is 1. Furthermore, of the data that was divided into
four, the number of data for (1, 1) and (0, 0) is slightly greater than that
for (0, 1) and (1, 0); thus, the tetrachoric correlation is obtained to be
positive. Inversely, this
data is more likely to occur when the 2-variate normal distribution with a negative
correlation is set behind this data, since in the diagram above the number of
data for (1, 0) and (0, 1) is greater than the number of data for (1, 1) and
(0, 0). Therefore, in such situations, the tetrachoric correlation is
obtained as a negative value. |
l
Polychoric
correlation (polytomous variable × polytomous variable) The polychoric
correlation is an extension of the tetrachoric correlation to a correlation
between the polytomously categorical variables. Since the Likert scale data,
often seen with psychological questionnaires and social surveys, do not have
exactly equal intervals. When the number of categories is small, such as with
three-point or four-point ratings, a more reasonable correlation is obtained
by treating it as categorically ordered data, rather than computing a Pearson
correlation coefficient by considering it to be continuous data. |