7  Psychometric Approach

The Multitudes measures fall into two categories based on how they are administered, analyzed, and scored: computerized adaptive tests (CAT) and fixed-form measures. Below, we explain the process of developing and gathering reliability and validity evidence for each type of measure.

7.1 Computerized Adaptive Testing Measures

Ten measures were developed for computerized adaptive testing (CAT; Gershon 2005; Wainer et al. 2000; Weiss 1985) using Rasch modeling to calibrate item parameters: Elision Expressive, Elision Receptive, Expressive Vocabulary, Listening Comprehension, Nonword Reading, Nonword Repetition, Semantic Mapping, Sentence Repetition, Spelling, and Word Reading. These parameters drive the algorithms that select the items each child encounters on the test. CAT measures increase efficiency and precision by presenting children with items at their ability level and engaging them in testing only until their proficiency is established, expressed as a theta value. The applicable analyses for each measure are reported in the section titled “Measure Development.”

7.1.1 Rasch Scaling

Rasch scaling is a widely used mathematical method to model the measurement of an unobserved (latent) trait or ability (Andrich 1988). The Rasch model for dichotomous data is often regarded as an item response theory (IRT) model with one item parameter. Rather than being just another IRT model, however, it has a defining property that distinguishes it from other IRT models: the principle of invariant measurement. With invariant measurement, different subsets of items can be administered to different children and comparable, objective scale scores can still be obtained. When Rasch scaling is combined with computerized adaptive testing, the items administered are those most informative about the child’s proficiency level, and scale scores are reported in real time.
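In its dichotomous form, the Rasch model expresses the probability that child $p$ answers item $i$ correctly in terms of the child’s ability $\theta_p$ and the item’s difficulty $b_i$:

$$
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
$$

Because responses depend only on the difference $\theta_p - b_i$, ability estimates do not hinge on which particular subset of items a child happens to receive, which is what makes the invariance property described below possible.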

Rasch scaling has four basic assumptions that must be met (Andrich 1988):

  1. Unidimensionality. The assumption that only one latent dimension determines how a child responds to a given item, namely the child’s ability or skill level.
  2. Monotonicity. Higher proficiency levels translate to a higher probability of responding correctly to an item.
  3. Local Independence. A child’s response to one item is conditionally independent of their responses to all other items after accounting for the latent dimension that the items measure.
  4. Invariance. Item parameters can be obtained from any group of children, regardless of where they fall on the ability scale.

Rasch scaling supports the construction of unitary scales across grades, enabling more accurate and consistent comparisons across children and grades on the same construct. In choosing Rasch scaling as our statistical model, we focused on developing items that are evidence-based representations of each unitary construct (Wilson 2023).

7.1.2 Data Collection Design

Different grade-level forms were constructed after creating the initial pool for each task. For each grade level, we constructed six forms. Each form included a set of unique items and a set of anchor items, with anchor items making up 20% to 30% of the items on any form. Anchor items were purposefully spread across forms to create sufficient overlap within a grade as well as across grade levels. Each form was administered to at least 100 children. This form construction and data collection design allowed us to calibrate item parameters for a large number of items across grades.
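As an illustration of how this linking design translates into calibration data, the R sketch below builds a sparse response matrix in which each child answers only the items on their assigned form, with anchor items shared across forms. The form counts, item counts, and generating values are hypothetical placeholders, not the actual Multitudes specifications.

```r
# Hypothetical linking design: 3 forms, each with 8 unique items plus 3 shared
# anchor items (anchors are roughly 27% of the 11 items on a form).
set.seed(1)
n_forms    <- 3
n_unique   <- 8
anchors    <- paste0("A", 1:3)                       # anchor items shared by all forms
form_items <- lapply(1:n_forms, function(f) {
  c(paste0("F", f, "_", 1:n_unique), anchors)        # unique items plus anchors
})
all_items  <- unique(unlist(form_items))

# Placeholder generating values so the sketch runs end to end
true_b     <- setNames(rnorm(length(all_items)), all_items)  # item difficulties
n_per_form <- 100
true_theta <- rnorm(n_forms * n_per_form)                    # child abilities

# Each child answers only the items on their assigned form; all other cells stay NA,
# so the response matrix is sparse but linked across forms through the anchors.
resp <- matrix(NA, nrow = n_forms * n_per_form, ncol = length(all_items),
               dimnames = list(NULL, all_items))
for (f in 1:n_forms) {
  rows <- ((f - 1) * n_per_form + 1):(f * n_per_form)
  for (item in form_items[[f]]) {
    p <- plogis(true_theta[rows] - true_b[item])             # Rasch response probabilities
    resp[rows, item] <- rbinom(length(rows), 1, p)
  }
}
```

Because the anchors appear on every form, the unique items from different forms can be placed on a common scale when this matrix is calibrated jointly, as sketched in the next subsection.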

7.1.3 Classical Item Analysis and Item Parameter Calibration

A classical item analysis was conducted before Rasch scaling to identify problematic items. Basic item statistics were computed for each item, such as the proportion of correct responses and point-biserial correlations. Items with point-biserial correlations lower than 0.20 were flagged and removed from the data before Rasch scaling. The Rasch model was fitted using the TAM package for R, with marginal maximum likelihood (MML) estimation of item parameters. After estimating item parameters, we estimated person parameters using maximum likelihood estimation (MLE). In addition, we generated a Wright Map to examine the alignment between the difficulty of items and the distribution of children’s proficiency. We also calculated the residuals for each person on each item based on the estimated model parameters.
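A minimal sketch of this pipeline, assuming the sparse response matrix `resp` built above, is shown below. The 0.20 point-biserial cutoff follows the text; TAM’s weighted likelihood routine (`tam.wle`) is used here as a convenient stand-in for the plain maximum likelihood person estimate, and the Wright Map step is omitted.

```r
library(TAM)

# --- Classical item analysis ----------------------------------------------
p_correct <- colMeans(resp, na.rm = TRUE)                       # proportion correct
pt_biserial <- sapply(seq_len(ncol(resp)), function(j) {
  total_minus_j <- rowSums(resp[, -j, drop = FALSE], na.rm = TRUE)
  cor(resp[, j], total_minus_j, use = "pairwise.complete.obs")  # corrected item-total r
})
keep      <- pt_biserial >= 0.20                                # drop weak items
resp_kept <- resp[, keep, drop = FALSE]

# --- Rasch calibration via marginal maximum likelihood (MML) ---------------
mod   <- tam.mml(resp_kept)            # Rasch model; item difficulties in mod$xsi
b_hat <- mod$xsi$xsi

# --- Person estimates (WLE shown as a stand-in for the plain ML estimate) ---
theta_hat <- tam.wle(mod)$theta

# --- Person-by-item residuals: observed minus model-expected probability ----
p_expected <- outer(theta_hat, b_hat, function(th, b) plogis(th - b))
resid_mat  <- resp_kept - p_expected
```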

7.1.4 Assessing Unidimensionality

One way to evaluate the assumption of unidimensionality after fitting a Rasch model is to calculate the proportion of variance in responses attributed to the primary latent variable (G. Engelhard Jr. and Wang 2020; Linacre 2006). In the IRT literature, it is typical to use the variance accounted for by the latent factor as evidence of unidimensionality. For instance, Reckase (1979) argued that the first factor should account for at least 20% of the total variance to produce reasonable person parameter estimates. We adopted the approach described by Linacre (2006): we first found the variance associated with the observed responses (VO) and the variance associated with the residuals (VR) after fitting the model; the percent of variance explained by the Rasch model is then (VO − VR) / VO.
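Continuing the calibration sketch above, one reasonable reading of this computation pools the observed-response and residual variances across items:

```r
# Variance of observed responses (VO) and of person-by-item residuals (VR),
# pooled across items; (VO - VR) / VO is the share of variance explained
# by the Rasch model.
VO <- sum(apply(resp_kept, 2, var, na.rm = TRUE))
VR <- sum(apply(resid_mat, 2, var, na.rm = TRUE))
prop_explained <- (VO - VR) / VO
```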

7.1.5 Item Fit Statistics

Item fit can be categorized according to the framework suggested by G. Engelhard, Wang, and Wind (2018). According to this framework, items can be put into four different categories based on their Infit and Outfit Mean-Square values, as shown in Table 7.1.

Infit stands for inlier-sensitive or information-weighted fit; it is most sensitive to the pattern of responses to items targeted at the child’s ability level. Outfit, or outlier-sensitive fit, is most sensitive to responses to items whose difficulty is far from the child’s ability. Mean-square (MSE) fit statistics show the size of the randomness, i.e., the amount of distortion of the measurement system, with an expected value of 1. Values less than 1.0 indicate observations that are too predictable (redundancy; the data overfit the model). Values greater than 1.0 indicate unpredictability (unmodeled noise; the data underfit the model). In general, values near 1.0 indicate little distortion of the measurement system, regardless of the standardized value. We evaluated mean-squares well above 1.0 before mean-squares well below 1.0, because the average mean-square is usually forced to be near 1.0. Outfit problems are less of a threat to measurement than Infit issues and are easier to manage. To evaluate the fit statistics of each item, we used the thresholds outlined in Table 7.1. Items that fell into category D were removed from the final item pool.

Table 7.1: Infit-Outfit Mean-Square Values
Infit/Outfit MSE Value    Interpretation                                     Fit Category
0.5 < MSE < 1.5           Productive for measurement                         A
MSE < 0.5                 Less productive, but not distorting of measures    B
1.5 < MSE < 2.0           Unproductive, but not distorting of measures       C
MSE > 2.0                 Unproductive and distorting of measures            D
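As an illustration, infit and outfit mean-squares can be obtained from the fitted model and mapped onto the categories in Table 7.1. The sketch below continues the TAM example, assumes `tam.fit()`’s default output columns, and classifies each item on the larger of its two mean-squares, which is an illustrative choice rather than the documented rule.

```r
# Infit/Outfit mean-squares for each item (continues the TAM model `mod` above)
fit    <- tam.fit(mod)
infit  <- fit$itemfit$Infit
outfit <- fit$itemfit$Outfit

# Map a mean-square value onto the fit categories of Table 7.1
fit_category <- function(mse) {
  ifelse(mse > 2,   "D",                 # unproductive and distorting
  ifelse(mse > 1.5, "C",                 # unproductive, but not distorting
  ifelse(mse < 0.5, "B",                 # less productive, but not distorting
                    "A")))               # productive for measurement
}

item_fit <- data.frame(
  item     = colnames(resp_kept),
  infit    = infit,
  outfit   = outfit,
  # classifying on the larger of the two mean-squares is an illustrative choice
  category = fit_category(pmax(infit, outfit))
)
drop_items <- item_fit$item[item_fit$category == "D"]   # category D items are removed
```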

7.1.6 Ongoing Qualitative Item Review

In addition to evaluating item-level statistics, our team conducted a thorough review of items for cultural and linguistic relevance, representation, and appropriateness. This review was carried out by a multidisciplinary group of project members with varied cultural, linguistic, disciplinary, and lived experiences, who examined test items for potential concerns that could arise in the field or for specific student populations. We also held several focus groups with teachers and administrators to gather feedback on the test items. In addition, selected experts in the field reviewed images, wording, administration procedures, and scoring. During calibration studies, proctors recorded feedback in a shared spreadsheet while administering items, noting any concerns raised by children across the state. Proctors were also invited to internal executive meetings to share their experiences.

Our team carefully considered data from all of these sources. Items were removed not only based on poor statistical performance but also when they were found to be potentially offensive in certain communities, unfamiliar to some groups in ways that could introduce bias, ambiguous in meaning, or reinforcing of harmful stereotypes. These extensive procedures gave the team greater confidence that the final pool of items was both psychometrically robust and culturally and linguistically appropriate for use throughout California.

7.1.7 Computerized Adaptive Testing (CAT)

Computerized Adaptive Testing (CAT) aims to construct an optimal test for each child by estimating ability after each item administration and selecting subsequent items from an item pool based on the child’s estimated proficiency (Meijer and Nering 1999; Wainer et al. 2000). CAT has several advantages over paper-and-pencil tests, including greater efficiency and precision, with scores immediately available.

After calibrating item parameters for each task, we designed a CAT algorithm to administer these tasks. There are five important decisions to make when designing a CAT algorithm:

How to:

  • start a test session?
  • estimate proficiency during the test after every item administration?
  • select the next item?
  • stop the test?
  • estimate the proficiency at the end of the session once the test is terminated?

The number of options for each decision poses a challenge for an optimal design. Practical considerations (e.g., the number of items a kindergartener responds to without frustration) guided certain decisions. We also ran computer simulations to inform our approach. Below, we present a summary of parameters currently used in the CAT algorithm in the Multitudes platform. We continuously consider these design choices to improve our measurement.

  • Starting a CAT session. The algorithm randomly selects one of the 30 items that give the most information about average proficiency, meaning items whose difficulty levels are relatively close to the average ability level. The session starts by administering this randomly selected item.
  • Proficiency estimation during the test. The algorithm uses maximum a posteriori (MAP) estimation to update the proficiency estimate after every item administered during the test.
  • Next item selection. The algorithm uses Maximum Fisher Information (MFI) to select the next item based on the MAP proficiency estimate from the previous items.
  • Final proficiency estimate. The algorithm uses Maximum Likelihood Estimation (MLE) to obtain the final proficiency estimate after the test is terminated.
  • Terminating the test. The algorithm uses a fixed test length rule to stop the test. We ran extensive computer simulations to find an optimal test length for each test. The simulations were carried out using the catR package (Magis and Raîche 2012), which was revised to fix a seed usage issue (Cui 2020). True person parameters were generated for 1,000 simulees using the empirical PDF estimated from the distribution observed for each task. For each simulee, we simulated two independent CAT sessions, based on the true person parameter and the calibrated item parameters, at each test length from six to fifteen items. From each simulated CAT session, we saved the final person parameter estimate. Then, we calculated three outcomes:
  1. The correlation between the final person parameter estimates from the first CAT session and the true person parameter estimates.

  2. The correlation between the final person parameter estimates from the second CAT session and the true person parameter estimates.

  3. The correlation between the final person parameter estimates from the first CAT session and the final person parameter estimates from the second CAT session.

The correlations in (1) and (2) can be considered estimates of the validity coefficient, and the correlation in (3) can be considered an estimate of the test-retest reliability coefficient. From these simulations, we selected for each task the optimal test length at which the test-retest reliability coefficient reaches 0.8 and the validity coefficient reaches 0.9.
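The sketch below illustrates both the CAT design choices and the test-length simulation with the catR package, reusing the calibrated difficulties `b_hat` from the earlier sketch. The test length shown, the standard normal used for simulee abilities, and the use of catR’s randomesque option to approximate “choose at random among the 30 most informative starting items” are placeholders and approximations; in the actual simulations, abilities were drawn from the empirical distribution observed for each task.

```r
library(catR)

# Rasch item bank in catR's (a, b, c, d) layout: a = 1, c = 0, d = 1
bank <- cbind(a = 1, b = b_hat, c = 0, d = 1)

# One CAT session mirroring the design choices described above
run_cat <- function(true_theta, test_length) {
  randomCAT(
    trueTheta = true_theta, itemBank = bank,
    # start: one item chosen at random among the 30 most informative near average ability
    start = list(nrItems = 1, theta = 0, startSelect = "MFI", randomesque = 30),
    # during the test: MAP (Bayes modal, "BM" in catR) estimates and MFI item selection
    test  = list(method = "BM", itemSelect = "MFI"),
    # fixed test length stopping rule
    stop  = list(rule = "length", thr = test_length),
    # final proficiency estimate via maximum likelihood
    final = list(method = "ML")
  )$thFinal
}

# Two independent sessions per simulee at one candidate test length
# (true abilities drawn from a standard normal here as a placeholder)
set.seed(2024)
n_simulees  <- 1000
theta_true  <- rnorm(n_simulees)
test_length <- 10                                  # one of the candidate lengths, 6 to 15
est1 <- sapply(theta_true, run_cat, test_length = test_length)
est2 <- sapply(theta_true, run_cat, test_length = test_length)

validity_1  <- cor(est1, theta_true)               # outcome (1)
validity_2  <- cor(est2, theta_true)               # outcome (2)
test_retest <- cor(est1, est2)                     # outcome (3)
# Repeating this over lengths 6 to 15 locates the shortest test at which
# test-retest reliability reaches 0.8 and both validity coefficients reach 0.9.
```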

7.2 Fixed Form Measures

Seven measures are administered in standard, fixed forms and were developed using classical test theory: Digit Span, Letter Naming Fluency, Letter Sound Fluency, Narrative Story Production, Oral Reading Fluency, Rapid Automatized Naming Letters (RANL), and Rapid Automatized Naming Objects (RANO). Some are analyzed and scored using the raw number of correct responses (e.g., Digit Span). Some are timed, with the final score being the number of correct items divided by the total time taken to complete the task (e.g., RANL, RANO). Others follow task-specific scoring schemes. Basic descriptive statistics are provided for these measures, including the distribution of scores, the mean level of performance, and the standard deviation.
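As a simple illustration of these scoring rules (all values below are hypothetical placeholders), a raw-count score, a rate score, and the reported descriptive statistics can be computed as follows:

```r
# Hypothetical example values, for illustration only
responses_correct  <- c(1, 1, 0, 1, 1, 0, 1)    # item-level correctness for one child
n_correct          <- 40                        # correct items on a timed RAN task
total_time_seconds <- 55                        # total time taken to complete the task

digit_span_score <- sum(responses_correct)            # raw number of correct responses
ran_score        <- n_correct / total_time_seconds    # correct items per second

# Descriptive statistics reported for each fixed-form measure
scores <- rnorm(200, mean = 12, sd = 3)               # placeholder score vector
c(mean = mean(scores), sd = sd(scores))
hist(scores, main = "Distribution of scores")
```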