28 Approach to Screening Model Building and Evaluation

28.1 Logistic Regression with LOGO Cross-validation

We opted for logistic regression models due to their interpretability, ease of implementation, and computational efficiency. We built a separate prediction model for each grade in each language. All models have the following general form:

\[ \Pr(Y_i = 1) = \sigma\big( \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_n X_{in} \big) \] where \(\Pr(Y_i = 1)\) denotes the probability of student \(i\) experiencing reading difficulty, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_n\) denote the coefficients (weights) of student \(i\)’s scores on the \(n\) different grade-specific screening tasks (i.e., \(X_{i1}\), \(X_{i2}\), …, \(X_{in}\)), and \(\sigma(\cdot)\) is the standard logistic sigmoid function \(\sigma(x) = (1 + \exp(-x))^{-1}\). All models were implemented using the glm function from the stats package for R.
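As a rough illustration of the model form above, the following Python sketch computes the predicted probability for one student and enumerates cross-validation folds. It is not the R glm implementation used for the screener. The coefficient values, the scores, and the group labels are hypothetical, and the fold generator assumes that LOGO in the section title stands for leave-one-group-out, which the text does not spell out.

```python
import math


def sigmoid(x):
    """Standard logistic function (1 + exp(-x))^-1, written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_risk(intercept, weights, scores):
    """Pr(Y_i = 1) for one student: sigmoid(beta_0 + sum_j beta_j * X_ij)."""
    if len(weights) != len(scores):
        raise ValueError("need exactly one weight per task score")
    linear = intercept + sum(b * x for b, x in zip(weights, scores))
    return sigmoid(linear)


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per group.

    Assumption: each student belongs to exactly one group (for example a
    school or site); the grouping variable is not specified in the text.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical coefficients and task scores for a two-task model.
p = predict_risk(intercept=0.0, weights=[-1.0, 0.5], scores=[2.0, 4.0])

# Hypothetical group labels for four students.
folds = list(leave_one_group_out(["A", "B", "A", "C"]))
```

In a full workflow, a model would be refit on each fold's training indices and scored on the held-out group, so that every prediction comes from a model that never saw that student's group.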

28.2 Selection of Predictor Tasks

We selected tasks for inclusion in the prediction model (?tbl-screener-tasks) based on the following theoretical, empirical, and pragmatic criteria:

We chose to include tasks from different domains to increase coverage of the broad range of skills that contribute to reading in English and Spanish (i.e., language, phonological awareness, processing speed).
We selected tasks that fit the model and were found to have strong correlations with the outcome measures for both English-only and multilingual students.
We included measures in English and Spanish that are known to predict future reading difficulty and dyslexia in each language.
We minimized the time of administration of the overall battery.

28.3 Evaluating Screener Performance

To evaluate each model’s performance and classification accuracy, we computed, among other metrics, sensitivity and specificity (Table 29.1 to Table 29.4). We targeted commonly used thresholds of sensitivity > .80 and specificity > .70 (Johnson et al. 2009). We also provide receiver-operating characteristic curves and precision-recall curves for all models (Figure 29.1 and Figure 29.2). To assess the linguistic fairness of our English prediction model, we evaluated predictions not only for the entire sample, but also separately by English proficiency designation.
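The two headline metrics can be sketched in a few lines of Python. The example below is a minimal illustration, not the evaluation code behind the reported tables: the outcome labels, predicted probabilities, the 0.5 classification cutoff, and the subgroup labels "EO" and "ML" are all hypothetical. Only the targets (sensitivity > .80, specificity > .70) come from the text.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) when flagging students with p >= cutoff."""
    tp = fn = tn = fp = 0
    for y, p in zip(y_true, y_prob):
        flagged = p >= cutoff
        if y == 1:
            tp += flagged
            fn += not flagged
        else:
            fp += flagged
            tn += not flagged
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def metrics_by_subgroup(y_true, y_prob, subgroup, cutoff):
    """Compute the same metrics separately within each subgroup label."""
    results = {}
    for label in sorted(set(subgroup)):
        idx = [i for i, g in enumerate(subgroup) if g == label]
        results[label] = sensitivity_specificity(
            [y_true[i] for i in idx], [y_prob[i] for i in idx], cutoff
        )
    return results


def meets_targets(sens, spec, min_sens=0.80, min_spec=0.70):
    """Check the targets stated in the text: sensitivity > .80, specificity > .70."""
    return sens > min_sens and spec > min_spec


# Hypothetical data: 1 = reading difficulty, 0 = no difficulty.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
designation = ["EO", "EO", "ML", "EO", "ML", "ML", "EO", "ML"]  # hypothetical labels

overall = sensitivity_specificity(y_true, y_prob, cutoff=0.5)
per_group = metrics_by_subgroup(y_true, y_prob, designation, cutoff=0.5)
```

Comparing the per-subgroup values with the overall values is one simple way to see whether a model that meets the targets on the full sample also meets them within each designation.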

28.4 Other Notes

It is important to note that the English and Spanish universal screening batteries include the same measures as one another across all grades. This decision, based on data and implementation considerations, was made to ensure accuracy, fairness, and efficiency. Different measures for the two languages would have required different testing times and additional teacher training, and would have made performance comparisons more difficult for children who need to be screened in both languages. The four assessments included for each grade and each language require less than 10 minutes of testing.