3 Fairness in Testing

The United States has a diverse population of children in grades K-2, and assessment developers should attend to and measure how equitably and accurately children’s skills are estimated across populations. The Multitudes suite of measures was built with fairness as a core design and development principle. In 2014, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 2014) added a chapter dedicated to fairness in testing. The chapter presented four views of fairness: (a) lack of bias, (b) equitable treatment in the testing process, (c) equality in outcomes of testing, and (d) opportunity to learn. The Multitudes development team was committed to integrating these concepts of fairness into all aspects of the assessment development process. The following subsections describe how we attended to each.

3.1 Lack of Bias

The large corpora of words and images selected for each measure were reviewed by an internal diversity and equity committee as well as teachers and administrators from districts across the state. We removed or revised items that lacked broad cultural resonance or that were found to be linguistically inappropriate. We intentionally included broad racial and ethnic representation in the images selected for the measures. Importantly, we also analyzed classification accuracy across populations to check for bias in identification of risk (see the section titled “Universal Screening”).

3.2 Equitable Treatment in the Testing Process

The Multitudes team developed an extensive administration manual that outlines administration and scoring directions for each measure to support standardization. Student-facing directions for each measure are recorded and embedded in the assessment platform to support standardized administration of the tests. Feedback is provided via the platform to encourage engagement. Receptive measures (e.g., Elision Receptive, Listening Comprehension, Spelling, Semantic Mapping) are scored automatically through the platform, which reduces testing bias and reliance on administrator interpretation.

We provide significant guidance for how to score verbal responses provided in African American/Black English (AAE) to reduce the likelihood that Black children will be unfairly evaluated. AAE follows regular phonological, semantic, and morpho-syntactical patterns specified and included in our scoring schemes (Washington and Seidenberg 2021). For more, please see the UCSF Multitudes Administrator Guides for Language Variation.

The administration manual also provides comprehensive guidance about how to score multilingual children’s expressive responses and, if administering measures in English, describes how English language production might be influenced by other language(s) to which a child is exposed. Any child with emerging English proficiency may speak English in ways influenced by their heritage language; some sounds, words, and grammatical elements might transfer, and some might result in cross-linguistic interference. Specific recommendations are provided for how to score Arabic-, Mandarin-, Tagalog-, and Vietnamese-influenced English. Likewise, a child’s Spanish language production will reflect regional dialects. Regional variation is natural and expected, and responses should be scored accordingly. For example, many items in the Spanish expressive vocabulary measure have multiple correct responses, because Spanish speakers from different countries, or even from different areas of the same country, may use different words. For more, please see the UCSF Multitudes Administrator Guides for Language Variation.

All children deserve to be comfortable and supported during the assessment process. The administration manual gives considerable attention to creating a testing environment and administrator-child interactions that respect the child’s developmental level and need for security. The manual provides specific recommendations for warm-up activities that can be incorporated into the testing process to build rapport, verbal prompts that can promote a child’s engagement, and guidance about what to do if a child expresses distress or disinterest.

3.3 Equality in Outcomes of Testing

Testing historically marginalized children and emergent bilinguals with assessments that were not designed to accurately reflect their abilities is a concern (Randall et al. 2023; Solano-Flores 2023). The accuracy of universal screening identification was compared across key groups to ensure that these populations would not be over- or under-identified (see the section titled “Universal Screening”). UCSF Multitudes was designed to elevate the skills and abilities that these children bring to the classroom through an asset-based lens and a development process that centered on the lived experiences of these communities and the linguistic assets they bring in their native language(s).
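To make the idea of comparing identification accuracy across groups concrete, the following is a minimal sketch in Python. It is not the analysis reported in the “Universal Screening” section. The function name, the record layout (group label, whether the screener flagged the child, whether the child was later found to have a reading difficulty), and the toy data are all assumptions made for illustration only. Sensitivity is the share of children with a difficulty whom the screener flagged; specificity is the share of children without a difficulty whom it did not flag.

```python
from collections import defaultdict


def classification_accuracy_by_group(records):
    """Return sensitivity and specificity for each group.

    `records` is an iterable of (group, flagged, has_difficulty) tuples.
    This layout is a hypothetical one chosen for this sketch.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, flagged, has_difficulty in records:
        if flagged and has_difficulty:
            key = "tp"  # correctly identified as at risk
        elif flagged:
            key = "fp"  # flagged, but no difficulty (over-identification)
        elif has_difficulty:
            key = "fn"  # missed (under-identification)
        else:
            key = "tn"  # correctly not flagged
        counts[group][key] += 1

    results = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        results[group] = {
            # None marks a rate that cannot be computed for this group.
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return results


# Made-up toy data, for illustration only (not Multitudes results).
toy_records = [
    ("group_a", True, True),
    ("group_a", False, True),
    ("group_a", False, False),
    ("group_a", False, False),
    ("group_b", True, True),
    ("group_b", True, True),
    ("group_b", True, False),
    ("group_b", False, False),
]

summary = classification_accuracy_by_group(toy_records)
for group, rates in sorted(summary.items()):
    print(group, rates)
```

In this toy data, the lower sensitivity for group_a would point to under-identification and the lower specificity for group_b would point to over-identification. A real analysis would need adequate sample sizes per group and interval estimates before drawing such conclusions.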

3.4 Opportunity to Learn

Although we cannot control for the varied educational experiences of all children who take the Multitudes screener, we can provide evidence-based resources to promote the delivery of effective instruction. There is a critical need for professional development, coaching, and ongoing technical support to ensure that the performance of each child is accurately and fairly interpreted and that the next steps taken are appropriate and effective. As such, the Multitudes platform includes instructional recommendations and professional development opportunities related to administration and instruction. Instructional recommendations come with a menu of student learning activities in English and Spanish aligned with reading development goals identified through screening.

In summary, fairness in assessment is now considered integral to arguments about validity (Sireci and Randall 2021; Solano-Flores 2023). The rich diversity in the United States requires a measurement approach that is culturally and linguistically responsive and aims to reduce bias when estimating children’s abilities. Importantly, absolute fairness in assessment is impossible to achieve, because no measurement instrument demonstrates perfect reliability and any validity judgment is a matter of context and degree. Given this reality, Multitudes adopts a growth mindset, with plans for ongoing research to continually improve fairness and accuracy in reading assessment, in service of its mission and vision of contributing to enhanced equity in educational outcomes across all populations.