• Yoke Ong

Student thesis: PhD


When examinees with the same 'ability' take a test, they should have an equal chance of responding correctly to an item irrespective of group membership. This logic in assessment is known as measurement invariance. A lack of invariance in item, bundle, or test difficulty across subgroups indicates differential functioning (DF). The presence of DF means that the test scores of these examinees may fail to provide a valid measure of their performance. The aim of this study is to advance our understanding of DF by detecting, predicting and explaining the sources of DF by gender in a mathematics test. A framework for investigating DF is proposed, moving from the item level to a more complex random-item level; this progression provides a common theme of critiquing the limitations of DF methods and exploring some advances. A dataset from a high-stakes national mathematics examination taken by 11-year-olds in England was used in this study. The results are reported in three papers in journal publication format.

The first paper addressed the issue of understanding nonuniform differential item functioning (DIF) at the item level. Nonuniform DIF is investigated because it can go undetected: common DIF statistics that are sensitive to uniform DIF may indicate no significant DIF even when nonuniform DIF is present. This study distinguishes two types of nonuniform DIF, namely crossing and noncrossing DIF. Two commonly used DIF detection methods, the Logistic Regression (LR) procedure and the Rasch measurement model, were used to identify crossing and noncrossing DIF. This paper concludes that items with nonuniform DIF do exist in empirical data; hence there is a need to include statistics sensitive to crossing DIF in item analysis.

The second paper investigated the sources of DF via differential bundle functioning (DBF), because this approach may yield substantive explanations of DF, without which we do not know whether DF is 'valid' or 'biased'. Roussos and Stout's (1996a) multidimensionality-based DIF paradigm was used with an extension of the LR procedure to detect DBF. Three qualitatively different content areas were studied: test modality, curriculum domains and problem presentation. This paper concludes that DBF in curriculum domains may elicit construct-relevant variance, and so may indicate 'real' differences, whereas problem presentation and test modality arguably include construct-irrelevant variance and so may indicate gender bias.

Finally, the third paper considered item-person responses as hierarchically nested within items. A two-level logistic model was therefore used to model the random item effects, because it is argued that DF might otherwise be exaggerated and lead to invalid inferences. This paper aimed to explain DF via DBF by comparing single-level and two-level models. The DIF effects found in the single-level model were attenuated in the two-level model, and the paper discusses why the two models produced different results. Taken together, this thesis shows that validity arguments regarding bias should not be reduced to DF at the item level but can be analysed on three different levels.
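
For readers unfamiliar with the LR procedure named in the abstract, the following is a sketch of the formulation commonly given in the DIF literature. It is an illustration only and is not necessarily the exact specification used in the thesis; the symbols $u_{ij}$, $\theta_j$, $g_j$ and the coefficients $\beta_0,\dots,\beta_3$ are assumed notation, not taken from the source.

\[
\operatorname{logit} P(u_{ij} = 1) = \beta_0 + \beta_1 \theta_j + \beta_2 g_j + \beta_3 \theta_j g_j
\]

Here $u_{ij}$ is the response of examinee $j$ to item $i$, $\theta_j$ is a matching ability measure (for example, total test score), and $g_j$ is a group indicator (here, gender). Under this formulation, a nonzero $\beta_2$ with $\beta_3 = 0$ corresponds to uniform DIF, while a nonzero interaction term $\beta_3$ corresponds to nonuniform DIF.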
Date of Award: 1 Aug 2011
Original language: English
Awarding Institution
  • The University of Manchester
Supervisor: Julian Williams


  • Gender differences
  • Validity
  • Differential functioning
