
University of Nebraska - Lincoln

DigitalCommons@University of Nebraska - Lincoln

Public Access Theses and Dissertations from the

College of Education and Human Sciences

Education and Human Sciences, College of (CEHS)

8-2016

THE EFFECTS OF MISSING DATA TREATMENT ON PERSON ABILITY ESTIMATES USING IRT MODELS

Sonia Mariel Suarez Enciso

University of Nebraska-Lincoln, [email protected]

Follow this and additional works at: http://digitalcommons.unl.edu/cehsdiss

Part of the Educational Psychology Commons

This Article is brought to you for free and open access by the Education and Human Sciences, College of (CEHS) at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Public Access Theses and Dissertations from the College of Education and Human Sciences by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln.

Suarez Enciso, Sonia Mariel, "THE EFFECTS OF MISSING DATA TREATMENT ON PERSON ABILITY ESTIMATES USING IRT MODELS" (2016). Public Access Theses and Dissertations from the College of Education and Human Sciences. 274.

http://digitalcommons.unl.edu/cehsdiss/274


Page 80

79

the data are missing according to the person’s response and it is said to be MAR. The third category refers to the missingness caused by “don’t know” or “not applicable” responses. This missingness is not related to the ability being measured and thus is another type of MAR. Finally, the last category also has an unknown cause, but in this case the nonresponse is dependent on the person’s ability (MNAR).

According to these categories, Holman and Glas (2005) presented four different IRT-based approaches that model the missingness mechanism, three of them being different cases of MNAR and one of MAR. Basically, Holman and Glas (2005) assume that there are two latent traits that drive the person’s decision about whether to answer an item. The first one is the response propensity, ξ, which represents individual characteristics (e.g., personality trait, omission propensity, etc.) that affect the person’s propensity to answer an item. This latent trait is not measured by the instrument. The second latent trait is the person ability, θ, that is measured with the instrument. The four approaches are based on the dependence of both observed response and nonresponse on these two latent traits, and the relationship between them, ρ(θ, ξ).

In model 1 (G1), the probability of a particular observation for person i on item j, xij, depends on θ; the probability of a particular nonresponse (dij) depends on ξ; and there is no relationship between θ and ξ, ρ(θ, ξ) = 0.0. In this case, the missing data are ignorable (missing at random). The nonignorable missingness is modeled in three different situations. In all the situations it is assumed that θ and ξ have a common distribution. In model 2 (G2), the probability of xij depends on θ; the probability of dij depends on ξ; and there is a relationship between θ and ξ, ρ(θ, ξ) ≠ 0.0. In model 3 (G3), the probability of xij depends on θ, but the probability of dij depends on both θ and ξ, and ρ(θ, ξ) ≠ 0.0. Finally, in model 4 (G4), the probabilities of xij and dij depend on both θ and ξ, and ρ(θ, ξ) ≠ 0.0.
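A small simulation may help to make the difference between G1 and G2 concrete. The sketch below is not Holman and Glas’s (2005) implementation. It assumes Rasch-type functions for both the response and the missingness, standard normal latent traits, and a coding in which dij = 1 means that the response was observed; the function names and the item locations are hypothetical and chosen only for illustration.

```python
import math
import random


def rasch_prob(trait, location):
    """Rasch-type probability of scoring 1 given a latent trait and an item location."""
    return 1.0 / (1.0 + math.exp(-(trait - location)))


def simulate_missingness(n_persons, difficulties, miss_locations, rho, seed=0):
    """Return one list per person of (x_ij, d_ij) pairs.

    Assumptions (not from the source): theta and xi are standard normal with
    correlation rho, and d_ij = 1 means the response was observed.
    rho = 0.0 mimics G1 (ignorable); rho != 0.0 mimics G2 (nonignorable).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)  # person ability
        # Response propensity with unit variance and correlation rho with theta.
        xi = rho * theta + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        row = []
        for b, g in zip(difficulties, miss_locations):
            x = 1 if rng.random() < rasch_prob(theta, b) else 0  # depends on theta only
            d = 1 if rng.random() < rasch_prob(xi, g) else 0     # depends on xi only
            row.append((x, d))
        data.append(row)
    return data


def proportion_correct(data, observed_only):
    """Proportion of correct answers over all cells or over observed cells only."""
    scores = [x for row in data for (x, d) in row if d == 1 or not observed_only]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for rho in (0.0, 0.8):
        sim = simulate_missingness(2000, [0.0] * 10, [0.0] * 10, rho)
        print(rho, proportion_correct(sim, False), proportion_correct(sim, True))
```

With ρ(θ, ξ) = 0.8, the proportion correct among the observed responses should tend to exceed the proportion in the complete data, because persons who respond more often also tend to be more able. With ρ(θ, ξ) = 0.0, the two proportions should be close to each other.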

Although these “missingness models” can be combined with different data analysis models, the authors only worked with them in conjunction with IRT analysis models (GPCM, PCM, 2PL, and Rasch) using MML. Factors for the study with simulated dichotomous data and 50% of the data missing were sample size (N = 500, 1000, and 2000), test length (J = 10, 20, and 30), and the level of correlation between the two latent traits, ρ(θ, ξ) (0.0, 0.1, 0.2, …, 0.9). The closer ρ(θ, ξ) is to 1 (normally positive), the more likely the missingness can be considered nonignorable. Holman and Glas showed that the higher the ρ(θ, ξ), the greater the bias of item parameter estimates if the missingness is treated as ignorable when analyzing the data. That is, ignoring nonignorable missing data yielded biased estimators. The bias in the recovery of the item parameter estimates was higher for shorter tests and smaller samples. However, these biases can be reduced by modeling the missingness through the above-described models.

Additionally, Holman and Glas (2005) tested the proposed models with empirical data (32 five-point rating scale items) in two ways. First, they modeled the missingness with the Rasch model and the observed data with the PCM. Second, the missingness was modeled with the 2PL and the observed data with the GPCM. In both ways, they used the four missingness models and an additional model in which there was only one latent trait (i.e., the correlation ρ(θ, ξ) = 1) that determined the probabilities of both xij and dij (named G0). They found that (a) the GPCM fitted the observed data better than the PCM and that G3 (with GPCM)

Page 160

159

Endnotes

1 R has a package called mice (multivariate imputation by chained equations) that implements this method. Given its popularity, FCS is sometimes called MICE.

2 The National Assessment of Educational Progress (NAEP) defines a not-reached item as one “to which the student did not respond because the time limit was up for the section of the assessment on which s/he was working. After the first ‘not reached’ item, the student will have no responses to any further questions on that section of the assessment” (NAEP Glossary, n.d.). Therefore, the first item with a missing response is treated as [intentionally] omitted and the following non-responses are treated as not administered (Mislevy & Wu, 1988). The Australian Council for Educational Research (ACER) defines items as not reached when there are more than two blank answers.

3 Argentina, Brazil, Chile, Colombia, Costa Rica, Cuba, Ecuador, El Salvador, Guatemala, Honduras, Mexico, Nicaragua, Panama, Paraguay, Dominican Republic, Uruguay, and the Mexican State of Nuevo Leon.

4 Website: http://www.unesco.org/new/en/santiago/education/education-assessment/, and for the SERCE data: http://www.unesco.org/new/fileadmin/MULTIMEDIA/FIELD/Santiago/zip/bcf362e6.zip

5 In TIMSS (2011), 3.2% and 4.5% of the students had omitted and not-reached responses, respectively. In PIRLS (2011), 8.9% of students omitted responses (Foy et al., 2011; Organisation for Economic Co-operation and Development, 2012).

6 It is not possible to talk about not-reached and omitted responses in rating scale data; therefore, non-answered items are referred to as missing responses.

7 RDS replaces a missing value with a random draw from the permitted response options. IAS imputes (a) the incorrect answer, when the item is scored as right or wrong, or (b) the answer that is socially most undesirable (i.e., worst-case scenario) for attitude items. IMS imputes the missing values with the mean of the observed cases on the item. PMS replaces missing values with the average of the observed responses for each case. CIM adjusts the item mean by taking into account the respondent’s ability. ICS imputes the missing value with the observed response on the item with which the item with missing values has the highest correlation. HNC uses as the donor the first complete case after the incomplete case. HDD uses the complete case for which the distance from the incomplete case is minimized. HDR first selects several donors with a small distance from the incomplete case; then, one of them is randomly selected (Huisman, 2000).

8 “Even though Schafer (1997) provided a way to combine likelihood ratio test statistics in MI, no empirical studies have evaluated the performance of this pooled likelihood ratio test under various data condition. Also, this test has not been incorporated into popular statistical packages” (Dong & Peng, 2013, p. 15).

9 There is a website that more formally tracks the work done with MI, http://www.stefvanbuuren.nl/mi/index.html. However, this statement is based simply on comparing the number of papers that either have the methods in their title or mention them in the document.

10 Mean conditional on the covariates (CM): “imputes the mean based on the available scores across all items of all persons within the same covariate class, and imputes this mean for each missing in this covariate class.” Overall mean (OM): imputation based on the data matrix mean. Two-way imputation (TW): the imputation for the missing observation (i, j) = IMS + PMS – OM (Bernaards & Sijtsma, 2000). The two-way imputation with normally distributed error (TW-E) is an imputation method that corrects both for person effect and item effect, and adds a random error drawn from a normal distribution (µ = 0, σɛ²) to the imputation process. The corrected item-mean with normally distributed error (CIM-E) implies that “the item mean is corrected for person i’s score level relative to the mean of the items to which he/she responded. Normally distributed errors are added to CIMij using a procedure similar to the one used for adding normally distributed errors in method TW-E” (van Ginkel et al., 2007, pp. 391-393).
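The two-way formula in this endnote, TW = IMS + PMS – OM, can be illustrated with a minimal sketch. It assumes that missing scores are coded as None, that every person and every item has at least one observed score, and that the imputed value is left unrounded; the function name and the small data matrix are hypothetical.

```python
def two_way_impute(data):
    """Replace each None with PMS_i + IMS_j - OM, computed from observed scores only."""

    def mean(values):
        observed = [v for v in values if v is not None]
        return sum(observed) / len(observed)

    n_items = len(data[0])
    overall_mean = mean([v for row in data for v in row])                    # OM
    person_means = [mean(row) for row in data]                               # PMS
    item_means = [mean([row[j] for row in data]) for j in range(n_items)]    # IMS

    return [
        [v if v is not None else person_means[i] + item_means[j] - overall_mean
         for j, v in enumerate(row)]
        for i, row in enumerate(data)
    ]


if __name__ == "__main__":
    scores = [[1, 2, 3],
              [2, None, 4],
              [3, 4, 5]]
    # Person mean 3.0, item mean 3.0, overall mean 3.0, so the cell is filled with 3.0.
    print(two_way_impute(scores)[1][1])
```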

11 The factor loading recovery was measured with Tucker’s ϕ (Burt, 1948; Tucker, 1951) and the average D² in Bernaards and Sijtsma (1999). In Bernaards and Sijtsma’s (2000) study, the average D² and Πγ (i.e., the product of estimated eigenvalues) were the indicators. Tucker’s ϕ is a coefficient of congruence that measures the similarities between the factors derived from factor analysis. It is basically a correlation coefficient. The average D² index is the average of the D² across all the sample replications within each condition. D² is the sum of squared differences between the factor loadings based on the complete data and the corresponding factor loadings based on the datasets imputed with the aforesaid methods, divided by the number of extracted factors (Bernaards & Sijtsma, 1999, 2000).
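As an illustration of the congruence coefficient mentioned in this endnote, the sketch below computes Tucker’s ϕ for two vectors of factor loadings using its usual definition, the sum of cross-products divided by the square root of the product of the sums of squares. The function name and the loadings are hypothetical, and the sketch does not cover the D² index.

```python
import math


def tucker_phi(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two vectors of factor loadings."""
    cross_products = sum(a * b for a, b in zip(loadings_a, loadings_b))
    sum_sq_a = sum(a * a for a in loadings_a)
    sum_sq_b = sum(b * b for b in loadings_b)
    return cross_products / math.sqrt(sum_sq_a * sum_sq_b)


if __name__ == "__main__":
    complete = [0.80, 0.70, 0.60, 0.50]   # loadings from the complete data (made up)
    imputed = [0.75, 0.72, 0.55, 0.40]    # loadings after imputation (made up)
    print(tucker_phi(complete, imputed))
```

Unlike the Pearson correlation, the loadings are not centered before the cross-products are taken, so proportional loading vectors yield a coefficient of 1 regardless of their scale.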

12 BIC: Bayesian Information Criterion; AIC: Akaike Information Criterion; AIC3: a modified version of the AIC (Vermunt, van Ginkel, van der Ark, & Sijtsma, 2008).

13 Sijtsma and van der Ark’s (2003) study consists of two main parts. Only one part is presented in this document. The second part of the study refers to two methods to determine the missingness mechanism, originally proposed by Huisman (1999). One of them works at the data matrix level (Huisman’s (1999) asymptotic test), while the second works at the item level. For details, see Sijtsma and van der Ark (2003).

14 “R1c tests whether the response functions of the J items are logistic with the same slope against the alternative that they deviate from these conditions, and statistic Q2 tests whether the test is unidimensionality against the alternative of multidimensionality” (Sijtsma & van der Ark, 2003, p. 520).

15 SERCE missing data were recoded following the procedure described by other large-scale assessments (e.g., PISA, TIMSS, and PIRLS). That is, the first missing response in the trailing blank-response string was considered omitted and the rest were coded as not-reached. For example, a student’s response pattern such as 43231Z1Z43442Z3ZZZZZZZZZ (where “Z” is SERCE’s code for missing responses) was recoded as 43231Z1Z43442Z3ZRRRRRRRR, where “R” denotes not-reached responses. Notice that the first “Z” of the trailing string was kept, given that this item is normally taken as reached and thus intentionally omitted (Mislevy & Wu, 1988).
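The recoding rule in this endnote can be sketched as follows. This is only an illustration of the rule as described here, not the algorithm used for the SERCE data; the function name and its default codes are assumptions.

```python
def recode_not_reached(pattern, missing="Z", not_reached="R"):
    """Keep the first missing code of the trailing run and recode the rest as not reached."""
    stripped = pattern.rstrip(missing)            # responses before the trailing run
    n_trailing = len(pattern) - len(stripped)     # length of the trailing run
    if n_trailing <= 1:
        return pattern                            # nothing after the first trailing blank
    return stripped + missing + not_reached * (n_trailing - 1)


if __name__ == "__main__":
    # Pattern taken from the endnote; embedded "Z" codes are left untouched.
    print(recode_not_reached("43231Z1Z43442Z3ZZZZZZZZZ"))
```

Missing codes that are followed by at least one answered item remain coded as omitted, since only the run of blanks at the end of the string is affected.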

16 Thanks to Yem Ahiatsi for writing the algorithm.

17 Thanks to Dr. Rafael De Ayala for writing the algorithm.

18 In the imputation with the regression model, variables with non-missing values are considered covariates in the imputation process.
