Measuring Reliability
Measuring Reliability of Instrumentation
For evaluation instruments, reliability is a measure of the consistency and stability of the participants’ results. It answers the question “Does this instrument yield repeatable results?” If you are a teacher, you know that creating a test for assessing students’ knowledge of while loops or conditional statements can be time-consuming. You want to ensure that the test or quiz measures the same knowledge for every student who takes it.
We can measure instruments for reliability in several ways. The results of one or more measures provide evidence of reliability for the instrument.
Internal consistency reliability (Reliability across items)
Suppose you have a set of six Likert-scale items designed to measure the self-efficacy of students who are studying programming for the very first time, and all items are stated in the positive (“I am confident I can learn programming” instead of “I am not confident I can learn programming”). You would not expect students to answer “Strongly Agree” on three items and “Strongly Disagree” on the other three. If they did, a test for internal consistency, such as Cronbach’s α (the Greek letter alpha), would indicate that the three “Strongly Disagree” items should be removed from the instrument.
There are several ways to test for internal consistency. Cronbach’s α is the most popular. A value of 0.80 or higher is generally considered to indicate that the items measuring a construct have good internal reliability. Note, however, that as the number of items in a scale increases, it becomes possible to obtain a relatively high Cronbach’s α even when the correlations between items are relatively low.
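As a rough illustration of how Cronbach’s α can be computed, the sketch below uses the standard formula α = (k / (k − 1)) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the response data are hypothetical and are not taken from any real instrument; sample variances are assumed.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows (one row of k item scores per participant)."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two participants")
    # Sample variance of each item (column) across participants.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each participant's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: 5 students answering 6 positively worded Likert items (1-5).
responses = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 4, 3, 4, 4],
    [2, 2, 1, 2, 2, 1],
    [3, 3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Because each student in this made-up sample answers all six items in a similar way, α comes out well above the 0.80 guideline. With real data, statistical packages typically also report how α would change if each item were dropped.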
Inter-rater reliability (Reliability across researchers)
Evaluation instruments are no different. Whenever two or more researchers score participants’ answers to open-ended questions, whether in a qualitative study or as part of a quantitative study, the researchers need to be trained in how to interpret (or code) the results. This can be done in various ways, including keeping track of who rated each item, always having two or more researchers rate the same items and then averaging their scores, and calibrating the rating methods by having all researchers score the same five participant results and then discussing how and why their scores differed.
While this form of reliability will frequently show up in research studies, it is less likely to show up in the description of an instrument, as high inter-rater reliability is more a measure of how well a study has been carried out than how well a rubric or set of open-ended questions has been designed (though there is some relationship).
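The text above does not name a specific statistic for agreement between raters. One commonly used option for two raters assigning categorical codes is Cohen’s kappa, which compares the observed agreement with the agreement expected by chance. The sketch below is a minimal illustration under that assumption; the function name `cohens_kappa` and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same responses."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of responses")
    n = len(rater_a)
    # Proportion of responses on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's own code frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes (0 = incorrect, 1 = partial, 2 = correct) for 10 answers.
rater_a = [2, 1, 2, 0, 2, 1, 0, 2, 1, 0]
rater_b = [2, 1, 2, 0, 1, 1, 0, 2, 2, 0]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this made-up example the raters agree on 8 of 10 answers, but kappa is lower than 0.8 because some of that agreement would be expected by chance alone.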
Test-retest reliability (Reliability across time)
Parallel forms reliability and Split-half reliability
Split-half reliability is very similar. In this case, a single test, perhaps one with a larger set of questions, is divided into two halves. Both halves are given to the same participants, and the scores for each half are then compared with each other. The set of questions that gives the most consistent results is then used. Consistency is measured through a correlation (Pearson’s r or Spearman’s rho) between the two halves of the instrument. The resulting coefficient is then adjusted using the Spearman-Brown formula to determine the split-half reliability coefficient, which is the aggregate measure of reliability for the full-length instrument.
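The calculation described above can be sketched as follows. This is a minimal illustration that assumes each participant answers every item, so that the two half-scores are paired; that the items are split into odd-numbered and even-numbered positions (one of several possible splits); and that Pearson’s r is used. The Spearman-Brown formula for two halves is 2r / (1 + r). All function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a half has zero variance")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r):
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r / (1 + r)


def split_half_reliability(responses):
    """Correlate odd-position and even-position half scores, then adjust."""
    first_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r = pearson_r(first_half, second_half)
    return r, spearman_brown(r)


# Hypothetical data: 5 participants, 6 items scored 1-5.
responses = [
    [5, 4, 5, 4, 5, 4],
    [4, 4, 4, 3, 4, 4],
    [2, 2, 1, 2, 2, 1],
    [3, 3, 3, 3, 4, 3],
    [1, 2, 1, 1, 2, 2],
]
r, reliability = split_half_reliability(responses)
print(round(r, 3), round(reliability, 3))
```

The adjustment is needed because each half contains only half of the items, and shorter tests tend to be less reliable; the Spearman-Brown step estimates what the reliability would be for the full-length instrument.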
Cite this page
To cite this page, please use:
McGill, Monica M. and Xavier, Jeffrey. 2019. Measuring Reliability and Validity. Retrieved from https://csedresearch.org
