An Introduction to Criterion-referenced Item Analysis and Item Response Theory

Abstract: Item Analysis (IA) is a term used, more specifically in the context of classical measurement, to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. Its purpose is to select which items will remain on future revised and improved versions of the test. Sometimes, item analysis is also performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. Item Response Theory (IRT), or latent trait theory as it has been variously termed, is a general measurement theory developed independently by Birnbaum in the United States and by Rasch in Denmark. It refers primarily, but not entirely, to three families of analytical procedures, identified as the one-parameter, two-parameter and three-parameter logistic models. What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and person response patterns, given the overall pattern of responses in a set of test data. They also offer new and improved ways of estimating item difficulty and person ability. This article is written to illustrate why item analysis and Item Response Theory are useful for teachers when they construct tests and examine measurement equivalence. The introduction contains three parts: 1. criterion- or domain-referenced vs. norm-referenced or standardized tests; 2. item analysis; 3. item response theory. It will briefly show the difference between Criterion-referenced Test (CRT) item analysis and Norm-referenced Test (NRT) item analysis; because of the complexity of IRT, only the three-parameter model will be introduced.

Key words: criterion-referenced test; item analysis; item response theory; test
1. Introduction

Item analysis (IA) is an aspect of test analysis which involves examination of the characteristics of test items (Alan Davies et al., 1999). The term is used more specifically in the context of classical measurement to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. Its purpose is to select which items will remain on future revised and improved versions of the test. Sometimes, item analysis is also performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. Item Response Theory (IRT), or latent trait theory as it has been variously termed, is a general measurement theory developed independently by Birnbaum in the United States and by Rasch in Denmark (Grant Henning, 2001); it refers primarily, but not entirely, to three families of analytical procedures, identified as the one-parameter, two-parameter and three-parameter logistic models. What these models have in common is a systematic procedure for considering and quantifying the probability or improbability of individual item and person response patterns, given the overall pattern of responses in a set of test data. They also offer new and improved ways of estimating item difficulty and person ability.
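As a point of reference for the discussion that follows, the three-parameter logistic model, which this paper introduces in more detail later, can be written in its standard form as a short function. The Python sketch below is illustrative only and is not drawn from the sources cited above; a is the item discrimination parameter, b the item difficulty, c the guessing (lower-asymptote) parameter, and theta the examinee's ability.

import math

def p_3pl(theta, a, b, c):
    # Probability of a correct response under the three-parameter logistic model:
    # P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Example: an item with discrimination a = 1.0, difficulty b = 0.0 and a guessing
# floor of c = 0.2 gives an examinee of average ability (theta = 0) a probability
# of 0.6 of answering correctly.
print(round(p_3pl(0.0, a=1.0, b=0.0, c=0.2), 2))  # 0.6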
2. Criterion- or Domain-referenced vs. Norm-referenced or Standardized Tests

From the definition of IA, we can see that IA, belonging more specifically to classical measurement, is concerned with item difficulty, or Item Facility (IF), and Item Discrimination (ID), which are used in traditional Norm-referenced Test (NRT) item analysis (James D. Brown and Thom Hudson, 2002). Before we study item analysis, we should first understand what CRTs and NRTs are.

2.1 Criterion-referenced tests
A general definition for a Criterion-referenced Test (CRT) was first provided by Glaser in 1963 (Rui Huang, a: 2004). He defined criterion-referenced measures as those indicating the content of the behavioral repertory and the correspondence between what an individual does and the underlying continuum of achievement: "Measures which assess student achievement in terms of a criterion standard thus provide information as to the degree of competence attained by a particular student which is independent of reference to the performance of others" (p. 519). Later, in 1971, he and Nitko gave a clear and simple definition of a CRT: "A criterion-referenced test is one that is deliberately constructed to yield measurements that are directly interpretable in terms of specified performance standards. Performance standards are generally specified by defining a class or domain of tasks that should be performed by the individual" (p. 653). In this sense, criterion-referenced tests are useful for teachers both in clarifying teaching objectives and in determining the degree to which they have been met. CRTs are also often used for professional accreditation purposes, i.e. the test represents the types of behaviors considered critical for participation in the profession in question (Alan Davies: 38). Test scores of CRTs report a candidate's ability in relation to the criterion, i.e. what the candidate can and cannot do, rather than comparing his/her performance with that of other candidates in the relevant population, as happens in norm-referenced tests. Test results are often reported using descriptive scales (e.g. percentages) rather than a numerical score. In contrast to norm-referenced tests, the criterion, or cut-score, is set in advance (Rui Huang, b: 2004).

2.2 Norm-referenced tests
A norm-referenced test is a type of test whereby a candidate's scores are interpreted with reference to the performance of the other candidates. Thus the quality of each performance is judged not in its own right, or with reference to some external criterion, but according to the standard of the group as a whole. In other words, norm-referenced tests are more concerned with spreading individuals along an ability continuum, the normal curve, than with the nature of the task to be attained, which is the focus of criterion-referenced tests (Alan Davies, 1999). Where an alternate version of a norm-referenced test is being developed, interpretation of raw scores on the new version of the test may be made in the light of normative performance (i.e. the mean and standard deviation) on the previous version, as is the case for widely administered tests such as TOEFL. For norm-referencing to be effective, it is important that there be a large number of subjects and a wide range of normally distributed scores. The ranking capacity of norm-referenced tests is sometimes used to set cut-off scores, so that, for example, only those examinees who score at least 60% on the test are allowed to pass (Huizhong Yang, 2001).
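As a small illustration of this kind of interpretation (a hypothetical sketch, not part of the cited sources), a raw score can be placed against the mean and standard deviation of a norm group and checked against a fixed cut-off such as 60% of the maximum score:

import statistics

def norm_referenced_report(raw_score, norm_scores, cut_off=0.60, max_score=100):
    # The z-score locates the candidate relative to the norm group's mean and
    # standard deviation; the pass/fail decision uses a fixed cut-off score.
    mean = statistics.mean(norm_scores)
    sd = statistics.stdev(norm_scores)
    z = (raw_score - mean) / sd
    passed = raw_score >= cut_off * max_score
    return z, passed

# Hypothetical norm group: scores from a previous administration of the test.
previous_scores = [48, 55, 59, 62, 66, 70, 71, 74, 78, 83]
print(norm_referenced_report(66, previous_scores))  # roughly average, passes the 60% cut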
2.3 Distinctions between criterion-referenced and norm-referenced testing

As many educators and members of the public fail to grasp the distinctions between CRTs and NRTs, we may draw a chart comparing these two types of tests in terms of purpose, content, item characteristics and score interpretation. Much confusion can be eliminated if the basic differences are understood. The following chart, adapted from Popham, J. W. (1975), clearly distinguishes CRTs from NRTs.

Dimension: Purpose
Criterion-referenced tests: To determine whether each student has achieved specific skills or concepts; to find out how much students know before instruction begins and after it has finished.
Norm-referenced tests: To rank each student with respect to the achievement of others in broad areas of knowledge; to discriminate between high and low achievers.

Dimension: Content
Criterion-referenced tests: Measures specific skills which make up a designated curriculum. These skills are identified by teachers and curriculum experts. Each skill is expressed as an instructional objective.
Norm-referenced tests: Measures broad skill areas sampled from a variety of textbooks, syllabi, and the judgments of curriculum experts.

Dimension: Item characteristics
Criterion-referenced tests: Each skill is tested by at least four items in order to obtain an adequate sample of student performance and to minimize the effect of guessing. The items which test any given skill are parallel in difficulty.
Norm-referenced tests: Each skill is usually tested by fewer than four items. Items vary in difficulty. Items are selected that discriminate between high and low achievers.

Dimension: Score interpretation
Criterion-referenced tests: Each individual is compared with a preset standard for acceptable achievement. The performance of other examinees is irrelevant. A student's score is usually expressed as a percentage. (Rui Huang, b)
Norm-referenced tests: Each individual is compared with other examinees and assigned a score, usually expressed as a percentile, a grade-equivalent score, or a stanine. A student's score is usually expressed as a percentile.

3. Item Analysis
In most language testing situations we are concerned with the writing, administration, and analysis of appropriate items. A test is considered to be no better than the items that go into its composition. Weak items should be identified and removed from the test. Thus, there are certain principles we can follow in writing items that may ensure greater success when the items undergo formal analysis. There are several ways to define item analysis (IA). Jack C. Richards et al. (1992) defined IA as the analysis of the responses to the items in a test, in order to find out how effective the test items are and whether they indicate differences between good and weak students. As Alan Davies notes in his Dictionary of Language Testing (1999: 92), IA is an aspect of test analysis which involves examination of the characteristics of test items; the term is used more specifically in the context of classical measurement to refer to the application of statistical techniques to determine the properties of test items, principally item difficulty and item discrimination. On the whole, IA will be defined in this paper as the systematic statistical evaluation of the effectiveness of individual test items. It is usually done for the purpose of selecting which items will remain on future revised and improved versions of the test. Sometimes, however, item analysis is performed simply to investigate how well the items on a test are working with a particular group of students, or to study which items match the language domain of interest. IA can take numerous forms, but when testing for norm-referenced purposes there are two traditional item statistics that are typically applied: item facility and item discrimination. In developing CRTs, other statistics are used instead: the difference index, the B-index, the agreement statistic, and item phi (φ).

3.1 Traditional item analysis

Traditional NRT item analysis has been used for many years. It almost always refers to multiple-choice tests (Robert Wood, 1993).

3.1.1 Item facility
Item Facility goes by many other names: item difficulty, item easiness, p-value, or simply the abbreviation IF (James D. Brown: 114). Regardless of what it is called, it is a measure of the ease of a test item: the proportion of the students who answered the item correctly. It may be determined by the formula

Item Facility (IF) = R / N

where R = the number of correct answers and N = the number of students taking the test. The higher the ratio of R to N, the easier the item (Jack C. Richards: 240). It is important to note that this formula assumes that items left blank by examinees are wrong (James D. Brown et al.: 114).
Calculating IF will result in values ranging from 0 to 1.00 for each item. For instance, an IF index of 0.21 (item 23 in Table 1) would indicate that 21% of the examinees answered the item correctly. This would seem to be a very difficult item, because 79% are missing it. An IF of 0.94 (item 20 in Table 1) would indicate that 94% of the examinees answered correctly, a very easy item, because almost all of the examinees got it right. The apparently simple information provided by the item facility statistic can prove very useful. Consider the pattern of right and wrong answers shown in Table 2 (taken from Rui Huang, b: 2004). The examinees' responses are recorded as 1s for correct answers and 0s for incorrect answers. Notice that item 4 was answered correctly by every examinee (as indicated by the 1s straight down that column). It is equally easy to identify that item 9 is the most difficult, because no examinee answered it correctly.
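To make the calculation concrete, the short Python sketch below computes IF item by item from a 0/1 response matrix of the same kind as Table 2 below; the matrix shown here is a small hypothetical example, not the data from the tables in this paper, and blank answers are assumed to have been coded as 0 (wrong):

def item_facility(responses):
    # IF = R / N: R is the number of correct answers (1s), N the number of
    # examinees who took the item; blanks must already be coded as 0.
    return sum(responses) / len(responses)

# Hypothetical response matrix: one row per examinee, one column per item.
matrix = [
    [1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
]

# Compute IF for each item (i.e. for each column of the matrix).
for item_number, item_responses in enumerate(zip(*matrix), start=1):
    print(f"Item {item_number}: IF = {item_facility(item_responses):.2f}")
# Item 3 (IF = 1.00) was answered correctly by everyone;
# item 5 (IF = 0.00) was missed by everyone.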
According to the IF formula we can calculate the IF values of all the items; these are shown in Table 3. Therefore, item 9 and item 4 may be revised or rejected.

Table 2: Item analysis data (first 10 items only)

Student No.   Items 1-10 (1 = correct, 0 = incorrect)   Total (%)
36            1 1 1 1 1 0 1 1 0 1 ...                   96
37            0 1 1 1 1 0 1 0 0 1 ...                   95
38            0 0 1 1 1 0 1 0 0 1 ...                   92
39            1 0 1 1 1 0 0 0 0 1 ...                   91
40            1 1 1 1 0 0 1 0 0 1 ...                   90
41            1 0 1 1 0 0 1 1 0 1 ...                   90
42            0 1 1 1 1 0 1 0 0 1 ...                   88
43            1 0 1 1 0 0 1 0 0 1 ...                   80
44            0 1 1 1 0 1 1 0 0 1 ...                   79
45            0 1 1 1 0 1 0 0 0 1 ...                   72
46            1 0 0 1 1 1 1 1 0 1 ...                   67
47            1 0 0 1 0 1 0 1 0 1 ...                   66
48            1 1 0 1 0 1 0 0 0 1 ...                   64
49            0 0 0 1 0 1 0 1 0 1 ...                   64
50            0 0 0 1 0 1 1 0 0 0 ...                   61

3.1.2 Item discrimination
Another important characteristic of a test item is how well it discriminates between weak and strong examinees in the ability being tested. Difficulty alone is not sufficient information upon which to base the decision ultimately to accept or reject a given item. Consider, for example, item 15 (IF = 0.50) in Table 1, which half of the examinees pass and half fail. Using difficulty as the sole criterion, we would adopt this item as an ideal item. But what if we discovered that the persons who passed the item were the weaker half of the examinees, and the persons who failed the item were the stronger examinees in the ability being measured? This would certainly cause us to have second thoughts about the suitability of such an item. If our test were composed entirely of such items, a high score would be an indication of inability and a low score would be an indication of comparative ability. What we need at this point is a method of computing item discrimination. Item discrimination (ID) is an entirely different statistic, which shows the degree to which an item separates the “upper” examinees from the “lower” ones. These groups may also be called the “high” and “low” scorers or the “upper” and “lower” proficiency groups. Usually the upper and lower groups are defined as the upper and lowe