
Lectures

Videos and Explanation:
  • Reliability & Validity by zapohd1965 (Parts 1 and 2)
  • Reliability & Validity by miamipsych293 (Parts 1, 2 and 3)
Improving Reliability

  • Clear conceptualization
  • Standardization
  • Increase the number of items
  • Use more precise measurement
  • Use multiple indicators
  • Pilot testing and replication

Factors That Can Affect Inter-rater Reliability

There are several factors that can affect inter-rater reliability:

  • Knowledge background.
    Markers' knowledge backgrounds can affect the reliability of the marking process: their views on certain issues may differ, which reduces their agreement and leads to differences in evaluation during marking.

  • Markers' experience in marking.
    A group of markers with very different levels of experience may disagree with one another. Experience shapes how markers perceive candidates' performances, making it a major factor affecting inter-rater reliability.

Intra-rater Reliability

Intra-rater reliability refers to the degree of agreement among multiple repetitions of an assessment (for example, marking the same scripts more than once) performed by a single rater. In test marking, a marker should be consistent: he or she should award the same performance the same score under different circumstances, so that the judgment is not altered by external factors. For this reason, routine double marking is recommended, with the two scores averaged to give the final result for the performance.

In practice, intra-rater reliability is checked by having the examiner re-mark performances (scripts) that he or she has already marked. The fact that the scripts are being re-marked should be concealed from the examiner, so that it does not influence the marking. Once the second set of scores is obtained, the Team Leader for the examiners can determine the correlation, means, and standard deviations of the two markings, and the examiner's reliability can then be analyzed for further action.
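
As a rough illustration of this checking step, here is a minimal Python sketch, with invented marks, of how a Team Leader might compare an examiner's first and second markings of the same scripts:

```python
# Minimal sketch: comparing an examiner's first and second marking of the same scripts.
# The marks below are invented for illustration.
from statistics import mean, stdev

first_marking  = [65, 72, 58, 80, 47]   # marks awarded on the first pass
second_marking = [70, 68, 60, 78, 50]   # marks awarded when re-marking the same scripts

print("1st marking: mean =", mean(first_marking), "SD =", round(stdev(first_marking), 2))
print("2nd marking: mean =", mean(second_marking), "SD =", round(stdev(second_marking), 2))

# Similar means and standard deviations (together with a high correlation between
# the two sets of marks) suggest the examiner is marking consistently.
```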


Importance Of Intra-rater Reliability

  • Improves a single marker's consistency in evaluating candidates' performance.
  • Eliminates external factors that can compromise the marker's evaluation.
  • Increases confidence in the marker's ability.
  • Increases the marker's competence across repeated markings.

Factors That Can Affect Intra-rater Reliability

There are several factors that might affect intra-rater reliability:

  • Time.
    The time allocated for marking is a major factor in intra-rater reliability. For example, allocating more time helps the marker stay focused, so that care is not given only to the papers marked early in the session.

  • Marker's condition.
    Fatigue and loss of focus are common problems for markers. The marker's condition plays a vital role in intra-rater reliability, as it can compromise the marker's judgment during marking.

  • Experience.
    Lack of experience is another common factor in language marking. A good examiner must have an in-depth understanding of the subject and be able to evaluate consistently under a range of circumstances, so a marker's experience is another significant factor in intra-rater reliability.

How to Calculate the Rank Order Correlation (inter-rater and intra-rater)

i.  Carry out a rank order correlation of the first and second sets of marks.
    This is to check intra-rater reliability.
ii. Correlate the first set of marks with another person's first set of marks.
    This is to check inter-rater reliability.

The formula

ρ = 1 − (6 Σd²) / (N(N² − 1))

N: the number of students
d: the difference between the two ranks for each student
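
For example, with N = 5 students and invented rank differences d of 1, 0, −1, 2 and 0, Σd² = 1 + 0 + 1 + 4 + 0 = 6, so ρ = 1 − (6 × 6) / (5 × (5² − 1)) = 1 − 36/120 = 0.7.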

Sample of data tabulation
i. Intra-rater reliability

Student    | 1st time: Mark | Rank (R1) | 2nd time: Mark | Rank (R2) | d = R1 − R2 | d²
Student 1  |                |           |                |           |             |
Student 2  |                |           |                |           |             |
Student 3  |                |           |                |           |             |
Student 4  |                |           |                |           |             |
Student 5  |                |           |                |           |             |

ii. Inter-rater reliability

Student    | Own 1st time: Mark | Rank (R1) | Other rater's 1st time: Mark | Rank (R0) | d = R1 − R0 | d²
Student 1  |                    |           |                              |           |             |
Student 2  |                    |           |                              |           |             |
Student 3  |                    |           |                              |           |             |
Student 4  |                    |           |                              |           |             |
Student 5  |                    |           |                              |           |             |

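As a rough illustration of steps i and ii above, the following Python sketch (with invented marks, and without handling tied ranks) ranks each set of marks and applies the formula ρ = 1 − 6Σd² / (N(N² − 1)):

```python
# Minimal sketch of the rank order (Spearman) correlation described above.
# Marks are invented for illustration; tied marks are not handled.

def ranks(marks):
    """Rank marks from highest (rank 1) to lowest."""
    ordered = sorted(marks, reverse=True)
    return [ordered.index(m) + 1 for m in marks]

def rank_order_correlation(marks_a, marks_b):
    """rho = 1 - 6 * sum(d^2) / (N * (N^2 - 1)), where d is the difference in ranks."""
    n = len(marks_a)
    d_squared = [(ra - rb) ** 2 for ra, rb in zip(ranks(marks_a), ranks(marks_b))]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))

own_first   = [65, 72, 58, 80, 47]   # own marks, 1st time
own_second  = [70, 68, 60, 78, 50]   # own marks, 2nd time  -> intra-rater check
other_first = [60, 70, 66, 82, 52]   # another rater's marks -> inter-rater check

print("Intra-rater rho:", rank_order_correlation(own_first, own_second))
print("Inter-rater rho:", rank_order_correlation(own_first, other_first))
```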

Inter-rater Reliability

Inter-rater reliability is the degree to which an assessment tool produces stable and consistent results; in other words, the extent to which two or more raters agree. It depends on the raters being consistent in their evaluation of behaviors or skills: they must measure students' competency without bias, and the evaluation system being used must be applied consistently.

Importance of Inter-rater Reliability

  • Imposes a level of objectivity.
  • Provides validation of evaluation results.
  • Certifies skills competency.
  • Assigns the appropriate "grade".
  • Increases confidence that raters are following the guidelines in a like manner.

Reliability and Validity

Outside of statistics, the terms reliability and validity are often used interchangeably. When critical readers of statistics use these terms, however, they refer to different properties of the statistical or experimental method.

Reliability is another term for consistency. If one person takes the same personality test several times and always receives the same results, the test is reliable.

A test is valid if it measures what it is supposed to measure. If the results of the personality test claimed that a very shy person was in fact outgoing, the test would be invalid.

Reliability and validity are distinct properties of a measurement. A measurement may be reliable but not valid; as explained below, however, it cannot be valid without also being reliable.

Analogy
Suppose your bathroom scale was reset to read 10 pounds lighter. The weight it reads will be reliable (the same every time you step on it) but will not be valid, since it is not reading your actual weight.

Reliability is necessary, but not sufficient for validity.

A test can be reliable without being valid. For example, a test on Malaysia's history may produce consistent scores every time it is administered, so it is a reliable measure of students' understanding of that topic. It is not, however, a valid test for foreign students, because the items do not generalize to them. This is what it means for a test to be reliable without being valid.


On the other hand, a test cannot be valid without being reliable. An unreliable test cannot be considered valid, because a valid test must evaluate what it is intended to test, and to do that it must first produce consistent results. Thus, a valid test must be reliable, but a reliable test is not necessarily a valid test.

Factors That Can Affect Reliability

Several factors can affect the reliability of a test (Mehrens & Lehmann, 1991; Sattler, 2001):

  • Test length.
    Generally, the longer a test is, the more reliable it is.

  • Speed.
    When a test is a speed test, reliability can be problematic, and it is inappropriate to estimate reliability using internal consistency, test-retest, or alternate-form methods. This is because not every student is able to complete all of the items in a speed test. In contrast, a power test is one in which every student is able to complete all of the items.

  • Group homogeneity.
    In general, the more heterogeneous the group of students who take the test, the more reliable the measure will be.

  • Item difficulty.
    When there is little variability among test scores, reliability will be low. Thus, reliability will be low if a test is so easy that every student gets most or all of the items correct, or so difficult that every student gets most or all of the items wrong.

  • Objectivity.
    Objectively scored tests show higher reliability than subjectively scored tests.

  • Test-retest interval.
    The shorter the time interval between two administrations of a test, the less likely it is that changes will occur, and the higher the reliability will be.

  • Variation within the testing situation.
    Errors in the testing situation (e.g., students misunderstanding or misreading test directions, noise, distractions, and sickness) can cause test scores to vary.

Types of Reliability


There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:

I. Inter-Rater or Inter-Observer Reliability

  • Used to assess the degree to which different raters or observers give consistent estimates of the same phenomenon.
  • Inter-rater reliability involves two or more judges or raters marking the same paper, so the scores on a test are independent estimates from these judges or raters. A score is a more reliable and accurate measure if two or more raters agree on it, and the extent to which the raters agree determines the level of reliability of the score. In inter-rater reliability, the correlation between the scores of the two judges or raters is calculated.

Intra-rater Reliability

  • Used to assess the consistency of a single rater's judgments of the same performances on different occasions.

  • While inter-rater reliability involves two or more raters, intra-rater reliability is the consistency of grading by a single rater: scores on a test are rated by one rater or judge at different times. When we grade tests at different times, we may become inconsistent in our grading for various reasons. Some papers graded during the day may get our full and careful attention, while others graded towards the end of the day may be quickly glossed over or marked with little attention. As such, intra-rater reliability measures the consistency of a single teacher's or rater's grading of the same papers at different times.

II. Stability (Test-Retest) Reliability

  • Used to assess the consistency of a measure from one time to another.
  • In test-retest reliability, the same test is re-administered to the same people, and the scores obtained on the first administration are correlated with the scores obtained on the second administration. The correlation between the two sets of scores is expected to be high. However, test-retest reliability can be difficult to establish, as it is unlikely that students will take the same test twice; memorization and practice effects will also interfere with the correlation value.
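
A minimal sketch of this idea, assuming the two administrations are simply correlated with Pearson's r (the scores below are invented; statistics.correlation requires Python 3.10+):

```python
# Minimal sketch: correlating two administrations of the same test (test-retest).
from statistics import correlation  # Pearson's r; available from Python 3.10

first_administration  = [55, 78, 62, 90, 48, 70]   # same students,
second_administration = [58, 75, 65, 88, 52, 72]   # same test, a later date

r = correlation(first_administration, second_administration)
print(f"Test-retest reliability estimate: r = {r:.2f}")
```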

III. Parallel-Forms Reliability

  • Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
  • In this type of reliability, two similar tests are administered to the same sample of people with the same level of proficiency. As in test-retest reliability, two scores are obtained and correlated. Unlike test-retest, however, the parallel (or equivalent) forms reliability measure is protected from the influence of memorization, as the same questions are not asked in the second of the two tests.

IV. Split-Half Reliability

  • In split-half reliability, a test is administered once to a group; after the students have returned the test, it is divided into two equal halves, and the scores on the two halves are then correlated. Because reliability is determined internally, within one administration of the test, this method of computing reliability is considered an internal consistency measure. The halves are often determined by the number assigned to each item, with one half consisting of the odd-numbered items and the other half of the even-numbered items.
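
A minimal sketch of the split-half procedure, with invented item scores; the final Spearman-Brown step, which adjusts the half-test correlation up to full-test length, is a standard addition not described above:

```python
# Minimal sketch of split-half reliability: correlate odd-item and even-item half scores.
# Item scores are invented (1 = correct, 0 = wrong); requires Python 3.10+ for correlation().
from statistics import correlation

# Each row holds one student's scores on a 10-item test.
students = [
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
]

odd_half  = [sum(row[0::2]) for row in students]   # items 1, 3, 5, 7, 9
even_half = [sum(row[1::2]) for row in students]   # items 2, 4, 6, 8, 10

r_half = correlation(odd_half, even_half)
# Spearman-Brown correction: estimate the reliability of the full-length test.
full_length_reliability = 2 * r_half / (1 + r_half)

print(f"Half-test correlation: {r_half:.2f}")
print(f"Split-half reliability (Spearman-Brown corrected): {full_length_reliability:.2f}")
```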