Cohen’s Kappa (κ): Measuring Interrater Agreement
Cohen’s kappa quantifies how much two raters agree on categorical assignments after accounting for agreement expected by chance. κ values range from –1 (complete disagreement) through 0 (chance-level agreement) to 1 (perfect agreement). This article defines the statistic, recounts its development, clarifies conceptual assumptions, and highlights applications in medicine, manufacturing, and machine learning.
Key facts
- Quantity. κ = (Po − Pe) ÷ (1 − Pe); dimensionless.
- Interpretation. κ = 1 indicates perfect agreement; κ = 0 indicates chance-level agreement.
- Variants. Weighted κ handles ordinal categories using linear or quadratic weights.
Related articles
- Jaccard Index: Similarity of Sets
Contrast overlap-based similarity with chance-corrected agreement metrics.
- Bit Error Rate: Communication Reliability Metric
See how error rates and agreement statistics both track classification performance.
- Coefficient of Variation: Relative Dispersion Statistic
Combine variability measures with agreement coefficients to evaluate measurement systems.
Calculators
- Confidence Interval Calculator
Compute approximate intervals for κ using standard error formulas and normal approximations.
- Bayesian A/B Test Sample Size
Plan labeling or audit studies with sufficient pairs of ratings to achieve target κ precision.
- Mann–Whitney Sample Size
Estimate sample needs when κ comparisons involve ordinal or ranked categories.
Definition and calculation
Po represents observed agreement (the proportion of items the raters classify identically), while Pe represents the agreement expected by chance, computed from each rater's marginal totals. Subtracting chance agreement from observed agreement and dividing by the maximum possible improvement over chance (1 − Pe) yields κ. Weighted κ assigns partial credit to near agreement on ordered categories, increasing sensitivity for Likert or severity scales. Always disclose the weighting scheme and category prevalence to support reproducibility.
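The arithmetic is compact enough to sketch directly. The Python function below is a minimal illustration assuming a square contingency table whose rows are rater A's categories and columns are rater B's; the function name, weight options, and example counts are hypothetical choices for this sketch, not part of the article.

```python
import numpy as np

def cohens_kappa(table, weights=None):
    """Cohen's kappa from a square contingency table (rows: rater A, columns: rater B).

    weights=None gives the unweighted statistic; "linear" or "quadratic"
    gives weighted kappa with partial credit for near agreement.
    """
    table = np.asarray(table, dtype=float)
    k = table.shape[0]
    n = table.sum()

    # Observed cell proportions and chance-expected proportions from the marginals
    p_obs = table / n
    p_exp = np.outer(table.sum(axis=1), table.sum(axis=0)) / n ** 2

    # Disagreement weights: 0 on the diagonal, growing with distance between categories
    i, j = np.indices((k, k))
    if weights is None:
        w = (i != j).astype(float)
    elif weights == "linear":
        w = np.abs(i - j) / (k - 1)
    elif weights == "quadratic":
        w = ((i - j) / (k - 1)) ** 2
    else:
        raise ValueError("weights must be None, 'linear', or 'quadratic'")

    # In disagreement form, kappa = 1 - (weighted observed / weighted expected);
    # with 0/1 weights this reduces to (Po - Pe) / (1 - Pe).
    return 1.0 - (w * p_obs).sum() / (w * p_exp).sum()

# Two raters, 100 items, three ordered severity categories (illustrative counts)
table = [[40, 5, 1],
         [6, 30, 4],
         [1, 3, 10]]
print(round(cohens_kappa(table), 3))               # unweighted kappa
print(round(cohens_kappa(table, "quadratic"), 3))  # quadratic-weighted kappa
```

Expressing both forms through a disagreement-weight matrix keeps unweighted and weighted κ in one code path; with 0/1 weights the formula collapses to κ = (Po − Pe) ÷ (1 − Pe) from the key facts above.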
Historical background
Jacob Cohen introduced κ in 1960 to improve upon simple percent agreement, which ignores chance concordance. His paper in Educational and Psychological Measurement influenced psychometrics, psychology, and survey research. Subsequent extensions, such as Fleiss' kappa for multiple raters and quadratic weighting for ordinal data, broadened applicability across disciplines including radiology, pathology, and content moderation.
Concepts and interpretation
κ depends on category prevalence and marginal distributions: imbalanced categories can depress κ even with high percent agreement. Prevalence- and bias-adjusted κ variants address extreme skews, but analysts should also report confusion matrices and class-specific performance. Confidence intervals, obtainable via bootstrap or asymptotic methods, convey precision. Avoid rigid qualitative cutoffs; instead, interpret κ relative to the stakes of disagreement and measurement context.
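As one way to obtain such intervals, the sketch below bootstraps κ by resampling rated items with replacement. It reuses the illustrative cohens_kappa function from the definition section; the function and parameter names, and the sample labels, are assumptions for this example.

```python
import numpy as np

def bootstrap_kappa_ci(ratings_a, ratings_b, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for kappa from paired item labels."""
    rng = np.random.default_rng(seed)
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    categories = np.union1d(a, b)
    n_items = len(a)

    def kappa_from_pairs(x, y):
        # Cross-tabulate the paired labels over all observed categories
        table = np.zeros((len(categories), len(categories)))
        for u, v in zip(x, y):
            table[np.searchsorted(categories, u), np.searchsorted(categories, v)] += 1
        return cohens_kappa(table)  # helper sketched in the definition section

    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n_items, n_items)  # resample items with replacement
        estimates.append(kappa_from_pairs(a[idx], b[idx]))

    lower, upper = np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return kappa_from_pairs(a, b), (lower, upper)

# Illustrative paired severity labels for ten items
rater_a = ["mild", "mild", "severe", "none", "mild", "severe", "none", "mild", "none", "severe"]
rater_b = ["mild", "none", "severe", "none", "mild", "mild", "none", "mild", "none", "severe"]
print(bootstrap_kappa_ci(rater_a, rater_b, n_boot=500))
```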
Applications
- Clinical diagnostics. Radiologists and pathologists report κ when validating imaging criteria or biopsy grading systems.
- Manufacturing quality. Inspectors’ agreement on defect categories informs training and gauge repeatability studies.
- Machine learning labeling. Annotation teams monitor κ to track labeling consistency before model training.
- Safety audits. Environmental and occupational safety checklists rely on κ to validate auditor consistency across sites.
Importance and best practices
Clear documentation keeps κ meaningful. Report the contingency table, weighting scheme, prevalence of each category, and confidence intervals. Calibrate raters with training sets, periodic retraining, and drift monitoring to maintain κ over time. Combining κ with percent agreement, confusion matrices, and per-class precision–recall provides a holistic view of measurement reliability.
Because κ is dimensionless and sensitive to prevalence, it should be compared only across studies with similar category distributions. When κ declines, root-cause analysis—ambiguous labels, inadequate guidance, or instrument bias—helps teams restore agreement and data quality.
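For routine reporting, that combination of metrics can be scripted with standard tooling. The brief scikit-learn sketch below uses hypothetical annotator labels, with one rater arbitrarily treated as the reference for the per-class report.

```python
import numpy as np
from sklearn.metrics import classification_report, cohen_kappa_score, confusion_matrix

# Hypothetical paired labels from two annotators on the same items
rater_a = np.array(["ok", "ok", "defect", "ok", "defect", "ok",
                    "ok", "defect", "ok", "ok", "defect", "ok"])
rater_b = np.array(["ok", "defect", "defect", "ok", "defect", "ok",
                    "ok", "ok", "ok", "ok", "defect", "ok"])

print("kappa:", cohen_kappa_score(rater_a, rater_b))
print("percent agreement:", (rater_a == rater_b).mean())
print(confusion_matrix(rater_a, rater_b, labels=["ok", "defect"]))
print(classification_report(rater_a, rater_b, labels=["ok", "defect"]))
```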