Validity of two subjective skin tone scales and its implications on healthcare model fairness
Xu, J. et al. Algorithmic fairness in computational medicine. eBioMedicine 84, 104250 (2022).
Barrett, T., Chen, Q. & Zhang, A. Skin deep: investigating subjectivity in skin tone annotations for computer vision benchmark datasets. In Proc. 2023 ACM Conference on Fairness, Accountability and Transparency (FAccT '23) 1757–1771 (ACM, 2023). https://doi.org/10.1145/3593013.3594114.
National Institute of Standards and Technology (NIST). Face Recognition Vendor Test (FRVT): Part 3 – Demographic Effects. National Institute of Standards and Technology (2019).
Ibrahim, S. A. & Pronovost, P. J. Diagnostic errors, health disparities, and artificial intelligence: a combination for health or harm? JAMA Health Forum 2, e212430 (2021).
Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E. & Valley, T. S. Racial bias in pulse oximetry measurement. N. Engl. J. Med. 383, 2477–2478 (2020).
Adamson, A. S. & Smith, A. Machine learning and health care disparities in dermatology. JAMA Dermatol. 154, 1247 (2018).
Daneshjou, R., et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci. Adv. 8, eabq6147 (2022).
Weir, V. R., Dempsey, K., Gichoya, J. W., Rotemberg, V. & Wong, A. K. I. A survey of skin tone assessment in prospective research. NPJ Digit Med. 7, 191 (2024).
Wilkes, M., Wright, C. Y., du Plessis, J. L. & Reeder, A. Fitzpatrick skin type, individual typology angle, and melanin index in an African population: steps toward universally applicable skin photosensitivity assessments. JAMA Dermatol. 151, 902–903 (2015).
Kinyanjui, N. M. et al. Estimating skin tone and effects on classification performance in dermatology datasets. (2019).
Fitzpatrick, T. B. The validity and practicality of sun-reactive skin types I through VI. Arch. Dermatol. 124, 869 (1988).
D’Orazio, J., Jarrett, S., Amaro-Ortiz, A. & Scott, T. UV radiation and the skin. Int J. Mol. Sci. 14, 12222–12248 (2013).
Ward, W. H., Lambreton, F. et al. Clinical presentation and staging of melanoma. In Cutaneous Melanoma: Etiology and Therapy (eds Ward, W. H. & Farma, J. M.) 79–89 (Codon Publications, 2017).
Subedi, S. K. & Ganor, O. Considerations for the use of Fitzpatrick skin type in plastic surgery research. Plast. Reconstr. Surg. Glob. Open 12, e5866 (2024).
Heldreth, C. M., et al. Which skin tone measures are the most inclusive? An investigation of skin tone measures for artificial intelligence. ACM J. Respons. Comput. 1, 1–21 (2024).
Monk, E. The monk skin tone scale. 2023. https://doi.org/10.31235/osf.io/pdf4c.
Doshi, T. Improving skin tone representation across Google. Google. May 2022. https://blog.google/products/search/monk-skin-tone-scale/.
Centers for Medicare & Medicaid Services. CMS Cell Suppression Policy. U.S. Department of Health and Human Services; (2017).
Schumann, C. et al. Consensus and subjectivity of skin tone annotation for ML fairness. (2023).
Cobb, R. J., Thomas, C. S., Laster Pirtle, W. N. & Darity, W. A. Self-identified race, socially assigned skin tone, and adult physiological dysregulation: Assessing multiple dimensions of “race” in health disparities research. SSM – Popul. Health 2, 595–602 (2016).
Lu, Y. et al. Skin coloration is a culturally-specific cue for attractiveness, healthiness, and youthfulness in observers of Chinese and western European descent. PLoS ONE 16, e0259276 (2021).
Monk, E. P., Kaufman, J. & Montoya, Y. Skin tone and perceived discrimination: health and aging beyond the binary in NSHAP 2015. J. Gerontol. Ser. B 76, S313–S321 (2021).
Campbell, M. E., Keith, V. M., Gonlin, V. & Carter-Sowell, A. R. Is a picture worth a thousand words? An experiment comparing observer-based skin tone measures. Race Soc. Probl. 12, 266–278 (2020).
Kiritchenko, S. & Mohammad, S. M. Best-worst scaling more reliable than rating scales: a case study on sentiment intensity annotation. (2017).
Fasugba, O., Gardner, A. & Smyth, W. The Fitzpatrick Skin Type Scale: A reliability and validity study in women undergoing radiation therapy for breast cancer. J. Wound Care 23, 358–368 (2014).
Krishnapriya, K. S., King, M. C. & Bowyer, K. W. Analysis of manual and automated skin tone assignments for face recognition applications. (2021).
Jubran, A. & Tobin, M. J. Reliability of pulse oximetry in titrating supplemental oxygen therapy in ventilator-dependent patients. Chest 97, 1420–1425 (1990).
Fawzy, A., et al. Skin pigmentation and pulse oximeter accuracy in the intensive care unit: a pilot prospective study. Am. J. Respir. Crit. Care Med. 210, 355–358 (2024).
Foglia, E. E. et al. The effect of skin pigmentation on the accuracy of pulse oximetry in infants with hypoxemia. J. Pediatr. 182, 375–377.e2 (2017).
Wong, A. K. I., et al. Analysis of discrepancies between pulse oximetry and arterial oxygen saturation measurements by race and ethnicity and association with organ dysfunction and mortality. JAMA Netw. Open 4, e2131674 (2021).
Fawzy, A., et al. Racial and ethnic discrepancy in pulse oximetry and delayed identification of treatment eligibility among patients with COVID-19. JAMA Intern. Med. 182, 730–738 (2022).
Fawzy, A., et al. Clinical outcomes associated with overestimation of oxygen saturation by pulse oximetry in patients hospitalized with COVID-19. JAMA Netw. Open 6, e2330856 (2023).
Ferryman, K., et al. Adherence to FDA guidance on pulse oximetry testing among diverse individuals, 1996-2024. JAMA 333, 631–632 (2025).
U.S. Food and Drug Administration. Performance Evaluation of Pulse Oximeters Taking into Consideration Skin Pigmentation, Race and Ethnicity. U.S. Food and Drug Administration (FDA) (2024).
Heintz, T. A. et al. Preliminary development and validation of automated nociception recognition using computer vision in perioperative patients. Anesthesiology. (2025).
von Elm, E. et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology 18, 800–804 (2007). https://doi.org/10.1097/EDE.0b013e3181577654.
Deng, J. et al. RetinaFace: single-stage dense face localisation in the wild. (2019).
Hugenberg, K. & Wilson, J. P. Faces are central to social cognition. In The Oxford Handbook of Social Cognition. 167–193 (Oxford University Press, 2013).
Mbatha, S. K., Booysen, M. J. & Theart, R. P. Skin tone estimation under diverse lighting conditions. J. Imaging 10, 109 (2024).
Lewis, C., Cohen, P. R., Bahl, D., Levine, E. M. & Khaliq, W. Race and ethnic categories: a brief review of global terms and nomenclature. Cureus 15, e41253 (2023).
Kapania, S., Taylor, A. S. & Wang, D. A hunt for the Snark: annotator diversity in data practices. In Proc. 2023 CHI Conference on Human Factors in Computing Systems 1–15 (ACM, 2023). https://doi.org/10.1145/3544548.3580645.
Likert, R. A technique for the measurement of attitudes. Arch. Psychol. 22, 55 (1932).
Python Language Reference.
Cronbach, L. J. Coefficient alpha and the internal structure of tests. Psychometrika 16, 297–334 (1951).
Shrout, P. E. & Fleiss, J. L. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86, 420–428 (1979).
Kendall, M. G. A new measure of rank correlation. Biometrika 30, 81–93 (1938).
Krippendorff, K. Computing Krippendorff’s Alpha-Reliability. (2011).
Cohen, J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213–220 (1968).
Ross, A. & Willson, V. L. Paired samples T-test. in Basic and Advanced Statistical Tests (SensePublishers, 2017) 17–19.
Zar, J. H. Spearman rank correlation: overview. in Wiley StatsRef: Statistics Reference Online 1st edn (eds Kenett, R. S., Longford, N. T., Piegorsch, W. W., Ruggeri, F.) (Wiley, 2014) https://doi.org/10.1002/9781118445112.stat05964.
Bland, J. M. & Altman, D. G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet Lond. Engl. 1, 307–310 (1986).
