Introduction
When people talk about fairness in dermatology AI, the conversation usually starts with model performance: whether a model works worse on darker skin than on lighter skin. But those comparisons rest on an earlier judgment that is easier to miss: before a model can underperform on a group, someone has to decide which images belong in that group at all.
In dermatology AI, that decision often runs through the Fitzpatrick scale. The scale is widely used to sort images by skin type, but it was designed for something narrower and more clinical: a way of describing how skin responds to sun exposure. That mismatch matters because if two reasonable labeling pipelines sort the same image differently, the fairness groups in a paper can shift before the model ever makes a prediction.
What the Fitzpatrick Scale Measures
The Fitzpatrick scale is a six-step clinical scale, from Type I to Type VI, built around how skin reacts to sun exposure rather than how it looks in a photo. At one end, Type I skin almost always burns; at the other, Type VI skin almost never does, with the remaining types falling somewhere in between.
That makes the scale useful when sun sensitivity matters, whether the question is phototherapy dosing, burn risk, or tanning response. It is much less suited to estimating skin tone from a single photograph.
The gap between those two tasks is larger than it first appears. In clinic, a dermatologist can ask about burning and tanning, look at the skin under consistent lighting, and use the rest of the visit as context. A photo annotator has none of that, only a single image taken under whatever lighting happened to be in the room, with healthy skin, inflamed skin, shadow, and camera color balance all mixed together. That is a different kind of judgment from the one the scale was designed to support.
The Fitzpatrick scale remains the most common skin-type label in dermatology AI papers, which means a clinical concept is regularly being asked to do a computer-vision job.
Why Small Disagreements Matter
At first glance, a one-step disagreement on a six-point scale does not sound especially serious. If one annotator says Type III and another says Type IV, that can seem close enough. The problem shows up later, when those labels get used downstream.
Most papers do not keep the Fitzpatrick types as six separate bins. They collapse them into broader groups and report model performance for each group, often using I–II for “light,” III–IV for “medium,” and V–VI for “dark.”
Once the labels are reduced that way, a one-step disagreement can move an image across a group boundary. A Type II that becomes a Type III no longer belongs to the same subgroup, so the question is not only whether two annotators picked the same exact number. It is whether they placed the image in the same fairness bucket, because that decision shapes the conclusions a paper can draw.
The Public Labels I Compared
To put some numbers on this, I used the public annotation CSV from the Fitzpatrick17k dataset (Groh et al. 2021).

The CSV has two consensus columns: fitzpatrick_scale and fitzpatrick_centaur. I read those as the outputs of the Scale AI and Centaur Labs consensus labeling pipelines, respectively. I did not treat either one as ground truth, only as two public attempts to assign image-based Fitzpatrick labels at scale.
After dropping rows where either column was unknown, I was left with 15,230 comparable images. I wanted to know how often the two columns agreed exactly and how often an image stayed in the same broad subgroup once the six types were collapsed into the bins fairness analyses usually use.
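The filtering step itself is simple. Here is a minimal sketch using pandas; the column names follow the public CSV, -1 is the dataset's code for an unknown label, and the toy frame below stands in for the real file, which yields 15,230 comparable rows.

```python
import pandas as pd

def comparable_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows where both consensus columns carry a known label (-1 = unknown)."""
    mask = (df["fitzpatrick_scale"] != -1) & (df["fitzpatrick_centaur"] != -1)
    return df.loc[mask]

# Tiny illustrative frame; the real CSV gives 15,230 comparable images.
toy = pd.DataFrame({
    "fitzpatrick_scale":   [2, -1, 4, 6],
    "fitzpatrick_centaur": [3,  1, -1, 6],
})
print(len(comparable_rows(toy)))
```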
Agreement Between the Two Pipelines
Across those images, the two columns matched exactly on 47.89% of cases.
On its own, that sounds alarming. But a six-point scale is fairly granular, and most of the disagreements are small. If you allow a one-step margin, agreement jumps to 91.04%. Allow two steps, and it reaches 98.44%. In most cases, the two pipelines land on the same type or a neighboring one.
| Metric | Value |
|---|---|
| Exact agreement between the two public columns | 47.89% |
| Agreement within 1 Fitzpatrick step | 91.04% |
| Agreement within 2 Fitzpatrick steps | 98.44% |
| Unweighted Cohen's kappa | 0.351 |
| Quadratic-weighted Cohen's kappa | 0.786 |
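The tolerance-based agreement numbers in the table reduce to a few lines of numpy. This is a sketch, not the exact analysis code; the toy labels below stand in for the two consensus columns.

```python
import numpy as np

def agreement_within(a, b, k):
    """Share of pairs whose two labels differ by at most k Fitzpatrick steps."""
    diffs = np.abs(np.asarray(a) - np.asarray(b))
    return float(np.mean(diffs <= k))

# Toy labels; on the real 15,230 rows the same calculation gives
# 47.89% (k=0), 91.04% (k=1), and 98.44% (k=2).
scale   = [1, 2, 3, 4, 5, 6, 2, 3]
centaur = [1, 3, 3, 5, 3, 6, 2, 4]
print([agreement_within(scale, centaur, k) for k in (0, 1, 2)])
```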
Seen together, those numbers change the picture. Exact match stays below 50%, but once near-misses count, the two columns look much closer. That is what you would expect from an ordinal scale like this. If two people had to sort thousands of photos onto a six-step ladder, a lot of disagreements would land one rung apart even when they were broadly seeing the same thing.
The Cohen’s kappa values in the table make the same point in a more formal way. Unweighted kappa (0.351) treats every disagreement equally, so a III-vs-IV miss counts just as much as a I-vs-VI miss. Quadratic-weighted kappa (0.786) gives partial credit for near-misses, which fits an ordered scale much better. If exact matching is the standard, the two columns look shaky. If the question is whether they usually land close together, they look much more consistent. Both descriptions are true to the data.
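The contrast between the two kappas can be made concrete with a small self-contained implementation. The worked example below uses three categories rather than six to keep the arithmetic checkable by hand: every disagreement is exactly one step, so quadratic weighting grants partial credit and the weighted kappa comes out higher.

```python
import numpy as np

def cohens_kappa(a, b, n_cat, weights=None):
    """Cohen's kappa for label vectors with categories 1..n_cat.
    weights="quadratic" penalizes a miss by its squared distance,
    so near-misses cost far less than distant ones."""
    a = np.asarray(a) - 1
    b = np.asarray(b) - 1
    obs = np.zeros((n_cat, n_cat))
    for i, j in zip(a, b):
        obs[i, j] += 1
    obs /= obs.sum()
    # Chance-level agreement implied by the two marginal distributions.
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    idx = np.arange(n_cat)
    if weights == "quadratic":
        w = (idx[:, None] - idx[None, :]) ** 2
    else:  # unweighted: every disagreement costs the same
        w = (idx[:, None] != idx[None, :]).astype(float)
    return 1.0 - (w * obs).sum() / (w * exp).sum()

# Every disagreement here is a one-step near-miss.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 2, 3, 3, 3]
print(cohens_kappa(x, y, 3))                       # unweighted: 0.5
print(cohens_kappa(x, y, 3, weights="quadratic"))  # weighted: 0.75
```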
Where the Disagreements Fall
To see where the two pipelines agree and disagree in more detail, it helps to look at the full confusion matrix. Each cell represents a pair of labels: one pipeline assigned the row type, the other assigned the column type. The darker the cell, the more images fell into that pairing.
Most counts sit on the diagonal, where the two pipelines agree exactly, or just beside it, where they are one step apart. That lines up with the summary numbers above: when the pipelines disagree, they usually do so by a small amount.
Where those near-misses fall matters as much as how many there are. The boundary between Types II and III carries a lot of weight on both sides, and the same is true between IV and V. On paper, these are one-step disagreements. In a fairness table, they can move an image into a different subgroup altogether.
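The cross-tabulation behind a matrix like this is a one-liner in pandas. The frame below uses toy labels in place of the two public consensus columns, purely to show the shape of the computation.

```python
import pandas as pd

# Toy stand-in for the two consensus columns.
toy = pd.DataFrame({
    "fitzpatrick_scale":   [2, 2, 3, 4, 5, 5],
    "fitzpatrick_centaur": [2, 3, 3, 5, 5, 4],
})
cm = pd.crosstab(toy["fitzpatrick_scale"], toy["fitzpatrick_centaur"])
# Diagonal cells are exact agreements; off-diagonal cells are misses.
on_diag = sum(cm.loc[t, t] for t in cm.index if t in cm.columns)
print(cm)
print("exact matches:", on_diag)
```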
What Changes When the Labels Are Collapsed Into Buckets
When I collapsed the six Fitzpatrick types into the three broader groups that fairness analyses usually use, 76.8% of images stayed in the same group regardless of which pipeline I used. The remaining 23.2%, or 3,533 images, moved to a different subgroup.
That is a striking amount of movement. If you were reading a paper that said, “our model achieved 92% accuracy on light skin and 84% accuracy on dark skin,” the eight-point gap would look clean and interpretable. But if almost a quarter of the images in those groups could have landed in a different bucket under another reasonable labeling pipeline, some of that gap may reflect labeling choices as much as model behavior.
That does not make the labels useless, and it does not mean the dataset is broken. It means subgroup membership can shift at scale before the model ever makes a prediction.
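The collapse itself is a fixed mapping, which makes it easy to see why one-step disagreements are not all equal: II-vs-III and IV-vs-V cross a bucket boundary, while III-vs-IV does not. A minimal sketch:

```python
def to_bucket(fst: int) -> str:
    """Collapse Fitzpatrick I-VI into the coarse groups fairness tables use."""
    return {1: "light", 2: "light", 3: "medium", 4: "medium",
            5: "dark", 6: "dark"}[fst]

# Four one-step disagreements; only those straddling a bucket
# boundary (II/III and IV/V) change the subgroup.
pairs = [(2, 3), (3, 4), (5, 6), (4, 5)]
moved = sum(to_bucket(a) != to_bucket(b) for a, b in pairs)
print(moved, "of", len(pairs), "one-step disagreements switch buckets")
```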
How the Group Distribution Shifts
The bucket-switching from the previous section does not simply wash out as random noise. Side by side, the two pipelines produce noticeably different distributions across the three groups.
In this subset, the Scale AI consensus column comes out to about 48.2% light, 37.9% medium, and 13.9% dark. The Centaur Labs consensus column comes out to about 56.3% light, 31.8% medium, and 11.9% dark. Neither distribution is obviously “wrong.” They are both outputs of reasonable human labeling processes. Even so, the fairness analysis built on top of them changes. The subgroup sizes change, the denominators in the accuracy table change, and the number of rare cases inside each group changes with them.
The uncertainty is therefore not confined to any single image. It changes the shape of the dataset itself.
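Comparing the two distributions is a matter of mapping each column through the bucket table and normalizing the counts. The toy frame below illustrates the computation; on the real data it produces the 48.2/37.9/13.9 versus 56.3/31.8/11.9 splits above.

```python
import pandas as pd

BUCKET = {1: "light", 2: "light", 3: "medium",
          4: "medium", 5: "dark", 6: "dark"}

# Toy stand-in for the two consensus columns.
toy = pd.DataFrame({
    "fitzpatrick_scale":   [1, 2, 3, 3, 5, 6],
    "fitzpatrick_centaur": [2, 3, 3, 4, 4, 6],
})
for col in ("fitzpatrick_scale", "fitzpatrick_centaur"):
    shares = toy[col].map(BUCKET).value_counts(normalize=True)
    print(col, shares.round(3).to_dict())
```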
How the Labels Were Produced
The original Fitzpatrick17k paper describes the annotation process in some detail. Images were labeled by human annotators through Scale AI, with each image reviewed by two to five annotators using a dynamic consensus process. Annotator quality was benchmarked against a 312-image gold-standard set labeled by a board-certified dermatologist.
The 2022 follow-up paper (Groh et al. 2022) went further and compared several annotation methods head to head: three board-certified dermatologists, crowd-based protocols through Scale AI, crowd-based protocols through Centaur Labs, and an algorithmic method called ITA-FST.
Taken together, those results change how disagreement in the public CSV should be read. Any two board-certified dermatologists matched exactly on only 50–55% of a 320-image benchmark, but matched within one category on 92–94%. The crowd pipelines performed in a similar range. The ITA-FST algorithm did noticeably worse.
These figures do not point to obviously sloppy labeling. They sit close to what you would expect even when board-certified dermatologists are doing the work. Image-based Fitzpatrick labeling is a subjective task, and the answer changes with the method even when the people doing the labeling are experts.
How to Read Fairness Claims Built on These Labels
I don’t think the takeaway is to stop reporting subgroup performance. If anything, the field needs more of it. A 2021 scoping review in JAMA Dermatology found that only 10% of dermatology AI studies reported skin-tone information in at least one dataset (Daneshjou et al. 2021). A 2024 review in the International Journal of Dermatology highlighted ongoing underrepresentation challenges for skin of color (Fliorent et al. 2024). And a 2022 paper on the Diverse Dermatology Images dataset found that several dermatology AI systems performed worse on dark skin tones and uncommon diseases (Daneshjou et al. 2022). Those are real problems, and subgroup reporting is how we notice them.
But when a paper makes fairness claims by skin tone, it should also be clear about the measurement. What skin-type variable was used? Was it based on metadata, expert review, crowd consensus, or an algorithm? How many annotators were involved, and how were disagreements resolved? How many images were too ambiguous to classify? Was inter-annotator agreement reported, or only the final label?
Without that context, a fairness table can look more definitive than it really is. Treating image-based Fitzpatrick labels as clean ground truth makes the results feel firmer than they are. Skin-tone measurement is part of the evaluation problem, not a solved preprocessing step.
Method Note
The analysis in this post used the public fitzpatrick17k.csv and compared the two exposed consensus columns, fitzpatrick_scale and fitzpatrick_centaur, on rows where both values were not -1. I read those columns as the Scale AI and Centaur Labs consensus labels, then collapsed Fitzpatrick I–II, III–IV, and V–VI into light, medium, and dark subgroup buckets to estimate how often images would move between the coarse groups fairness analyses commonly rely on. This is not a validation study, and I am not treating one column as the real answer. The code, calculations, and figure-generation scripts for the post are in a public repo here: huntercolson1/fitzpatrick17k-label-methods.