Last year, I published a blog post on figure caption color indicators. The positive feedback I received on it from a number of individuals prompted me to revisit the subject. At the time, I did not have a good way of locating published examples of such caption indicators and was only able to find a few with shape indicators and none with color indicators. When thinking about revisiting the subject, I had the epiphany that although searching for such indicators in the published literature is next to impossible, searching the LaTeX source markup of papers is not. Since arXiv provides bulk access to the TeX source markup for its pre-prints, it offered a large corpus of manuscripts to search through. After finding examples in pre-prints, I was then able to check whether the indicators survived the publication process, and thereby located well over one hundred examples of color line or shape indicators in the figure captions of published academic papers.
I broke the process into four steps: acquiring the data, extracting LaTeX commands from caption environments, finding potential figure caption candidates, and verifying these candidates. The arXiv source archive is well over 1 TB in size and is provided in an AWS S3 bucket configured so that the requester pays for bandwidth; downloading it directly would have resulted in a bandwidth bill of more than $100. I was only interested in the TeX source and not the figures, which account for most of the total file size, and AWS does not charge for transfers between S3 buckets and EC2 instances in the same region. I therefore first ran a script on an EC2 instance to download from arXiv’s S3 bucket, extract just the TeX source files, and repackage them. This greatly reduced the amount of data transfer required and allowed me to download the full TeX source corpus for less than $5.
Next, I used the TexSoup Python package to process the TeX files and produce a list of the LaTeX commands used in caption environments. A final script then searched for papers whose captions used command names that referenced colors or shapes, compiled a list of likely candidates, and produced an HTML file for each year containing, for each candidate paper, a link to the PDF and the full TeX source of the identified caption with the matching commands highlighted.
Finally, I manually verified the papers using these HTML files. Except for trivial false positives, which could be identified by looking at the included caption source, I looked at the PDF for each candidate paper, verified that it included a visual caption indicator, and classified the indicator if it had one. For papers that included indicators, I then attempted to locate the published version of record and did the same for it.
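As a rough illustration of the extraction and candidate-search steps, here is a minimal sketch. It is not the script used for this analysis: it swaps TexSoup for a plain regular expression with brace matching so that it runs with only the Python standard library, and the keyword list and all names in it (`KEYWORDS`, `caption_bodies`, `caption_commands`, `is_candidate`) are hypothetical, chosen for the example.

```python
import re

# Hypothetical keyword list for illustration; not the terms used in the actual search.
KEYWORDS = ("color", "colour", "square", "circle", "triangle", "diamond")

# Matches \caption{, \caption*{ and \caption[short]{ but not \captionsetup{.
CAPTION_RE = re.compile(r"\\caption\*?\s*(?:\[[^\]]*\])?\s*\{")
# Matches control words such as \textcolor and captures the name.
COMMAND_RE = re.compile(r"\\([A-Za-z]+)")


def caption_bodies(tex):
    """Yield the brace-balanced argument of each caption command in tex."""
    for match in CAPTION_RE.finditer(tex):
        depth = 1
        i = start = match.end()
        while i < len(tex) and depth > 0:
            if tex[i] == "\\":
                i += 2  # skip the escaped character, e.g. \{ or \}
                continue
            if tex[i] == "{":
                depth += 1
            elif tex[i] == "}":
                depth -= 1
            i += 1
        if depth == 0:  # ignore captions whose braces never close
            yield tex[start:i - 1]


def caption_commands(tex):
    """Return one list of command names per caption found in tex."""
    return [COMMAND_RE.findall(body) for body in caption_bodies(tex)]


def is_candidate(commands):
    """True if any command name contains a color or shape keyword."""
    return any(key in name.lower() for name in commands for key in KEYWORDS)


sample = r"""
\begin{figure}
\includegraphics{plot.pdf}
\caption{Measured (\textcolor{red}{\rule{1em}{1pt}}) and simulated
(\textcolor{blue}{$\blacksquare$}) response.}
\end{figure}
"""

for commands in caption_commands(sample):
    print(commands, is_candidate(commands))
```

A name-based match like this is deliberately loose, which is why a manual verification pass is still needed afterwards: a command such as `\textcolor` flags a caption whether it colors a line sample or a whole sentence.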
Through this process, my scripts located ~5100 paper candidates from the beginning of arXiv in 1992 through the end of June 2020. I manually verified the candidates for papers submitted prior to the end of 2016; these accounted for ~2000 candidates, of which I verified ~1100 papers to have some sort of visual caption indicator. For ~700 of these, I was able to verify the presence of some form of visual caption indicator in the published version of record. Of these, ~60% included a black shape or line indicator, ~25% included a color shape or line indicator, and the remainder included colored text. The fraction of papers with color shape or line indicators was higher in the pre-prints, since it was not uncommon for the published version to include a black indicator when the pre-print included a colored indicator. I stopped at the end of 2016 since the verification process was quite time-consuming, and I could only look at so many papers before giving up.
These findings show that the idea of using figure caption color indicators is by no means new. However, it’s still quite rare in relative terms, since at most a couple thousand of arXiv’s ~1.7 million pre-prints include such indicators. Most of the examples I found used a colored shape (■) or line (—) in parentheses, or both in cases where both a line and marker were used. My proposal to use a colored underline does still appear to have been a novel concept, but it proved quite complicated to implement, so using shapes or lines in parentheses is much more practical, since it is simpler and is evidently compatible with many publishers’ workflows. Furthermore, the existing examples can be used as evidence when complaining about paper proofs, after the typesetter predictably removes the indicators, to show that the indicators are possible and that they can and should be included in the final published version of the paper.
One color indicator that I recommend against using is colored text, since it can be difficult to read and often violates WCAG contrast guidelines. Its use seems particularly common in the computer vision literature and, to a lesser degree, the machine learning literature. It is often used to highlight table entries, a purpose much better served by using italic, bold, or bold–italic text.
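To make the contrast concern concrete, the WCAG 2.x check can be computed directly. The sketch below follows the WCAG definitions of relative luminance and contrast ratio; the function names and example colors are chosen for illustration, and 4.5:1 is the level AA minimum for normal-size text.

```python
def _linearize(channel):
    """Convert an 8-bit sRGB channel value to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio between two colors, from 1 (identical) up to 21."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


WHITE = (255, 255, 255)
AA_NORMAL_TEXT = 4.5  # WCAG level AA minimum for normal-size text

# Example text colors on a white page; the choice of colors is illustrative only.
for name, rgb in {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}.items():
    ratio = contrast_ratio(rgb, WHITE)
    print(f"{name}: {ratio:.2f}:1, passes AA: {ratio >= AA_NORMAL_TEXT}")
```

Pure red text on white comes out at roughly 4.0:1, just under the AA minimum, and pure green fares much worse, whereas pure blue passes comfortably.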
I have made the scripts used for this analysis, the paper candidates, and the final verified results available. The final verified results are also available separately for easy viewing. Note that the verified results are incomplete and may contain errors.