Since I launched my color cycle survey in December, it has collected ~9.7k responses across ~800 user sessions. Although the responses are not as numerous as I’d like, there’s currently enough data for preliminary analysis. The data are split between sets of six, eight, and ten colors with ratios of approximately 2:2:1; there are fewer ten-color color set responses as I disabled that portion of the survey months ago, to more quickly record six- and eight-color color set responses. So far, I’ve focused on analyzing the set ranking of the six-color color sets, for which there are ~4k responses, using artificial neural networks. The gist of the problem is to use the survey’s pair-wise responses to train a neural network such that it can rank 10k previously-generated color sets; these colors sets each have a minimum perceptual distance between colors, both with and without color vision deficiency simulations applied.
As inputs with identical structure are being compared, a network architecture that is invariant to input order, i.e., one that produces identical output for inputs (A, B) and (B, A), is desirable. Conjoined neural networks1 satisfy this property; they consist of two identical neural networks with shared weights, the outputs of which are combined to produce a single result. In this case, each network takes a single color set as input and produces a single scalar output, a “score” for the input color set. The two scores are then compared, with the better scoring color set of the input pair chosen as the preferred set; put more concretely, the difference of the two scores is computed and used to calculate binary cross-entropy during network training. The architecture of the network appears in the figure below and contains 2077 trainable parameters.
Each color set consists of six colors, which are each encoded in the perceptually-uniform CAM02-UCS colorspace, with J encoding the lightness and a and b encoding the chromacity. The first two layers of the network are used to fit an optimal encoding to each of the color inputs; this is achieved by using a pair of three-neuron fully-connected layers for each of the six colors, with network weights shared between each sub-layer. The outputs of these color-encoding layers are then concatenated and fed to two more fully-connected layers, consisting of thirty-six neurons each. A final fully-connected layer consisting of single neuron is then use to produce a single scalar output. The entire network is then duplicated for the second color set being compared, and the difference between the two outputs is computed. Exponential linear unit (ELU) activation functions are used on the interior layers, and a sigmoid activation function is used on the final layer of each network.
The colors in each color set are ordered by hue, then chromacity, then lightness. This is a sensible ordering, but since hue is cyclic, the starting color is fairly arbitrary. Thus, before training the network, the data are augmented by performing cyclic shifts on the ordering of the six colors in each set. As this augmentation is performed on each of the two color sets in each survey response pair, the total training and test data set sizes are augmented by a factor of thirty-six. Prior to data augmentation, the survey response data are split, with 80% used as the training set and 20% used as the test set. In order to reduce overfitting, Gaussian dropout is used on both of the 36-neuron layers, with a rate of 0.4; L2 kernel regularizers are used on all layers, with a penalty of 0.001. The network was implemented using Keras, with the TensorFlow backend, and trained using binary-crossentropy and the Nesterov Adam optimizer, using default optimizer parameters.
Unfortunately, training this network proved to be problematic, with it often converging into a local minimum with a loss of 0.6931 ≈ ln(0.5); the network was learning to ignore the inputs and always produce the same output, resulting in an output of zero from the conjoined network. Previous work with conjoined networks did not run into this problem, since either higher dimensionality output was used to compute a similarity metric2 or non-binary training data were used.3 To resolve this issue, the output comparison was removed as well as the last fully-connected layer of each network; this was replaced with a single-neuron fully-connected layer with sigmoid activation, joining the two existing networks into a single network with a single output. As this is no longer a conjoined architecture but instead a single network, the input order matters, so the data were additionally augmented such that both ordering of each survey response pair would be used, doubling the number of training and test pairs.
With this change, the network could be successfully trained. However, this new network only worked with pair-wise data, which was troublesome. The 10k color sets to be ranked can be paired close to fifty million ways, which grows to more than three billion inputs to evaluate once the data augmentation is applied. The conjoined network, however, requires only 60k evaluations for the ranking, since a single instance of the network, without the output comparison, can be used to directly score a given color set. Thus, a hybrid approach was devised. The single-output non-conjoined network was first trained for fifty epochs. Its last layer was then removed, and the change to the original conjoined network was undone, but the existing training weights were kept. This partially pre-trained conjoined network was then trained for an additional fifty epochs. Due to the pre-training, the conjoined network no longer became stuck in the local minimum, allowing the advantages of the conjoined network to be reaped, while avoiding the training dilemma.
Since the training data only very sparsely cover the space of possible pairing and since the network does not always training consistently well, I decided it was best to train an ensemble of model instances. To this end, I trained the model 100 times, chose the best fifty instances as determined by the metric
training accuracy + test accuracy - abs(training accuracy - test accuracy), calculated scores for each of the 10k color sets using these fifty trained model instances, and averaged the resulting scores for each color set. For both the training and test sets, the average accuracy was 58%. While considerably better than guessing randomly, it does seem a bit low at first glance. However, many of the color sets are similar and aesthetic preference is subjective, so perfect accuracy isn’t possible. To approximate an upper limit on achievable accuracy, I created a modified version of the color cycle survey that always presents the same six-color color sets in the same order and then entered 100 responses each of two consecutive days; 83 / 100 of my answers were consistent for the color set preference between the two days. Thus, I think 80% is a conservative upper limit on possible accuracy; including aesthetic preference differences between individuals, I think ~70% is a more practical upper limit for achievable accuracy.
A few variants of the network were evaluated, such as increasing or decreasing the number of layers or the size of the layers, as well as changing the activation functions. Adding additional layers or increasing the size of the existing layers did not appear to have an effect on the accuracy; removing one each of the color encoding and set encoding layers only led to at most a marginal decrease in accuracy. Using rectified linear unit (ReLU) activations on the interior layers led to marginally decreased accuracy. Adjusting the Gaussian dropout rate by 0.1 or 0.2 had little effect, and Gaussian dropout seems to work slightly better than standard dropout. Originally, a hue-chromacity-luminance representation was used for the color inputs, as is used to sort the input color order, but this had noticeably decreased accuracy; I suspect that the cyclic nature of hue values was the source of this reduced accuracy.
In addition to making the results more stable, this ensemble also allows for estimating the uncertainty between training runs; the plot below shows the average color set scores as a function of rank, with a 1-sigma error band.
This shows that according to the model, that while the best color sets are definitely better than the worst color sets, color sets that are close in ranking are not necessarily any better or worse than the hundreds of color sets with similar rankings. Given the sparsity of the input data, this result is not surprising. The results can also be evaluated qualitatively; the figure below shows the fifteen lowest ranked color sets on the left and the fifteen highest ranked color sets on the right.
To my eye, the best color sets definitely look better than the worst color sets. The worst sets appear to be darker, more saturated, and generally a bit garish; note that the lightness and color distance limits applied when the color sets were generated excluded the vast majority of truly awful color sets for this evaluation. I find the highest-ranked color set, as well as many of the other highly-ranked color sets, to be quite pleasant; some of the other highly-ranked color sets contain
blueish purplish colors that I find to be a bit over-saturated, so there’s definitely still room for improvement.
I hope that this post convincingly shows the validity of the data-driven premise on which the color cycle survey is based. It was certainly a relief to me when I was first able to get test accuracy results consistently above 50%, since it meant there wasn’t an egregious mistake in the survey code; seeing consistent color set rankings between training runs gave further relief, since it showed that the concept was working as I had hoped. Moving forward, I plan to next consider color cycle ordering for the six-color color sets. The initial plan is to use the same network architecture but to train it with the color cycle ordering responses (three pairs per response); the trained network could then be used to determine an optimal ordering by ranking the 720 possible six-color cycle orderings for a given color set and choosing the highest-ranked ordering. Once I have a workable cycle ordering analysis technique, I’ll apply both the set choice and cycle ordering analyses to the eight-color color set data, which will hopefully be straightforward.
Another interesting avenue to pursue would be to try to create a single network that can handle various sized color cycles, as this would allow all of the survey results to be used at once and would allow the results to be generalized beyond the number of colors used in the survey; however, I’m not yet sure how to approach this. An additional thought is to devise a metric that combines the network-derived score with some sort of color-nameability criterion, probably derived from the xkcd color survey, and use that to rank the color sets, favoring colors that can more easily be named, instead of just using the network-derived score directly. As I mentioned at the beginning of this post, I’d really like more data with which to improve the analysis; with increased confidence from these preliminary results, I’ll try to further promote the color cycle survey.
If you haven’t yet taken the color cycle survey (or even if you have), please consider taking it: https://colorcyclesurvey.mpetroff.net/
Bromley, Jane, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. “Signature verification using a ‘Siamese’ time delay neural network.” In Advances in neural information processing systems, pp. 737-744. 1994. ↩
Koch, Gregory, Richard Zemel, and Ruslan Salakhutdinov. “Siamese neural networks for one-shot image recognition.” In ICML deep learning workshop, vol. 2. 2015. ↩
Burges, Christopher, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. “Learning to rank using gradient descent.” In Proceedings of the 22nd International Conference on Machine learning (ICML-05), pp. 89-96. 2005. doi:10.1145/1102351.1102363 ↩