The intuition, as I understand it, is that we want to be "almost sure" that A is better than B. A performs almost always better than its lower bound (assuming that lower is better). This lower bound is better than B's upper bound. B performs almost always worse than its upper bound. A > lb(A) > ub(B) > B. The first and last inequality are probabilistic statements, the one in the middle is a comparison of two numbers. That said, I am not sure that's how I would go about it or exactly what the statistical meaning of it is. I think I would directly apply a two-sample test like the Mann Whitney U test. Let' say that A and B are variants of the same algorithm or for whatever other reason their performance on any input is highly correlated, but A is almost always slightly better. Then you may find their difference to be significantly positive (+1). But since their performance is very variable (CIs are, say [10, 20] and [11, 21] resp.), you wouldn't be able to come to the same conclusion by the comparison method you gave, because 95% CI intervals for both algorithms overlap. So if the only thing you have available are the two confidence intervals, sure, but if you have a sample of the performance of both algorithms, I ...