Written by: Stephen Hsu
Primary Source: Information Processing
The paper below is one of the best I’ve seen on university rankings. Yes, there is a single factor one might characterize as “university quality” that correlates across multiple measures. As I have long suspected, the THE (Times Higher Education) and QS rankings, which are partially survey/reputation based, are biased in favor of UK and Commonwealth universities. There are broad quality bands in which many schools are more or less indistinguishable.
The figure above is from the paper, and the error bars displayed (an advanced concept!) show 95% confidence intervals.
Sadly, many university administrators will not understand the methodology or conclusions of this paper.
This paper uses a Bayesian hierarchical latent trait model, and data from eight different university ranking systems, to measure university quality. There are five contributions. First, I find that ratings tap a unidimensional, underlying trait of university quality. Second, by combining information from different systems, I obtain more accurate ratings than are currently available from any single source. And rather than dropping institutions that receive only a few ratings, the model simply uses whatever information is available. Third, while most ratings focus on point estimates and their attendant ranks, I focus on the uncertainty in quality estimates, showing that the difference between universities ranked 50th and 100th, and 100th and 250th, is insignificant. Finally, by measuring the accuracy of each ranking system, as well as the degree of bias toward universities in particular countries, I am able to rank the rankings.
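To make the "uses whatever information is available" point concrete, here is a minimal sketch of precision-weighted pooling for a single university. This is not the paper's hierarchical model, which estimates each system's accuracy and bias from the data. The system names A, B, and C, the noise levels, the ratings, and the standard normal prior are all made-up assumptions for illustration.

```python
import math

def pool_ratings(ratings, noise_sd, prior_sd=1.0):
    """Combine noisy, standardized ratings of one university.

    ratings:  dict of system name -> rating, or None if the system did not rate it
    noise_sd: dict of system name -> assumed measurement noise (standard deviation)
    Returns (posterior mean, posterior sd) under a normal prior with mean 0.
    """
    precision = 1.0 / prior_sd ** 2   # start from the prior's precision
    weighted_sum = 0.0                # prior mean is 0, so it adds nothing here
    for system, value in ratings.items():
        if value is None:
            continue                  # unrated: contributes no information
        weight = 1.0 / noise_sd[system] ** 2
        precision += weight
        weighted_sum += weight * value
    return weighted_sum / precision, math.sqrt(1.0 / precision)

def interval_95(mean, sd):
    """Approximate 95% interval for a normal posterior."""
    return mean - 1.96 * sd, mean + 1.96 * sd

# Hypothetical systems and numbers, not taken from the paper.
noise_sd = {"A": 0.4, "B": 0.5, "C": 0.8}
fully_rated = {"A": 1.2, "B": 1.0, "C": 1.5}
sparsely_rated = {"A": 1.2, "B": None, "C": None}

full_mean, full_sd = pool_ratings(fully_rated, noise_sd)
sparse_mean, sparse_sd = pool_ratings(sparsely_rated, noise_sd)
print("fully rated:   ", interval_95(full_mean, full_sd))
print("sparsely rated:", interval_95(sparse_mean, sparse_sd))
```

Under these assumed numbers, the sparsely rated school still gets an estimate, but with a wider interval, which is the qualitative behavior the abstract describes.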
From the paper:
… The USN-GU, Jeddah, and Shanghai rating systems are the most accurate, with R² statistics in excess of 0.80.
… Plotting the six eigenvalues from the … global ratings correlation matrix … the observed data is strongly unidimensional: the first eigenvalue is substantially larger than the others …
… This paper describes an attempt to improve existing estimates of university quality by building a Bayesian hierarchical latent trait model and inputting data from eight rankings. There are five main findings. First, despite their different sources of information, ranging from objective indicators, such as citation counts, to subjective reputation surveys, existing rating systems clearly tap a unidimensional latent variable of university quality. Second, the model combines information from multiple rankings, producing estimates of quality that offer more accurate ratings than can be obtained from any single ranking system. Universities that are not rated by one or more rating systems present no problem for the model: they simply receive more uncertain estimates of quality. Third, I find considerable error in measurement: the ratings of universities ranked around 100th position are difficult to distinguish from those ranked close to 30th; similarly for those ranked at 100th and those at 250th. Fourth, each rating system performs at least adequately in measuring university quality. Surprisingly, the national ranking systems are the least accurate, which may be due to their usage of numerous indicators, some extraneous. Finally, three of the six international ranking systems show bias toward the universities in their home country. The two unbiased global rankings, from the Center for World University Rankings in Jeddah, and US News & World Report are also the two most accurate.
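The eigenvalue check mentioned in the excerpts above is easy to try on toy data. The sketch below simulates six rating systems driven by one latent trait plus independent noise, then looks at the eigenvalues of their correlation matrix. The data are synthetic, and the loadings, sample size, and seed are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

def correlation_eigenvalues(ratings):
    """Eigenvalues of the correlation matrix of `ratings`, largest first.

    `ratings` has one row per university and one column per rating system.
    """
    corr = np.corrcoef(ratings, rowvar=False)
    # eigvalsh handles symmetric matrices and returns ascending order.
    return np.linalg.eigvalsh(corr)[::-1]

# Synthetic data only: one latent "quality" trait plus independent noise.
rng = np.random.default_rng(0)
n_universities, n_systems = 500, 6
quality = rng.normal(size=n_universities)
loadings = np.array([0.9, 0.9, 0.85, 0.8, 0.75, 0.7])  # hypothetical
noise_sd = np.sqrt(1.0 - loadings ** 2)                # keeps each column at unit variance
noise = rng.normal(size=(n_universities, n_systems)) * noise_sd
ratings = quality[:, None] * loadings + noise

eigenvalues = correlation_eigenvalues(ratings)
first_share = eigenvalues[0] / eigenvalues.sum()
print("eigenvalues:", np.round(eigenvalues, 2))
print("share of variance on first factor:", round(float(first_share), 2))
```

When a single trait drives all the ratings, the first eigenvalue dwarfs the rest. That is the pattern the paper reports for the real rankings data.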
To discuss a particular example, here are the inputs (all objective) to the Shanghai (ARWU) rankings:
One could critique these measures in various ways. For example:
Counting Nature and Science papers biases towards life science and away from physical science, computer science, and engineering. Inputs are overall biased toward STEM subjects.
Nobel Prizes are a lagging indicator (ARWU provides an Alternative Rank with prize scoring removed).
Per-capita measures better reflect quality, as opposed to weighting toward quantity (sheer size).
One can see the effects of some of these factors in the figure below. The far left column shows the Alternative Rank (prizes removed), the Rank in ARWU column shows the result using all of the criteria above, and the far right column shows scores after per capita normalization by faculty size. On this last measure, one school dominates all the rest, by margins that may appear shocking ;-)
Note added: Someone asked me about per capita (intensive) vs total quantity (extensive) measures. Suppose there are two physics departments of roughly equal quality, but one with 60 faculty and the other with 30. The former should produce roughly twice as many papers, citations, and prize winners, and roughly twice as much grant support, as the latter. If the two departments are roughly equal in these measures without normalization, then the smaller one is probably of much higher quality. This argument could be applied to the total faculty of a university.
One characteristic that distorts rankings considerably is the presence of a large research medical school and hospital(s). Some schools (Harvard, Stanford, Yale, Michigan, UCSD, Washington, etc.) have them, others (Princeton, Berkeley, MIT, Caltech, etc.) do not. The former group gains an advantage from this medical activity relative to the latter group in aggregate measures of grants, papers, citations, etc. Normalizing by number of faculty helps to remove such distortionary effects. Ideally, one could also normalize these output measures by the degree to which the research is actually reproducible (i.e., real); this would place much more weight on some fields than others ;-)
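The two-department argument can be written out in a few lines. The department names and paper counts below are hypothetical; only the faculty sizes (60 and 30) come from the example above.

```python
from dataclasses import dataclass

@dataclass
class Department:
    name: str
    faculty: int
    papers: int  # an extensive (total quantity) measure

    def papers_per_faculty(self) -> float:
        """The intensive (per capita) version of the same measure."""
        return self.papers / self.faculty

# Equal raw output, but one department is half the size of the other.
large = Department("Large (hypothetical)", faculty=60, papers=300)
small = Department("Small (hypothetical)", faculty=30, papers=300)

for dept in (large, small):
    print(dept.name, dept.papers, dept.papers_per_faculty())
```

On the extensive measure the two look identical. On the per capita measure the smaller department comes out at twice the rate, which is the sense in which it is "probably of much higher quality."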