UniRef50 or UniRef100 for metagenomics? Sometimes faster is not just faster

Rob Edwards

2 days ago

In metagenomics, we often face a familiar trade-off: do we use a smaller, more clustered reference database that runs quickly, or a larger, more detailed database that may give finer resolution but requires substantially more compute?

One common example is the choice between the clustered databases from UniRef. We use both UniRef50 and UniRef100 for protein-based taxonomic or functional classification. Both databases derive from UniProt, but they differ in clustering level. UniRef50 clusters sequences at 50% identity, producing a smaller, less redundant database. UniRef100 retains much more sequence-level detail and is therefore larger, more comprehensive, and more computationally demanding. At the moment, the MMseqs2 version of UniRef100 is 275G, while the MMseqs2 version of UniRef50 is 29G.

The obvious assumption is that UniRef100 should be “better” because it contains more information. But in metagenomics, the more useful question is often: does UniRef100 change the biological interpretation enough to justify the additional runtime?

In one recent comparison using the same metagenomic dataset, the UniRef50 search took 503 seconds using 64 threads. The equivalent UniRef100 search took 9,846 seconds using the same 64 threads.

That is approximately:

8 minutes 23 seconds for UniRef50
2 hours 44 minutes 6 seconds for UniRef100
an almost 20-fold increase in runtime for UniRef100 (even though it is only ~10x the size)
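The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above (503 and 9,846 seconds, 29G and 275G); the helper name `format_duration` is made up for illustration.

```python
def format_duration(seconds: int) -> str:
    """Render a whole number of seconds as hours, minutes and seconds."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    parts = []
    if hours:
        parts.append(f"{hours} h")
    parts.append(f"{minutes} min")
    parts.append(f"{secs} s")
    return " ".join(parts)


# Wall-clock times and database sizes as reported in the post.
uniref50_seconds = 503
uniref100_seconds = 9846
uniref50_size_g = 29
uniref100_size_g = 275

runtime_ratio = uniref100_seconds / uniref50_seconds
size_ratio = uniref100_size_g / uniref50_size_g

print(format_duration(uniref50_seconds))    # 8 min 23 s
print(format_duration(uniref100_seconds))   # 2 h 44 min 6 s
print(f"{runtime_ratio:.1f}x runtime, {size_ratio:.1f}x size")  # 19.6x runtime, 9.5x size
```

In this one comparison, then, runtime grew about twice as fast as database size.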

That is not a minor computational penalty. On a single dataset, it may be acceptable. Across hundreds or thousands of metagenomes, it becomes a major consideration for throughput, queue time, storage, and reproducibility.

What changed in the results?

At both genus and family taxonomic levels, the UniRef50- and UniRef100-derived profiles were broadly concordant, especially after log transformation. This is important because metagenomic abundance tables are typically sparse and highly skewed: a small number of taxa or functions can dominate the raw counts, while most features are rare or absent in most samples.

At the family level, the agreement was especially strong. Comparing family-by-sample abundance values, the raw Pearson correlation was modest, but this was expected because raw count data are sparse and unevenly distributed. After applying a log1p transformation (y=log(1 + x)), the correlation was much stronger. Family-level total abundances across the dataset were also well correlated.
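For readers who want to run the same kind of comparison on their own tables, here is a minimal sketch using NumPy. The two matrices are invented toy data, not the values from this comparison, and the layout (rows as families, columns as samples) is an assumption about how the abundance tables are stored.

```python
import numpy as np


def pearson(a, b) -> float:
    """Pearson correlation between two arrays, flattened to 1-D."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])


# Toy family-by-sample count matrices (rows = families, columns = samples).
# These numbers are made up for illustration only.
uniref50_counts = np.array([
    [12000, 0, 35],
    [3, 0, 410],
    [0, 7, 0],
    [150, 2, 0],
])
uniref100_counts = np.array([
    [9000, 1, 60],
    [5, 0, 980],
    [0, 12, 1],
    [400, 3, 0],
])

# Correlation on raw counts, which a few very large values can dominate.
raw_r = pearson(uniref50_counts, uniref100_counts)

# Correlation after y = log(1 + x), which compresses the large values
# and keeps zeros at zero.
log_r = pearson(np.log1p(uniref50_counts), np.log1p(uniref100_counts))

# Family-level totals across the whole dataset (sum over samples).
family_totals_50 = uniref50_counts.sum(axis=1)
family_totals_100 = uniref100_counts.sum(axis=1)
totals_r = pearson(family_totals_50, family_totals_100)

print(f"raw r = {raw_r:.3f}, log1p r = {log_r:.3f}, family totals r = {totals_r:.3f}")
```

Summing over `axis=0` instead of `axis=1` would give per-sample totals, which can be compared in the same way.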

The sample-level totals were almost perfectly correlated between the two database choices: the two approaches largely agreed on which samples had higher or lower overall assignment levels.

However, the two approaches were not identical. UniRef100 assigned more total counts and, in some cases, appeared to resolve reads that were left at broader or more ambiguous taxonomic levels in the UniRef50-derived result.

The practical interpretation

The key point is not that UniRef50 and UniRef100 are interchangeable. They are not. UniRef100 can produce more assignments and may give more specific taxonomic placements in some parts of the tree.

But for many metagenomic questions, especially when working at higher taxonomic ranks such as family, the broad biological signal may be very similar between UniRef50 and UniRef100. If the main goal is to compare samples, identify large-scale community shifts, or screen many datasets efficiently, UniRef50 may be more than adequate.

If the goal is to chase fine-scale taxonomic differences, resolve difficult clades, or maximise the number of assigned reads, UniRef100 may be worth the cost.

A useful rule of thumb

For large-scale metagenomic projects, I would treat UniRef50 as the sensible default for exploratory analysis, cohort-scale comparisons, and routine pipelines. It is faster, cheaper, and often preserves the major biological patterns.

UniRef100 is better reserved for cases where the additional resolution is likely to matter: detailed reanalysis of selected samples, validation of specific signals, fine-grained taxonomic interpretation, or situations where ambiguous assignments from UniRef50 need to be resolved.

A practical workflow might be:

Run the full dataset against UniRef50.
Identify the major biological patterns, outliers, and features of interest.
Re-run selected samples or specific analyses against UniRef100.
Ask whether UniRef100 changes the conclusion, not just whether it changes the counts.
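The last step is the one that is easiest to skip, so here is one possible way to make it concrete. This Python sketch checks whether the ranking of the most abundant families in a single sample changes between the two databases, rather than whether the counts change. The function name, the choice of the top three families, and the counts themselves are all hypothetical; a real analysis would use whatever criterion matches the biological question.

```python
def top_families(counts: dict[str, int], n: int) -> list[str]:
    """Return the n most abundant families, ties broken alphabetically."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [family for family, _ in ranked[:n]]


# Invented per-family counts for one sample (illustration only).
uniref50_sample = {
    "Bacteroidaceae": 5200,
    "Lachnospiraceae": 3100,
    "Ruminococcaceae": 900,
    "Enterobacteriaceae": 40,
}
uniref100_sample = {
    "Bacteroidaceae": 6100,
    "Lachnospiraceae": 3900,
    "Ruminococcaceae": 1250,
    "Enterobacteriaceae": 75,
    "Prevotellaceae": 12,
}

total50 = sum(uniref50_sample.values())
total100 = sum(uniref100_sample.values())

# The counts differ, but does the ranking of dominant families?
same_top = top_families(uniref50_sample, 3) == top_families(uniref100_sample, 3)

print(f"UniRef50 total: {total50}, UniRef100 total: {total100}")
print(f"Same top-3 families: {same_top}")
```

In this toy sample UniRef100 assigns more counts and picks up one extra rare family, yet the three dominant families and their order are unchanged, which is the kind of result that would argue for staying with UniRef50.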

This gives the best of both worlds: speed and scalability from UniRef50, with targeted use of UniRef100 where the extra resolution may be informative.

Bigger databases are not automatically better

Metagenomics already has enough computational bottlenecks. Bigger databases increase runtime, memory pressure, disk usage, and downstream complexity. They can also increase the number of plausible matches, which does not always make interpretation easier.

The important question is not simply, “Which database is larger?” or “Which database is more complete?”

The better question is:

Does the larger database change the biological conclusion enough to justify the additional computational cost?

In my comparison, the UniRef100 search took nearly 20 times longer than the UniRef50 search. The resulting profiles were not identical, but the major sample-level and family-level abundance patterns were strongly concordant.