[Figure: three-panel flowchart; caption not recovered. Node labels are listed in the order they appear in the figure.]
a. De-novo genome assembly: Raw WMS reads → Quality control (Trimmomatic, Bowtie2) → Quality-controlled reads → Assembly (MetaSPAdes, MEGAHIT) → Contigs → Binning (MetaBAT2, MaxBin2, CONCOCT → MetaWRAP) → Genome bins → Bin QC (CheckM).
b. Genome catalog: UHGG all MAGs (286,997); UHGG (4,644); KIJ genomes (29,082); Dereplication of genomes, 1st (Mash, dRep); KIJ species representatives (2,199); Dereplication of genomes, 2nd (Mash, dRep); HRGM genomes (5,414).
c. Protein catalog: Coding sequence prediction (Prodigal); KIJ redundant proteins (64.7M); UHGP redundant proteins (625.3M); Redundancy removal, 1st (CD-HIT); KIJ proteins CD-HIT 100 (20.6M); UHGP-100 (170.6M); Redundancy removal, 2nd (CD-HIT); HRGM proteins at 100% (103.7M), 95% (20.0M), 90% (14.8M), 70% (8.5M), and 50% (4.7M).
Legend: Intermediate data; UHGG data; HRGM data; Data processing and software.