1 Genome binning

A metagenome assembly consists of contigs from many different genomes. At this stage we don’t know which contigs are from which species. We could try to taxonomically classify each contig but there are 2 problems with this approach:

  1. Some contigs may be misclassified which can lead to multiple contigs from the same genome/organism being classified as various taxa.
  2. Databases are incomplete and so some contigs will not be classified at all (microbial dark matter).

To alleviate these issues genomic binning can be carried out. This will cluster contigs into bins based on:

  • Coverage: Contigs with similar coverage are more likely to be from the same genome.
  • Composition: Contigs with similar GC content are more likely to belong to the same genome.

Genomic binning has been used to discover many new genomes. Additionally, it makes downstream analyses quicker as the downstream steps will be carried out on the sets of bins rather than on one large metagenome assembly.

Binning produces “bins” of contigs of various quality (e.g. draft, complete). These bins are also known as MAGs (Metagenome-assembled genomes). In other words a MAG is a single assembled genome that was assembled with other genomes in a metagenome assembly but later separated from the other assemblies. The term MAG has been adopted by the GSC (Genomics Standards Consortium).

It is recommended to ensure you do not have a poor quality metagenome assembly. Binning requires contigs of good length and good coverage. Extremely low coverage and very short contigs will be excluded from binning.

1.1 MetaBAT2

We will use MetaBAT2 for our genome binning. It has three major upsides that makes it very popular:

  1. It has very reliable default parameters meaning virtually no parameter optimisation is required.
  2. It performs very well amongst genome binners.
  3. It is computationally efficient compared to other binners (requires less RAM, cores etc.)

Make a new directory and move into it.

#Make directory
mkdir -p ~/7-Binning/K1
#Move into it
cd ~/7-Binning/K1

1.1.1 MetaBAT2: depth calculation

To carry out effective genome binning MetaBAT2 uses coverage information of the contigs. To calculate depth we need to align the reads to the metagenome assembly.

1.1.1.1 Index assembly

For the alignment we will use bwa. We need to index our assembly file prior to alignment.

bwa index ~/6-Assembly/K1/final.contigs.fa

1.1.1.2 Alignment

Next we will align our trimmed paired reads we used to create the stitched reads. We will carry this out with the bwa mem command. bwa mem is a good aligner for short reads. If you are using long reads (PacBio or Nanopore) minimap2 will be more appropriate.

bwa mem ~/6-Assembly/K1/final.contigs.fa \
~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz > \
K1.sam

1.1.1.3 Sam to sorted bam

After alignment we need to get the file ready for the contig depth summarisation step. This requires converting the sam file to a bam (binary form of a sam file) file and then sorting the bam file.

# Convert sam to bam file
samtools view -bu K1.sam > K1.bam
# Created sorted bam file
samtools sort K1.bam > K1.sort.bam

1.1.1.4 Summarise depths

Now we can summarise the contig depths from the sorted bam files with MetaBAT2’s jgi_summarize_bam_contig_depths command.

jgi_summarize_bam_contig_depths --outputDepth K1.depth.txt K1.sort.bam

1.1.1.5 View summary depth

You can have a look at the depth file and you will notice there are many contigs with low coverage (<10) and of short length (<1500).

less K1.depth.txt

To get a better look we will open the file in R and look at a summary of the file’s table.

Activate R:

R

Now in R we will read in the file and get a summary() of it.

#Read in the table as an object called df (short for data frame)
#We want the first row to be the column names (header=TRUE)
#We do not want R to check the column names and "fix" them (check.names=FALSE)
df <- read.table("K1.depth.txt", header=TRUE, check.names=FALSE)
#Create a summary of the data
summary(df)

The last command gave us summary information of all the columns. This includes the minimum, maximum, mean, median, and Inter-Quartile Range (IQR) values.

We can see the values of the contigLen and totalAvgDepth are very low. However, this is most likely due to a bunch of short and low coverage contigs which will be ignored by MetaBAT2. Therefore we will remove rows with information on contigs shorter than 1500 and rerun the summary. MetaBAT2’s documentation dictates the minimum contig length should be >=1500 with its default being 2500.

#Set the new object "df_min1500len" as all rows
#where the value in the column "contigLen" of "df"
#Is greater than or equal to 1500
df_min1500len <- df[df$contigLen >= 1500,]
#Summary of our new data frame
summary(df_min1500len)

That is looking better. The minimum average coverage for MetaBAT2 is 1 and our minimum value is 2.700 with a maximum of 93.285 (your values may differ slightly). Now you can quit R and continue.

#quit R
q()
#On the prompt to save your workspace press "n" and then enter.

Note: One of the reasons for our short contigs is that we only used a subset of our sequencing dataset for this tutorial due to time concerns.

1.1.2 MetaBAT2: run

With our assembly and its depth information we can run MetaBAT2 for binning.

#make a diretcory for the bins
mkdir bins
#Run MetaBAT2
metabat2 \
--inFile ~/6-Assembly/K1/final.contigs.fa \
--outFile bins/K1 \
--abdFile K1.depth.txt \
--minContig 1500

1.1.2.1 Parameters

  • --inFile: Input metagenome assembly fasta file.
  • --outFile: Prefix of output files.
  • --abdFile: Base depth file.
  • --minContig: Minimum size of contigs to be used for binning.
    • The default is 2500.
    • We used the minimum value of 1500 as we are using tutorial data. We recommend using the default in your own analysis.

1.1.3 MetaBAT2: output

List the contents of the output directory and you’ll see there is 1 fasta file with the prefix of K1. This is a bin that will hopefully contain 1 MAG (Metagenome-Assembled Genome). In your future analysis you may get many bins, each hopefully only having one MAG.

ls bins

1.2 CheckM

CheckM is a useful tool to assess the quality of assembled bacterial and archaeal genomes. This can be used on assemblies produced from single cell, single isolate, or metagenome data. Additionally, it can be used to identify bins that are likely candidates for merging. This occurs when one genome has been separated into different bins.

An important part of CheckM is the ubiquitous and single-copy genes it utilises. It has sets of these genes for different phylogenetic lineages. With these it can determine:

  • What lineage a bin/MAG belongs to.
    • Does it contain genes only found in Escherichia?
  • How complete the bin/MAG is. A set of lineage specific genes should all be found in a genome belonging to the lineage (ubiquitous).
    • What percentage of these lineage specific genes are present in the MAG?
    • >95% is very good
    • >80% is good
    • >70% is ok
    • <70% is poor to poorer
  • How contaminated the bin/MAG is.
    • Only one copy of each gene should be present (single-copy).
    • Are there any markers for other lineages present?

1.2.1 CheckM2: Mamba

Due to program version conflicts we will use the checkm conda environment for this section.

Open a new terminal and activate the checkm environment.

mamba create -n checkm2 -c bioconda -c conda-forge checkm2
mamba activate checkm2

Ensure you are in the correct directory.

cd ~/7-Binning/K1/

1.2.2 CheckM2: run

CheckM2 has many different commands. We will use one of the common workflows it provides called lineage_wf. This carries out five of its commands in a workflow (i.e. the next step uses output from the previous step).

  1. tree - Places bins in the reference genome tree. This reference tree comes with CheckM.
  2. tree_qa - Assess the phylogenetic markers found in each bin.
  3. lineage_set - Infers lineage-specific marker sets for each bin.
  4. analyze - Identifies marker genes in bins.
  5. qa - Assesses the bins for contamination and completeness.

Run the CheckM2 command (this will take a while):

checkm2 database --download
ls checkm2 predict --threads 20 --input bins -x fa --output-directory checkm_output

1.2.2.1 Parameters

  • -x : Suffix/extension of bin files. Other files are ignored in the specified bin directory.
    • fa is used as our MetaBAT2 analysis produced fasta files that end in .fa.
  • bins : The second last argument is the bin_dir, the directory containing all the bins to be analysed in fasta format.
  • checkm_output : The last argument is the directory to store the output to. This directory should not exist prior to running.

1.2.3 CheckM2: output

As we have only used a subset of data the results are not very good. We’ll therefore look at premade results. These premade results were produced using the entire K1 dataset. First you will need to copy them.

cd ~/7-Binning
mkdir K1_fullset
cd K1_fullset
 https://cgr.liv.ac.uk/454/acdarby/LIFE750/Workshops/7-Binning/K1_fullset 

Now we can look at the results table that is in your current directory.

less MAGS_checkm.tsv

CheckM statistics definitions

List of statistics definitions from CheckM wiki

Note: These may not all appear in your MAGS_checkm.tsv file.

  • bin id: unique identifier of genome bin (derived from input fasta file)
  • marker lineage: indicates the taxonomic rank of the lineage-specific marker set used to estimated genome completeness, contamination, and strain heterogeneity. More detailed information about the placement of a genome within the reference genome tree can be obtained with the tree_qa command. The UID indicates the branch within the reference tree used to infer the marker set applied to estimate the bins quality.
  • # genomes: number of reference genomes used to infer the lineage-specific marker set
  • markers: number of marker genes within the inferred lineage-specific marker set
  • marker sets: number of co-located marker sets within the inferred lineage-specific marker set
  • 0-5+: number of times each marker gene is identified
  • completeness: estimated completeness of genome as determined from the presence/absence of marker genes and the expected collocalization of these genes (see Methods in the PeerJ preprint for details)
  • contamination: estimated contamination of genome as determined by the presence of multi-copy marker genes and the expected collocalization of these genes (see Methods in the PeerJ preprint for details)
  • strain heterogeneity: estimated strain heterogeneity as determined from the number of multi-copy marker pairs which exceed a specified amino acid identity threshold (default = 90%). High strain heterogeneity suggests the majority of reported contamination is from one or more closely related organisms (i.e. potentially the same species), while low strain heterogeneity suggests the majority of contamination is from more phylogenetically diverse sources (see Methods in the CheckM manuscript for more details).
  • genome size: number of nucleotides (including unknowns specified by N’s) in the genome
  • # ambiguous bases: number of ambiguous (N’s) bases in the genome
  • # scaffolds: number of scaffolds within the genome
  • # contigs: number of contigs within the genome as determined by splitting scaffolds at any position consisting of more than 10 consecutive ambiguous bases
  • N50 (scaffolds): N50 statistics as calculated over all scaffolds
  • N50 (contigs): N50 statistics as calculated over all contigs
  • longest scaffold: the longest scaffold within the genome
  • longest contig: the longest contig within the genome
  • GC: number of G/C nucleotides relative to all A,C,G, and T nucleotides in the genome
  • coding density: the number of nucleotides within a coding sequence (CDS) relative to all nucleotides in the genome
  • translation table: indicates which genetic code was used to translate nucleotides into amino acids
  • # predicted genes: number of predicted coding sequences (CDS) within the genome as determined using Prodigal

1.2.4 CheckM: Quality score

One quick way to calculate the overall quality of a bin is with the following equation:

\ q = comp - (5 * cont) \ Where:

  • q = Overall quality
  • comp = Completeness
  • cont = Contamination

A score of at least 70-80% would be the aim, with a maximum/perfect value being 100% (100% completeness, 0% contamination). Therefore let us calculate this for the bins with some bash and awk scripting.

1.2.4.1 Tab to comma seperated

First it is good to make a copy of the file in case we make a mistake and want to start over. Additionally, we will make the copy a comma separated file. I find these easier to edit as typing comma characters (,) in commands is more reliable than tab characters (\t).

This is carried out with tr which can translate characters. In the below case we translate/convert the tabs (\t) into commas (,). The converted output is then redirected (>) to the file “MAGS_checkm.csv”.

cat MAGS_checkm.tsv | tr "\t" "," > MAGS_checkm.csv

1.2.4.2 Quality file

We will create a new file with only the quality information. We’ll start by making a file with only a header.

echo "Quality" > MAGS_quality.csv

1.2.4.3 Calculate quality with awk

Next is the most complicated command. We will be calculating the Overall quality (see calculation above) for each row except the header row.

We will be using a complicated linux based language called awk. This is very useful as it can carry out calculations on columns or as awk calls them, fields.

As this is new and complicated we will build up our command step by step.

The first step is to extract the completeness and contamination fields/columns.

awk -F, '{print $12,$13}' MAGS_checkm.csv
  • -F,: Indicates the input fields are separated by commas (,).
  • '': All the awk options are contained within the quotes.
  • {}: We can supply a function to awk within the braces.
  • print $12,$13: This function instructs awk to print the 12th (completeness) and 13th (contamination) fields. It is common to put commas (,) between fields if printing multiple fields.
  • MAGS_checkm.csv: Our last parameter is the input file. We are not changing the contents of the file, only printing information to screen/stdout.

We do not want the header in our calculation so we will add an extra awk option.

awk -F, 'NR>1 {print $12,$13}' MAGS_checkm.csv
  • NR>1: NR stands for number of records. Rows are called records in awk. Therefore NR>1 means awk will only carry out the functions on the records numbered greater than 1. I.e. skip row 1, the header row.

The final step is to carry out the overall quality calculation and append the information to the “MAGS_quality.csv” file.

awk -F, 'NR>1 {print $12 - (5 * $13)}' MAGS_checkm.csv >> MAGS_quality.csv
  • {print $12 - (5 * $13)}: Our new function carries out the overall quality calculation and prints it for each record/row except the first (NR>1).
  • >> MAGS_quality.csv: The printed information is appended (>>) to the file “MAGS_quality.csv”. We append because we want to retain the header we added to the file earlier.

You can view the file to ensure it worked. The first and second values should be 33.62 and -0.89

less MAGS_quality.csv

1.2.4.4 Add quality to the checkm results file

Now we can combine the files “MAGS_checkm.csv” and “MAGS_quality.csv” with the paste command. The -d "," option indicates the merged files will be separated by commas (,), matching the column separation in “MAGS_checkm.csv”.

paste -d "," MAGS_checkm.csv MAGS_quality.csv > MAGS_checkm_quality.csv

1.2.5 CheckM: MCQs

Viewing the file “MAGS_checkm_quality.csv” attempt the below questions.

Tip: You can use the cut command to look at specific columns. For example:

#look at the "Bin Id" and "Quality"
#Convert the printed output's commas to tabs for readability
cut -d "," -f 1,15 MAGS_checkm_quality.csv | tr "," "\t"
  1. What lineage was assigned to bin K1.1?

    root (UID1) k_Bacteria (UID203) o_Lachnospiraceae (UID1286)

  2. What lineage was assigned to bin K1.22?

    root (UID1) k_Bacteria (UID203) o_Lachnospiraceae (UID1286)

  3. What lineage was assigned to bin K1.8?

    root (UID1) k_Bacteria (UID203) o_Lachnospiraceae (UID1286)

  4. What is the quality value of K1.1?

    4.17 33.62 172

  5. What is the completeness value of K1.27?

    4.17 33.62 172

  6. How many genomes are within the K1.12 bin?

    4.17 33.62 172

  7. Which bin has the highest quality value (96.38%)?

    K1.20 K1.21 K1.22

  8. Which bin has the lowest quality value (-300.67%)?

    K1.20 K1.21 K1.22

  9. Which bin has the highest completeness value (98.27%)?

    K1.20 K1.21 K1.22

1.3 Binning summary

It is always useful to know the quality of your bins so you know which are more reliable than others. With that information you can be more or less certain when concluding your findings.

We have some good quality bins. However, the best bins would only have one genome. Ultimately binning is trying to separate all the genomes from each other. A better metagenome assembly would most likely have led to better binning.