Kraken2

Before running Kraken2, we need to tell it where its database files are. This can be done either on the command line or by setting an environment variable.

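There are two ways to point Kraken2 at a database. Either pass the database path with --db (here <path_to_reads.fastq> is a placeholder for your reads file):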

kraken2 \
  --db ~/kraken2_db \
  --threads 4 \
  --report kraken_report.txt \
  <path_to_reads.fastq> > kraken_output.txt

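Or set KRAKEN2_DB_PATH, a colon-separated list of directories that Kraken2 searches when --db is given a database name without a slash:

export KRAKEN2_DB_PATH=~/kraken2_db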

You can inspect the contents of this directory to confirm that it contains the Kraken2 database. This mini database includes only a subset of bacterial, archaeal, and viral genomes, which reduces runtime and computational requirements for the practical.
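A built Kraken2 database directory normally holds three files: hash.k2d, opts.k2d, and taxo.k2d. The sketch below checks for them before you start a run. The function name check_kraken2_db is hypothetical, and the path in the usage comment assumes the database location used in this practical.

```shell
# Hypothetical helper: check that a directory looks like a built Kraken2 database.
# Usage: check_kraken2_db ~/kraken2_db && echo "database looks complete"
check_kraken2_db() {
  chk_dir="$1"
  chk_missing=0
  for chk_file in hash.k2d opts.k2d taxo.k2d; do
    # -s is true only if the file exists and is not empty
    if [ ! -s "$chk_dir/$chk_file" ]; then
      echo "missing or empty: $chk_dir/$chk_file" >&2
      chk_missing=1
    fi
  done
  return "$chk_missing"
}
```

The function returns a non-zero status and names the missing file on standard error if the directory is incomplete, so it can be used as a guard in front of a kraken2 command.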

For real analyses, we recommend using a full Kraken2 database, which includes all complete bacterial, archaeal, and viral genomes available in RefSeq at the time of database construction.

Further information on Kraken2 databases:

https://github.com/DerrickWood/kraken2/wiki/Manual#standard-kraken-2-database

https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases

1 Running Kraken2

We will now run Kraken2 on sample K1.

Note:

For this practical, we are using trimmed reads rather than host-removed reads to save time. In real analyses, you should always use host-removed data.

kraken2 --paired \
  --db ~/kraken2_db \
  --output K1.kraken \
  --report K1.kreport2 \
  ~/2-Trimmed/K1_R1.fq.gz ~/2-Trimmed/K1_R2.fq.gz

Parameters:

  • --paired Indicates paired-end input. Kraken2 reads the two files together and classifies each read pair as a single unit.
  • --db Specifies the Kraken2 database to use. A value containing a slash, as here, is used directly as the path to the database; a name without a slash is looked up in the directories listed in KRAKEN2_DB_PATH.
  • --output File containing the per-read classification results.
  • --report Summary report file used by downstream tools such as Bracken.
  • Input files Trimmed paired-end reads for sample K1.

2 Kraken2 output formats

Kraken2 produces two main output files.

2.1 Output file (.kraken)

Per-sequence classification output

Each sequence (or sequence pair) processed by Kraken2, whether classified or not, produces one line with five tab-delimited fields:

  1. Classification status
    • C = classified
    • U = unclassified
  2. Sequence ID
    • Taken from the FASTQ header
  3. Assigned taxonomy ID
    • 0 if unclassified
  4. Sequence length (bp)
    • Paired reads are reported as R1|R2
      (e.g. 98|94)
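  5. Lowest Common Ancestor (LCA) mapping
    • Space-delimited list of taxid:count pairs giving the LCA assigned to each run of k-mers along the sequence, e.g. 562:13 561:4 A:31 0:1 562:3 (A marks k-mers containing an ambiguous nucleotide, 0 marks k-mers not found in the database)

For paired-end data, this field contains a |:| token marking the boundary between the two reads.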
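To illustrate how these fields can be used, the sketch below defines two small awk-based helpers. The first counts C and U lines in a .kraken file; the second sums the k-mer counts per taxonomy ID in one field-5 string, which is a space-delimited list of taxid:count pairs. Both function names are hypothetical and are not part of Kraken2.

```shell
# Hypothetical helper: summarise classification status (field 1) of a .kraken file.
# Usage: kraken_classified_fraction K1.kraken
kraken_classified_fraction() {
  awk -F'\t' '
    $1 == "C" { c++ }
    $1 == "U" { u++ }
    END {
      total = c + u
      if (total == 0) { print "no records"; exit 1 }
      printf "classified=%d unclassified=%d percent_classified=%.2f\n", c, u, 100 * c / total
    }' "$1"
}

# Hypothetical helper: sum k-mer counts per taxid in one LCA mapping string (field 5).
# Usage: kmer_tally "562:13 561:4 A:31 0:1 562:3 |:| 562:2"
kmer_tally() {
  printf '%s\n' "$1" | tr ' ' '\n' | awk -F: '
    $0 == "|:|" || $0 == "" { next }   # skip the read-pair boundary token
    { hits[$1] += $2 }                 # $1 = taxid (or A), $2 = k-mer count
    END { for (t in hits) print t "\t" hits[t] }' | sort
}
```

The tally makes it easy to see how many k-mers supported the assigned taxon compared with other taxa, ambiguous k-mers (A), and k-mers absent from the database (0).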

2.2 Report file (.kreport2)

This file summarises taxonomic classification and is required for Bracken.

3 Report file columns

The columns of the report file (.kreport2) are:

  1. Percentage of read pairs assigned to the clade rooted at this taxon
  2. Number of read pairs assigned to the clade rooted at this taxon
  3. Number of read pairs assigned directly to this taxon
  4. Rank code
    (U, R, D, K, P, C, O, F, G, S)
  5. NCBI taxonomic ID
  6. Scientific name (indented by taxonomic depth)
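As a quick way to read a report, the sketch below lists the most abundant taxa at a chosen rank. It assumes the default six-column, tab-delimited report layout listed above; the function name top_taxa is hypothetical.

```shell
# Hypothetical helper: show the top N taxa at one rank from a Kraken2 report.
# Arguments: report file, rank code (e.g. S or G), number of rows.
# Output columns: percentage, clade read count, taxonomy ID, name.
# Usage: top_taxa K1.kreport2 S 10
top_taxa() {
  awk -F'\t' -v rank="$2" '
    $4 == rank {
      name = $6
      sub(/^ +/, "", name)              # drop the depth indentation
      printf "%s\t%s\t%s\t%s\n", $1 + 0, $2, $5, name
    }' "$1" | sort -t "$(printf '\t')" -k1,1nr | head -n "$3"
}
```

Sorting on the first column ranks taxa by the percentage of read pairs in their clade; changing the rank code to G would give a genus-level view of the same report.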

3.1 Screen output

Kraken2 also prints a summary to the screen showing how many sequences were classified and unclassified. The classified percentage will be lower than in a real analysis because the mini database contains only a small number of genomes.

4 Using the --confidence option

In real analyses, you may wish to use the --confidence option, which sets a minimum confidence score for classification.

  • Default: 0.0
  • Maximum: 1.0
  • Typical starting value: 0.1

With the mini Kraken2 database used in this practical, applying a confidence threshold removes too many classifications.

Further information: https://github.com/DerrickWood/kraken2/wiki/Manual#confidence-scoring
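As a sketch of how a threshold could be applied when a full database is available, the function below reruns one sample with --confidence and writes to separate files so that the default results are not overwritten. The function name and the output file naming are assumptions for illustration; the database and read paths are the ones used in this practical.

```shell
# Hypothetical helper: classify one sample with a confidence threshold.
# Arguments: sample name, threshold between 0.0 and 1.0.
# Usage: classify_with_confidence K1 0.1
classify_with_confidence() {
  conf_sample="$1"
  conf_threshold="$2"
  kraken2 --paired --db ~/kraken2_db \
    --confidence "$conf_threshold" \
    --output "${conf_sample}.conf${conf_threshold}.kraken" \
    --report "${conf_sample}.conf${conf_threshold}.kreport2" \
    ~/2-Trimmed/"${conf_sample}"_R1.fq.gz ~/2-Trimmed/"${conf_sample}"_R2.fq.gz
}
```

Comparing the report from a thresholded run with the default report shows which classifications depend on weak k-mer support.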

5 Task

Once the Kraken2 command has finished running for K1, repeat the analysis for K2 and W1.

You will need to replace all instances of K1 with K2 or W1 in the command above.

# K2
kraken2 --paired --db ~/kraken2_db \
  --output K2.kraken --report K2.kreport2 \
  ~/2-Trimmed/K2_R1.fq.gz ~/2-Trimmed/K2_R2.fq.gz

# W1
kraken2 --paired --db ~/kraken2_db \
  --output W1.kraken --report W1.kreport2 \
  ~/2-Trimmed/W1_R1.fq.gz ~/2-Trimmed/W1_R2.fq.gz
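With more samples, editing the command by hand becomes error-prone. One possible alternative is a loop over sample names, sketched below. The function name classify_samples is hypothetical, and the sketch assumes every sample follows the same file naming as K1, K2, and W1.

```shell
# Hypothetical helper: run the same Kraken2 command for each sample name given.
# Usage: classify_samples K2 W1
classify_samples() {
  for cs_sample in "$@"; do
    kraken2 --paired --db ~/kraken2_db \
      --output "${cs_sample}.kraken" --report "${cs_sample}.kreport2" \
      ~/2-Trimmed/"${cs_sample}"_R1.fq.gz ~/2-Trimmed/"${cs_sample}"_R2.fq.gz \
      || return 1   # stop at the first sample that fails
  done
}
```

Stopping at the first failure keeps a missing or misnamed input file from going unnoticed among the output of later samples.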