Host Removal

It is good practice to remove host-derived sequences from sequencing data before downstream analysis. A common approach is to align reads to a host reference genome and remove any reads that successfully map to that reference.

If a host genome is not available prior to sample collection and sequencing, it may be necessary to sequence and assemble the host genome. Long-read sequencing technologies are recommended for single-genome assembly projects.

This chapter demonstrates host removal using a small subset of the human reference genome. In real analyses, the complete host genome reference should always be used.

1 Reference preparation

Create a working directory and download the host reference FASTA file into it.

mkdir 2-5_Host_removed
cd 2-5_Host_removed
wget -r -np -nH --cut-dirs=3 -A "GRCh38_slice.fasta" https://cgr.liv.ac.uk/454/acdarby/LIFE750/
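
Before building an index it can be worth confirming that the download produced a usable FASTA file. The following is a minimal sketch using only awk; the helper name fasta_stats is hypothetical and not part of any tool used in this chapter.

```shell
# Report the number of sequences and total bases in a FASTA file.
# fasta_stats is a hypothetical helper name used only for illustration.
fasta_stats() {
  awk '
    /^>/ { nseq++; next }                              # header line: one sequence
         { gsub(/[ \t\r]/, ""); bases += length($0) }  # sequence line: count bases
    END  { printf "sequences=%d bases=%d\n", nseq, bases }
  ' "$1"
}

# Usage: fasta_stats GRCh38_slice.fasta
```

A result of zero sequences would suggest that the download failed or that the file is not in FASTA format.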

2 Indexing the reference

We will use Bowtie2 to align reads to the host genome. Before alignment, the reference must be indexed.

bowtie2-build GRCh38_slice.fasta GRCh38_slice.fasta

After indexing, several files ending in .bt2 will be created. These files are required by Bowtie2 during alignment.
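
A quick way to confirm that indexing finished is to check that every index file exists and is not empty. The sketch below assumes a standard (small) Bowtie2 index, which consists of six files; very large references produce files ending in .bt2l instead, which this sketch does not handle. The helper name check_bt2_index is hypothetical.

```shell
# check_bt2_index is a hypothetical helper. Its argument is the index
# basename, i.e. the second argument that was given to bowtie2-build.
check_bt2_index() {
  prefix=$1
  missing=0
  for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$prefix.$suffix" ]; then
      echo "missing or empty: $prefix.$suffix" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: check_bt2_index GRCh38_slice.fasta
```

The function returns 0 when all six files are present, so it can be used as a guard before the alignment step.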

3 Alignment to the host genome

The K1 paired-end reads are aligned to the host reference, and the SAM output is piped to samtools for direct conversion to BAM. The resulting BAM file contains all reads, both mapped and unmapped.

bowtie2 -x GRCh38_slice.fasta -1 ../K1_R1.fq.gz -2 ../K1_R2.fq.gz -p 12 2> K1_bowtie2_out.log | samtools view -b -o K1_mapped.bam -

The K1_bowtie2_out.log file contains alignment statistics and any error messages produced by Bowtie2.
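
The final line of the Bowtie2 summary reports the overall alignment rate, which here approximates the fraction of reads that matched the host reference. A minimal sketch for pulling out that value is shown below; the helper name alignment_rate is hypothetical.

```shell
# alignment_rate is a hypothetical helper name.
# It prints the percentage from the "overall alignment rate" summary line
# that Bowtie2 writes to standard error.
alignment_rate() {
  awk '/overall alignment rate/ { print $1 }' "$1"
}

# Usage: alignment_rate K1_bowtie2_out.log
```

If the function prints nothing, the log probably contains an error message rather than a summary and should be read in full.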

4 Extracting unmapped reads

Reads that did not map to the host genome are extracted using samtools fastq.

samtools fastq -f 4 -1 ../K1_R1.u.fq -2 ../K1_R2.u.fq K1_mapped.bam

The -f 4 option keeps only reads whose SAM flag has the 0x4 bit set, meaning that the read itself is unmapped.

Because this filter is applied to each read individually, some read pairs may become unpaired: if only one read in a pair mapped to the host genome, its unmapped mate is still written out on its own.
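
To see why orphaned reads appear, it helps to decode a few SAM flag values by hand. The sketch below uses shell arithmetic to test bit 4 (read unmapped) and bit 8 (mate unmapped). The helper name describe_flag is hypothetical, and the comparison with -f 12, which would require both bits to be set, is shown only as an illustration of a stricter filter, not as the method used in this chapter.

```shell
# describe_flag is a hypothetical helper that reports whether a record with
# the given SAM flag would pass "-f 4" and whether it would pass "-f 12".
describe_flag() {
  flag=$1
  # -f 4 requires bit 4 (this read is unmapped)
  if [ $(( flag & 4 )) -eq 4 ]; then f4=yes; else f4=no; fi
  # -f 12 requires bits 4 and 8 (this read and its mate are both unmapped)
  if [ $(( flag & 12 )) -eq 12 ]; then f12=yes; else f12=no; fi
  echo "flag=$flag kept_by_f4=$f4 kept_by_f12=$f12"
}

describe_flag 77    # 1+4+8+64: paired, read unmapped, mate unmapped, first in pair
describe_flag 69    # 1+4+64:   paired, read unmapped, mate mapped, first in pair
describe_flag 137   # 1+8+128:  paired, read mapped, mate unmapped, second in pair
```

A read with flag 69 passes -f 4 while its mapped mate does not, which is exactly the situation that leaves a read without a partner.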

5 Re-pairing reads

To ensure that downstream analyses receive properly paired reads, we remove unpaired reads using the repair.sh script from BBTools.

repair.sh \
  in1=../K1_R1.u.fq \
  in2=../K1_R2.u.fq \
  out1=K1_R1.final.fastq \
  out2=K1_R2.final.fastq \
  outs=K1_singletons.fastq

The K1_singletons.fastq file contains reads without a matching pair and can usually be ignored.
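
After re-pairing, the two final files should contain the same number of reads. A minimal sketch of that check follows; it assumes uncompressed FASTQ files in which every record spans exactly four lines, and the helper names count_reads and check_paired are hypothetical.

```shell
# count_reads and check_paired are hypothetical helper names.
count_reads() {
  # Each FASTQ record is assumed to span exactly four lines.
  awk 'END { print NR / 4 }' "$1"
}

check_paired() {
  n1=$(count_reads "$1")
  n2=$(count_reads "$2")
  echo "R1=$n1 R2=$n2"
  # Succeeds (returns 0) only when both files hold the same number of reads.
  [ "$n1" = "$n2" ]
}

# Usage: check_paired K1_R1.final.fastq K1_R2.final.fastq
```

Equal counts do not prove that read names match in order, but unequal counts are a clear sign that re-pairing did not work as intended.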

6 Summary

Host removal for the K1 sample is now complete. Because this dataset contains very little human DNA, host removal is skipped for the remaining samples, and their trimmed reads are used directly in downstream analyses. This practical used a random 10 kb slice of the human genome to reduce runtime; real projects should always perform host removal against a complete host genome reference.

7 Further resources

Human genome overview: https://www.ncbi.nlm.nih.gov/genome/guide/human/
Reference assembly: https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz