Quality control
Now that we have obtained the raw data and inspected it, we should clean it up. With any sequencing data, it is very important to ensure that you use the highest quality data possible: rubbish goes in, rubbish comes out.
There are two main methods employed to clean sequence data, and a third method specific to some metagenomic datasets:
- Remove low-quality bases from the ends of reads. These bases are more likely to be incorrect and should be trimmed.
- Remove adapters. When a DNA fragment is shorter than the read length, sequencing runs off the end of the fragment and into the adapter, so adapter sequence appears at the 3' end of the read.
- Host removal. For host-associated metagenomic samples, it may be advisable to remove reads originating from the host genome.
1 Removing adapters and low-quality bases
Go back to your home directory and create a new directory where we will clean the sequences:
cd
mkdir 2-Trimmed
cd 2-Trimmed
You are now in your newly created directory. Here we will run Trim Galore!, which removes low-quality bases and sequencing adapters.
trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/K1_R1.fastq.gz ../1-Raw/K1_R2.fastq.gz
This command:
- Trims low-quality bases from the ends of reads (Phred quality score below 20)
- Removes adapter sequence from the 3' end of a read when at least four bases overlap the adapter
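To put the quality threshold in context: a Phred score Q corresponds to an error probability of 10^(-Q/10), so a base with Q20 has roughly a 1 in 100 chance of being wrong. The sketch below does this arithmetic with awk; the function name phred_to_error is hypothetical and is not part of Trim Galore!.

```shell
# Convert a Phred quality score to the probability that the base call is wrong.
# phred_to_error is a hypothetical helper name used only for this illustration.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # the threshold used above; prints 0.0100
phred_to_error 30   # prints 0.0010
```

Raising the threshold removes more error-prone bases but also shortens more reads, so Q20 is a common compromise rather than a fixed rule.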
1.1 Task
Rerun this command for the other two samples (K2 and W1) without referring to the solution below.
K2
trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/K2_R1.fastq.gz ../1-Raw/K2_R2.fastq.gz
W1
trim_galore --paired --quality 20 --stringency 4 \
../1-Raw/W1_R1.fastq.gz ../1-Raw/W1_R2.fastq.gz
1.2 Rename the files
Once trimming is complete, list the contents of the directory:
ls
You will see new files created: two trimmed read files for each sample, along with trimming reports. The filenames are unnecessarily long, for example:
- K1_R1_val_1.fq.gz
We will rename them to shorter, more consistent names:
mv K1_R1_val_1.fq.gz K1_R1.fq.gz
mv K1_R2_val_2.fq.gz K1_R2.fq.gz
mv K2_R1_val_1.fq.gz K2_R1.fq.gz
mv K2_R2_val_2.fq.gz K2_R2.fq.gz
mv W1_R1_val_1.fq.gz W1_R1.fq.gz
mv W1_R2_val_2.fq.gz W1_R2.fq.gz
Tip: Press the up arrow key to recall and edit previous commands.
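With only three samples, typing six mv commands is manageable, but the same renaming can be expressed as a loop. The following is a minimal sketch, assuming the trimmed files follow the _val_1 / _val_2 naming shown above; the function name rename_trimmed is hypothetical. If the files have already been renamed, it does nothing.

```shell
# Strip the "_val_1" / "_val_2" suffix from trimmed paired-end files
# in the current directory.
# rename_trimmed is a hypothetical helper name used only for this sketch.
rename_trimmed() {
    for f in *_val_[12].fq.gz; do
        [ -e "$f" ] || continue    # skip if the pattern matched no files
        # Remove the trailing "_val_N.fq.gz", then add ".fq.gz" back
        mv -- "$f" "${f%_val_[12].fq.gz}.fq.gz"
    done
}

rename_trimmed
```

Run ls afterwards to confirm that each sample now has an R1.fq.gz and an R2.fq.gz file.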
1.3 Task
Briefly inspect the trimming report files (for example K1_R1.fastq.gz_trimming_report.txt). What proportion of reads were trimmed or discarded?
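Rather than opening each report in turn, the headline numbers can be pulled out with grep. This is a sketch under assumptions: the function name summarise_reports is hypothetical, and the line labels it searches for are assumed to match the Cutadapt-style summary that Trim Galore! embeds in its reports. Labels may differ between versions, so check one report by eye first.

```shell
# Print the headline counts from every trimming report in the current directory.
# summarise_reports is a hypothetical helper name; the labels in the grep
# pattern are assumptions about the report wording and may need adjusting.
summarise_reports() {
    for report in *_trimming_report.txt; do
        [ -e "$report" ] || continue    # skip if no reports are present
        echo "== $report"
        grep -E 'Total reads processed|Reads with adapters|Quality-trimmed|pairs removed' "$report"
    done
}

summarise_reports
```

If a report prints only its filename, the labels did not match; open the file with less and adapt the pattern.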
Inspect the trimmed data
To assess the effect of trimming, run FastQC and MultiQC again on the trimmed files.
2 R1 FastQC and MultiQC
mkdir R1_fastqc
fastqc -t 3 -o R1_fastqc *R1.fq.gz
mkdir R1_fastqc/multiqc
multiqc -o R1_fastqc/multiqc R1_fastqc
2.1 Task
Run FastQC and MultiQC for the R2 files, then view both R1 and R2 MultiQC reports in Firefox.
Try to run the commands without looking at the solution below.
How does the quality compare to the untrimmed data?
mkdir R2_fastqc
fastqc -t 3 -o R2_fastqc *R2.fq.gz
mkdir R2_fastqc/multiqc
multiqc -o R2_fastqc/multiqc R2_fastqc
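The R1 and R2 commands differ only in the read label, so once you are comfortable running them by hand they can be wrapped in a small function. This is a minimal sketch that reuses the same fastqc and multiqc options as above; the function name qc_trimmed is hypothetical.

```shell
# Run FastQC, then MultiQC, on the trimmed files for each read label given.
# qc_trimmed is a hypothetical helper name; the fastqc and multiqc options
# mirror the commands used above.
qc_trimmed() {
    for read in "$@"; do
        # -p creates the parent directory too and does not fail if it exists
        mkdir -p "${read}_fastqc/multiqc"
        fastqc -t 3 -o "${read}_fastqc" ./*"${read}".fq.gz
        multiqc -o "${read}_fastqc/multiqc" "${read}_fastqc"
    done
}

# Usage, from inside 2-Trimmed:
#   qc_trimmed R1 R2
```

The block only defines the function; nothing runs until you call it with the read labels you want to process.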