Raw data
The very first thing we need to do is to obtain a dataset to work with. The European Bioinformatics Institute (EBI) provides an excellent metagenomics resource (https://www.ebi.ac.uk/metagenomics/) which allows users to download publicly available metagenomic and metagenetic datasets.
Have a browse of some of the projects by selecting one of the biomes on the website.
We have selected a dataset from this site that consists of DNA shotgun data generated from 24 human faecal samples. Twelve of these samples are from subjects who were fed a Western diet and twelve are from subjects who were fed a Korean diet. The project page is:
https://www.ebi.ac.uk/metagenomics/projects/ERP005558
1 Obtaining the data
First, we need to create a directory to put the data in and then change directory to it.
mkdir 1-Raw
cd 1-Raw
wget -r -np -nH --cut-dirs=3 -A "*.fastq.gz" https://cgr.liv.ac.uk/454/acdarby/LIFE750/
#return to home
What this wget command does:
-r – recursive download
-np – no parent directory traversal
-nH – do not create the host directory (cgr.liv.ac.uk)
--cut-dirs=3 – strip 454/acdarby/LIFE750 from the local path
-A "*.fastq.gz" – download only FASTQ .gz files
There should be six files in the directory, two for each sample in the dataset (e.g. K1_R1.fastq.gz).
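As a quick sanity check after the download, the expected files can be counted and checked by name. The sketch below is only illustrative: the six file names are assumed from the sample IDs (K1, K2, W1) and the R1/R2 naming described in this chapter, and it builds empty placeholder files in a temporary directory so that it can be tried without the real data. In the workshop you would point it at 1-Raw instead.

```shell
# Assumed file names: <sample>_<read>.fastq.gz for samples K1, K2, W1.
# Placeholder files in a scratch directory stand in for the real download.
demo_dir=$(mktemp -d)
for sample in K1 K2 W1; do
  for read_id in R1 R2; do
    : > "$demo_dir/${sample}_${read_id}.fastq.gz"
  done
done

# Report every expected file that is absent, counting them as we go.
missing=0
for sample in K1 K2 W1; do
  for read_id in R1 R2; do
    file="$demo_dir/${sample}_${read_id}.fastq.gz"
    if [ ! -e "$file" ]; then
      echo "Missing: $file"
      missing=$((missing + 1))
    fi
  done
done

# Count all FASTQ files that are present.
n_files=$(find "$demo_dir" -name '*.fastq.gz' | wc -l | tr -d ' ')
echo "Found $n_files files, $missing missing"

# Remove the scratch directory.
rm -rf "$demo_dir"
```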
2 FASTQ files in this dataset
2.1 File naming structure
The file ID has three components:
- K1 → sample ID
- R1 → forward read in the Illumina read pair (R2 corresponds to the reverse read)
- fastq.gz → gzipped FASTQ file
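The naming structure above can be taken apart in the shell with parameter expansion, which is handy when looping over many files. This is a minimal sketch; the variable names are our own and are not part of the workshop materials.

```shell
file="K1_R1.fastq.gz"

# Strip the .fastq.gz extension, leaving "K1_R1".
base="${file%.fastq.gz}"

# Everything before the last underscore is the sample ID.
sample="${base%_*}"

# Everything after the last underscore is the read direction.
read_id="${base##*_}"

echo "sample=$sample read=$read_id"
```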
2.2 Samples included
- K1 → Faecal sample from an individual on a Korean diet
- K2 → Faecal sample from an individual on a Korean diet
- W1 → Faecal sample from an individual on a Western diet
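For the three samples listed, the first letter of the sample ID indicates the diet. Assuming that convention holds, a small helper (the function name diet_for_sample is hypothetical) can label samples in later scripts:

```shell
# Map a sample ID to its diet group, based on the first letter of the ID.
# This relies on the K/W prefix convention seen in the three samples above.
diet_for_sample() {
  case "$1" in
    K*) echo "Korean" ;;
    W*) echo "Western" ;;
    *)  echo "unknown" ;;
  esac
}

# Print a tab-separated sample/diet table.
for s in K1 K2 W1; do
  printf '%s\t%s\n' "$s" "$(diet_for_sample "$s")"
done
```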
3 Paired-end reads
In Illumina sequencing, the vast majority of reads are paired-end. DNA is first fragmented, and both ends of each fragment are sequenced.
This results in two sequences for each fragment:
- R1: one end of the fragment
- R2: the opposite end of the fragment
FASTQ is a sequence format similar to FASTA, with the addition of quality scores.
To inspect the first few lines of a FASTQ file:
zcat K1_R1.fastq.gz | head -n 4 | less -S
The pipe symbol (|) is used to pass the output of one command as input to the next. This command:
1. Unzips the FASTQ file
2. Displays the first four lines
3. Displays them without line wrapping (-S)
These four lines represent a single FASTQ entry (one read):
- Line 1: read identifier
- Line 2: nucleotide sequence
- Line 3: secondary header (usually +)
- Line 4: quality scores for each base
To return to the command prompt, press q.
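Because every read occupies exactly four lines, individual fields can be pulled out by line position. The sketch below uses a tiny made-up FASTQ file (two invented reads, written to a temporary directory) rather than the workshop data, and uses gzip -dc, which behaves like zcat on Linux.

```shell
# Build a two-read demonstration file; the reads are invented.
demo_dir=$(mktemp -d)
fq="$demo_dir/K1_R1.fastq.gz"
printf '@read1\nACGTAC\n+\nIIIIII\n@read2\nGGCA\n+\nII#I\n' | gzip -c > "$fq"

# Line 1 of each record: the identifier, with the leading @ removed.
read_ids=$(gzip -dc "$fq" | awk 'NR % 4 == 1 { print substr($0, 2) }')

# Line 2 of each record: the sequence; print its length.
seq_lengths=$(gzip -dc "$fq" | awk 'NR % 4 == 2 { print length($0) }')

# Total number of records is the line count divided by four.
n_records=$(gzip -dc "$fq" | awk 'END { print NR / 4 }')

echo "IDs:"
echo "$read_ids"
echo "Lengths:"
echo "$seq_lengths"
echo "Records: $n_records"

rm -rf "$demo_dir"
```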
Due to computational constraints, the files you are using are a subset of the original dataset (1 million read pairs per sample).
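One way to confirm the number of read pairs is to count the reads in the R1 and R2 files and check that they agree. The following is a sketch on invented three-read files in a temporary directory; count_reads is a hypothetical helper name. On the workshop files, each count would be expected to be 1000000 if the subset is as described.

```shell
# Count reads in a gzipped FASTQ file: four lines per read.
count_reads() {
  lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
  echo $((lines / 4))
}

# Demonstration data: identical three-read files for R1 and R2.
demo_dir=$(mktemp -d)
for read_id in R1 R2; do
  printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\nIIII\n' \
    | gzip -c > "$demo_dir/K1_${read_id}.fastq.gz"
done

r1_reads=$(count_reads "$demo_dir/K1_R1.fastq.gz")
r2_reads=$(count_reads "$demo_dir/K1_R2.fastq.gz")

# Paired files should contain the same number of reads.
if [ "$r1_reads" -eq "$r2_reads" ]; then
  pairs_ok=yes
else
  pairs_ok=no
fi
echo "R1: $r1_reads reads, R2: $r2_reads reads, matching: $pairs_ok"

rm -rf "$demo_dir"
```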
4 Checking quality control
We can assess sequence quality using FastQC. We will run FastQC on R1 and R2 reads separately, as they often show different quality profiles (R2 is typically lower quality).
4.1 Running FastQC
R1 FastQC
mkdir R1_fastqc
fastqc -t 3 -o R1_fastqc *R1.fastq.gz
R2 FastQC
mkdir R2_fastqc
fastqc -t 3 -o R2_fastqc *R2.fastq.gz
4.2 Summarising with MultiQC
Once FastQC has finished, we use MultiQC to combine the results into interactive HTML reports.
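Before running MultiQC, it can be worth confirming that FastQC wrote a report for every input file. The sketch below assumes FastQC's usual output naming, in which K1_R1.fastq.gz yields K1_R1_fastqc.html. It uses empty placeholder files in a temporary directory, so it illustrates the check rather than running FastQC itself.

```shell
# Placeholder inputs and reports standing in for real FastQC output.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/R1_fastqc"
for sample in K1 K2 W1; do
  : > "$demo_dir/${sample}_R1.fastq.gz"
  : > "$demo_dir/R1_fastqc/${sample}_R1_fastqc.html"
done

# For each R1 input, look for the matching HTML report.
n_checked=0
missing_reports=0
for fq in "$demo_dir"/*R1.fastq.gz; do
  name=$(basename "$fq" .fastq.gz)
  report="$demo_dir/R1_fastqc/${name}_fastqc.html"
  n_checked=$((n_checked + 1))
  if [ ! -e "$report" ]; then
    echo "No FastQC report for $fq"
    missing_reports=$((missing_reports + 1))
  fi
done
echo "Checked $n_checked inputs, $missing_reports reports missing"

rm -rf "$demo_dir"
```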
R1 MultiQC report
mkdir R1_fastqc/multiqc
multiqc -o R1_fastqc/multiqc R1_fastqc
R2 MultiQC report
mkdir R2_fastqc/multiqc
multiqc -o R2_fastqc/multiqc R2_fastqc
Viewing the reports
firefox R1_fastqc/multiqc/multiqc_report.html \
R2_fastqc/multiqc/multiqc_report.html &
Note that these commands produce a nested directory structure, with each MultiQC report directory inside the corresponding FastQC directory.
The & runs the command in the background, allowing you to continue working while Firefox is open.
This command is split across multiple lines using a bash escape \. When a line ends with \, press Enter to continue writing the command on the next line.
For more information, see the Intro to Unix materials:
https://neof-workshops.github.io/Unix_nxcdf7/Course/05-Tips_and_tricks.html#bash-escape
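The two shell features used in the Firefox command can be tried safely on harmless commands. This sketch uses echo and sleep purely as stand-ins.

```shell
# A trailing backslash joins the next line onto the current command,
# so this runs as: echo one two
joined=$(echo one \
  two)
echo "$joined"

# '&' starts a command in the background; $! holds its process ID.
sleep 1 &
bg_pid=$!
echo "Started background job $bg_pid; the prompt is free in the meantime"

# 'wait' pauses until the background job finishes and returns its exit status.
wait "$bg_pid"
bg_status=$?
echo "Background job finished with status $bg_status"
```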
4.3 Interpreting the reports
The FastQC results (viewed via MultiQC) include several useful metrics. The Sequence Quality Histograms show how quality changes along the length of the reads. Quality typically decreases toward the end of reads, especially for R2.
This behaviour is normal for Illumina sequencing. In the next chapter, we will improve read quality through trimming.
For more information on FastQC plots, see: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
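The quality values that FastQC plots come from line 4 of each FASTQ record, where each character encodes one score. As a rough illustration, the sketch below decodes an invented quality string, assuming the Phred+33 encoding (score = ASCII code minus 33) that is standard for current Illumina data. The string is made up to show quality falling at the end of a read.

```shell
# Invented quality string; assumed to be Phred+33 encoded.
qual='IIIIFF##'

# Convert each character to its ASCII code with od, then subtract 33.
scores=$(printf '%s' "$qual" | od -An -tu1 \
  | awk '{ for (i = 1; i <= NF; i++) printf "%s%d", (n++ ? " " : ""), $i - 33 }
         END { printf "\n" }')

# Mean quality across the read.
mean_q=$(printf '%s' "$qual" | od -An -tu1 \
  | awk '{ for (i = 1; i <= NF; i++) { sum += $i - 33; n++ } }
         END { printf "%.2f", sum / n }')

echo "Per-base scores: $scores"
echo "Mean quality: $mean_q"
```

Here I decodes to 40, F to 37 and # to 2, so the last two bases pull the mean down sharply.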
Once you have finished inspecting the reports, minimise the Firefox window.