Raw data

First, we need a dataset to work with. The European Bioinformatics Institute (EBI) provides an excellent metagenomics resource (https://www.ebi.ac.uk/metagenomics/) that allows users to download publicly available metagenomic and metagenetic datasets.

Have a browse of some of the projects by selecting one of the biomes on the website.

We have selected a dataset from this site that consists of DNA shotgun data generated from 24 human faecal samples. Twelve of these samples are from subjects who were fed a Western diet and twelve are from subjects who were fed a Korean diet. The project page is:

https://www.ebi.ac.uk/metagenomics/projects/ERP005558

1 Obtaining the data

First, we need to create a directory for the data and move into it. We then download the FASTQ files into that directory with wget.

mkdir 1-Raw
cd 1-Raw

wget -r -np -nH --cut-dirs=3 -A "*.fastq.gz" https://cgr.liv.ac.uk/454/acdarby/LIFE750/
# Stay in 1-Raw: the commands in the following sections are run from this directory

What this wget command does:

  • -r – recursive download
  • -np – no parent directory traversal
  • -nH – do not create the host directory (cgr.liv.ac.uk)
  • --cut-dirs=3 – strip 454/acdarby/LIFE750 from the local path
  • -A "*.fastq.gz" – accept only files whose names match *.fastq.gz

There should be six files in the directory, two for each sample in the dataset (e.g. K1_R1.fastq.gz).
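One quick way to confirm the download is to count the matching files. The following is a minimal sketch: count_fastq is a hypothetical helper, not part of the course materials, and it assumes the files sit directly inside the directory it is given.

```shell
# Hypothetical helper: count the *.fastq.gz files directly inside a directory
# (defaults to the current directory).
count_fastq() {
    find "${1:-.}" -maxdepth 1 -type f -name '*.fastq.gz' | wc -l | tr -d ' '
}

# Run from inside 1-Raw; six is the expected value for this dataset.
count_fastq .
```

If the number is lower than six, re-running the wget command is a reasonable first step.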

2 FASTQ files in this dataset

2.1 File naming structure

The file ID has three components:

  • K1 → sample ID
  • R1 → forward read in the Illumina read pair
    • R2 corresponds to the reverse read
  • fastq.gz → gzipped FASTQ file
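The naming structure above can be taken apart with shell parameter expansion, which is handy when looping over samples later. This is only an illustrative sketch; the variable names (file, base, sample, read_id) are made up for the example.

```shell
# Example file name taken from the text above
file="K1_R1.fastq.gz"

base="${file%.fastq.gz}"   # strip the extension     -> K1_R1
sample="${base%_*}"        # text before the last _  -> K1
read_id="${base##*_}"      # text after the last _   -> R1

echo "sample=${sample} read=${read_id}"
```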

2.2 Samples included

  • K1 → Faecal sample from an individual on a Korean diet
  • K2 → Faecal sample from an individual on a Korean diet
  • W1 → Faecal sample from an individual on a Western diet

3 Paired-end reads

In Illumina sequencing, the vast majority of reads are paired-end. DNA is first fragmented, and both ends of each fragment are sequenced.

This results in two sequences for each fragment:

  • R1: one end of the fragment
  • R2: the opposite end of the fragment

FASTQ is a sequence format similar to FASTA, with the addition of quality scores.

To inspect the first few lines of a FASTQ file:

zcat K1_R1.fastq.gz | head -n 4 | less -S

The pipe symbol (|) passes the output of one command as input to the next. This command:

1.  Decompresses the FASTQ file to standard output (zcat)
2.  Keeps the first four lines (head -n 4)
3.  Displays them in a pager without line wrapping (less -S)

These four lines represent a single FASTQ entry (one read):

  • Line 1: read identifier
  • Line 2: nucleotide sequence
  • Line 3: secondary header (usually +)
  • Line 4: quality scores, one character for each base in Line 2

To return to the command prompt, press q.
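Each character on the quality line (Line 4) encodes one quality score. The sketch below decodes a single character, assuming the Phred+33 encoding used by current Illumina instruments. That encoding is an assumption about this dataset and is not stated in the text above.

```shell
# Decode one quality character, assuming Phred+33 encoding.
qual_char='I'
ascii=$(printf '%d' "'$qual_char")   # ASCII code of the character (I -> 73)
phred=$((ascii - 33))                # Phred quality score (73 - 33 = 40)
echo "$qual_char -> Q$phred"
```

A score of Q40 corresponds to an estimated error probability of 1 in 10,000 for that base.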

Due to computational constraints, the files you are using are a subset of the original dataset (1 million read pairs per sample).
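Because every read occupies four lines, the number of reads in a file is its line count divided by four, and the R1 and R2 files of a sample should give the same number. The following is a rough sketch of that check. count_reads is a hypothetical helper, and gzip -dc is used here as an equivalent of zcat.

```shell
# Hypothetical helper: number of reads in a gzipped FASTQ file.
# Assumes each record spans exactly four lines, as described above.
count_reads() {
    lines=$(gzip -dc "$1" | wc -l | tr -d ' ')
    echo $((lines / 4))
}

# Compare the two files of one sample (run from inside 1-Raw).
if [ -f K1_R1.fastq.gz ] && [ -f K1_R2.fastq.gz ]; then
    r1=$(count_reads K1_R1.fastq.gz)
    r2=$(count_reads K1_R2.fastq.gz)
    echo "K1: R1=${r1} R2=${r2}"
    [ "$r1" -eq "$r2" ] || echo "Warning: R1 and R2 read counts differ"
fi
```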

4 Checking read quality

We can assess sequence quality using FastQC. We will run FastQC on R1 and R2 reads separately, as they often show different quality profiles (R2 is typically lower quality).

4.1 Running FastQC

R1 FastQC

mkdir R1_fastqc
fastqc -t 3 -o R1_fastqc *R1.fastq.gz

R2 FastQC

mkdir R2_fastqc
fastqc -t 3 -o R2_fastqc *R2.fastq.gz
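After both runs, it can be worth confirming that a report exists for every input file. The sketch below assumes FastQC's default output naming (for example, K1_R1.fastq.gz producing K1_R1_fastqc.html). missing_reports is a hypothetical helper written for this example.

```shell
# Hypothetical helper: list inputs that have no FastQC HTML report yet.
# Assumes FastQC's default naming, e.g. K1_R1.fastq.gz -> K1_R1_fastqc.html.
missing_reports() {
    outdir=$1
    shift
    for fq in "$@"; do
        name=$(basename "$fq" .fastq.gz)
        [ -f "${outdir}/${name}_fastqc.html" ] || echo "$fq"
    done
}

# Example: check the R1 run (prints nothing if every report exists).
missing_reports R1_fastqc *R1.fastq.gz
```

The same call with R2_fastqc and *R2.fastq.gz covers the reverse reads.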

4.2 Summarising with MultiQC

Once FastQC has finished, we use MultiQC to combine the results into interactive HTML reports.

R1 MultiQC report

mkdir R1_fastqc/multiqc
multiqc -o R1_fastqc/multiqc R1_fastqc

R2 MultiQC report

mkdir R2_fastqc/multiqc
multiqc -o R2_fastqc/multiqc R2_fastqc

Viewing the reports

firefox R1_fastqc/multiqc/multiqc_report.html \
R2_fastqc/multiqc/multiqc_report.html &

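Note that these commands produce a nested directory structure: each MultiQC report is written to a multiqc subdirectory inside the corresponding FastQC output directory.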

The & runs the command in the background, allowing you to continue working while Firefox is open.

The command is split across two lines using a bash escape (\). When a line ends with \, pressing Enter continues the command on the next line instead of running it.
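Both features can be tried safely with harmless commands. This is a small illustrative sketch that uses echo and sleep in place of Firefox.

```shell
# A backslash at the end of a line continues the command on the next line,
# so this is read as: echo one two
msg=$(echo one \
    two)
echo "$msg"

# An ampersand runs a command in the background; the shell moves on at once.
sleep 1 &
echo "started background job with process ID $!"
wait    # optional: pause here until the background job has finished
```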

For more information, see the Intro to Unix materials:

https://neof-workshops.github.io/Unix_nxcdf7/Course/05-Tips_and_tricks.html#bash-escape

4.3 Interpreting the reports

The FastQC results (viewed via MultiQC) include several useful metrics. The Sequence Quality Histograms show the mean quality score at each position along the reads. Quality typically decreases toward the end of reads, especially for R2.

This behaviour is normal for Illumina sequencing. In the next chapter, we will improve read quality through trimming.

For more information on FastQC plots, see: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Once you have finished inspecting the reports, minimise the Firefox window.