# Quality control and trimming of reads¶

For our purposes, let’s create a project directory for all of our files for the transcriptomics component of the workshop

Update the machine and install some software:

sudo bash
cd /mnt
apt-get update
apt-get install -y git trimmomatic fastqc bowtie bwa build-essential python2.7-dev python-numpy python-htseq default-jre r-base r-base-dev r-bioc-edger


Make sure you’re in your home directory and create the folders:

cd ~
mkdir TranscriptomicsWorkshop Desktop


Great! Now let’s change directory into the folder, create a few more folders and the first README:

cd TranscriptomicsWorkshop
mkdir RawData QC Alignment TranscriptAbund DiffExp
mkdir QC/Trimmomatic QC/FastQC


You are now in the terminal text editor called “nano”. Enter a couple sentences describing what the project will entail, save (Ctrl + O and then hit Enter/Return), and exit (Ctrl + X).

wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/gly7a.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/gly7b.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/gly5a.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/gly5b.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/pyr7a.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/pyr7b.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/pyr5a.fq.gz
wget http://s3-us-west-1.amazonaws.com/microgenomicstranscriptomics/pyr5b.fq.gz


This is a subset of data from Baker et al.. It contains the first 100,000 reads from a study looking at how M. tuberculosis changes its metabolism in response to different carbon sources at neutral and acidic pH.

The data are in FASTQ format and look similar to the reads we used yesterday.

Before we move on, let’s create another README in the data folder and copy and paste the above lines describing the data, how we got the data, and include the date. Something like this:

cd RawData


Date data was downloaded: 9-25-15
Data description:

This is a subset of data from Baker et al. 2013 Mol. Micro. It contains the first
100,000 sequences from a study looking at how M. tuberculosis changes its
metabolism in response to different carbon sources at neutral and acidic pH.

git clone http://www.github.com/biobenkj/sample_data RawData

Data files:

find *.fq.gz

gly5a.fq.gz
gly5b.fq.gz
gly7a.fq.gz
gly7b.fq.gz
pyr5a.fq.gz
pyr5b.fq.gz
pyr7a.fq.gz
pyr7b.fq.gz


Great! Let’s move on to quality control and trimming our reads.

Since we covered Trimmomatic and FastQC yesterday, we will move through this section quickly, but pause for considerations in using FastQC for RNA-seq data.

## Trimmomatic¶

You should be in the RawData folder currently. If you aren’t:

cd ~/TranscriptomicsWorkshop/RawData


Trim the reads by scripting! (What is happening in the first line? What are the parameters to trim? What in the world are we doing in the first place?!):

TRIM=/mnt/Trimmomatic-0.33
for untrimmedreads in *.fq.gz
do
#get the file name
FILENAME=basename ${untrimmedreads%.*.*} #set a prefix to make understanding what has been done to the file easier PREFIX=trimmed #create the new file name (e.g. trimmedgly7a.fq.gz) NEWTRIMFILE=${PREFIX}${FILENAME} #do the trimming java -jar$TRIM/trimmomatic-0.33.jar SE $untrimmedreads ../QC/Trimmomatic/$NEWTRIMFILE.fq.gz \
ILLUMINACLIP:$TRIM/adapters/TruSeq3-SE.fa:2:30:10 \ LEADING:3 \ TRAILING:3 \ SLIDINGWINDOW:4:15 \ MINLEN:36 done  ## FastQC¶ Now let’s generate the FastQC reports (What does that -o . thing do?): cd ../FastQC for trimmedreads in ../Trimmomatic/*.fq.gz do fastqc -o .$trimmedreads
done


It would be a good idea to make README files here to describe what happened, what script we used, etc. But for now, in light of time, we will forge ahead. Just wanted to make the point ;) Let’s talk about what this script is doing so we know how we can script the analysis for all the files.

Look at the FastQC reports:

How does the data look? Any issues? Have a look at some of the FastQC documentation for the things that may not look “good” (e.g. orange and red flags for certain metrics) in relation to RNA-seq data. It’s important to remember that FastQC was originally made for genome sequences.

Another tool that produces a pretty neat 3-D plot is called FaQCs.

This is what it looks like:

FaQCs takes a rather long time to run, but produces a nice pdf report when it’s done.

LICENSE: This documentation and all textual/graphic site content is licensed under the Creative Commons - 0 License (CC0) -- fork @ github. Presentations (PPT/PDF) and PDFs are the property of their respective owners and are under the terms indicated within the presentation.