Published on August 2nd, 2022 by Axel Martinelli
This short guide is meant for biologists with little or no coding or bioinformatic experience and no access to a bioinformatician. It is not meant to be comprehensive, but rather show the simplest way to convert FASTQ files into read counts using Galaxy for most standard NGS datasets.
Galaxy is a great website with many features, including the ability to design workflows via a visual interface that requires no programming skills. However, it does still require some familiarity with the tools to be used that many biologists may not have.
In this post, I will share my simple protocol for obtaining read count tables from FASTQ files. I should also clarify that Galaxy works well for small datasets, but can become quite inefficient as the number of samples increases. Hence, I suggest you try it only for experiments with no more than 30 samples.
Before you start, you will need to upload to Galaxy your FASTQ files, as well as a reference transcriptome for the species you are working with. Gencode offers complete transcriptomes that include both coding and non-coding transcripts, and I recommend using them where possible.
Below is the list of steps to convert FASTQ files into read count tables that I’ll discuss in this guide:
Let’s get started!
Most likely you will have multiple FASTQ files for the same sample that need to be combined. This can be achieved with the tool “Concatenate datasets”, which can be found under “Text Manipulation” in the “General Text Tools” menu (Fig. 1).
The process is straightforward, but you need to be aware that you must combine R1 files with R1 files only, R2 files with R2 files only and so on.
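To make the R1-with-R1 rule concrete, here is a minimal Python sketch of the same idea outside Galaxy. It is not part of the Galaxy protocol. The file naming pattern (`<sample>_L001_R1_001.fastq.gz`) and the helper names `group_by_read` and `concatenate` are assumptions for illustration, so adapt the pattern to your own files.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern: <sample>_L001_R1_001.fastq.gz (illustrative only).
READ_TAG = re.compile(r"_(R[123])(?=[_.])")


def group_by_read(paths):
    """Group FASTQ paths by read tag (R1, R2, R3), sorted within each group."""
    groups = defaultdict(list)
    for path in sorted(Path(p) for p in paths):
        match = READ_TAG.search(path.name)
        if match is None:
            raise ValueError(f"no R1/R2/R3 tag in file name: {path.name}")
        groups[match.group(1)].append(path)
    return dict(groups)


def concatenate(paths, destination):
    """Append the files byte by byte, in the given order, into destination."""
    with open(destination, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Sorting the file names before concatenating keeps the lanes in the same order for R1 and R2, which matters because paired reads must stay in matching positions in both files. Gzip-compressed files can be appended to each other in this way and still form a valid gzip file.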
Checking the FASTQ quality is a necessary step to properly prepare the data for analysis. For this purpose, I tend to use FastQC, which can be found under the “FASTQ Quality control” menu in “Genomic File Manipulation”.
FastQC generates read quality reports from your raw input FASTQ files, which you will need to check to decide whether to trim nucleotides off the ends of your reads. Usually, nucleotide quality deteriorates towards the end of a read and if the average Phred score is below 20 (the red area in Fig. 2), you may want to consider trimming nucleotides starting from the first position below that threshold.
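If you are curious what that threshold means in practice, the sketch below computes the average Phred score at each read position and reports the first position that falls below 20. It is a simplified illustration, not how FastQC is implemented, and it assumes four-line FASTQ records with the common Phred+33 quality encoding. The two reads are made up.

```python
def mean_quality_per_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position."""
    totals, counts = [], []
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


def first_position_below(means, threshold=20):
    """Return the 1-based position of the first mean below threshold, or None."""
    for pos, mean in enumerate(means, start=1):
        if mean < threshold:
            return pos
    return None


# Two made-up reads: 'I' encodes Phred 40, '5' is 20, '+' is 10, '#' is 2.
records = [
    "@read1", "ACGT", "+", "III+",
    "@read2", "ACGT", "+", "II5#",
]
means = mean_quality_per_position(records)
print(means)                        # [40.0, 40.0, 30.0, 6.0]
print(first_position_below(means))  # 4
```

In this toy example the quality drops below 20 at position 4, so that is where trimming would start.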
FastQC will also tell you if there are other quality issues with your data (such as high levels of sequence duplication or abnormal adapter presence) that may affect your data analysis. Running FastQC with default settings will work in most cases, but I suggest editing the output file names.
Once you have generated your quality reports on the FASTQ files, you will need to pre-process them (e.g. by trimming low-quality bases from the ends of the reads). For that purpose, I selected “fastp”, which you can find under the “FASTA/FASTQ” menu in “Genomic File Manipulation”.
I chose fastp because it is an all-in-one pre-processing tool, which suits my goal of generating the simplest possible workflow. When setting up fastp for your samples, you will need to specify if you are working with single end (SE) or paired end (PE) samples. You can also specify how many bases you want to trim from the ends of the reads, based on the FastQC reports.
If you have libraries with unique molecular identifiers (UMI), which you can recognise by the presence of three FASTQ files per sample (R1, R2 and R3) instead of two, you will also need to enable UMI processing and provide the UMI length. You should be able to infer UMI length by looking at the length of the sequences in the R2 file, if you have not gotten it from your sequencing service provider. As for the UMI location, in most cases the “per_read” option will work (Fig. 3). In that case you will only use the R1 and R3 files as input.
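As a rough illustration of how you might check the UMI length yourself, the sketch below tallies the sequence lengths in the first records of a FASTQ file. It assumes four-line records, the function name `sequence_lengths` is my own, and the two records are made up.

```python
from collections import Counter


def sequence_lengths(fastq_lines, max_reads=1000):
    """Count sequence lengths over the first max_reads records."""
    lengths = Counter()
    for index, line in enumerate(fastq_lines):
        if index // 4 >= max_reads:
            break
        if index % 4 == 1:  # the sequence is the 2nd line of each record
            lengths[len(line.strip())] += 1
    return lengths


# Made-up UMI reads of 8 bases each.
umi_records = [
    "@r1", "ACGTACGT", "+", "IIIIIIII",
    "@r2", "TTGCAACG", "+", "IIIIIIII",
]
print(sequence_lengths(umi_records).most_common(1))  # [(8, 2)]
```

For a real, compressed R2 file you could pass `gzip.open(path, "rt")` instead of the list. If nearly all reads share one length, that is most likely your UMI length.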
You can leave other options as per default, but you may want to edit the output file names.
You should then run FastQC again on the pre-processed output FASTQ files to ensure that they pass muster.
After cleaning the FASTQ files, we are now ready to generate the actual read counts per sample. While personally I tend to use Kallisto when developing my own pipelines, I opted for Salmon on Galaxy, which you can find as “Salmon quant” under the “RNA-seq” menu in “Genomic Analysis”.
The main reason for this choice is that with Salmon I can infer strandedness of the reads (Fig. 4), which in Kallisto I have to define before running the software. This is useful in case you don’t have that information or are not sure about how to extract it.
You still will have to define whether you are working with a SE or PE library, but otherwise you can run the program with default settings. Also, don’t forget to indicate your reference transcriptome!
Before you are ready to download the final file from Galaxy, you will need to do some editing.
First of all, you will need to edit the name of the column containing the raw read counts, with “Text Transformation with sed” under “Text Manipulation”. By default, this column is named “NumReads”. To alter it, use the following command line:
You should substitute “sample_name” with the sample ID you want to use. See Fig.5 for an example.
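For readers who prefer to do this renaming outside Galaxy, a rough Python equivalent is sketched below. It is not the sed command used in Galaxy, only an illustration of the same substitution applied to the header line. The column layout follows Salmon's quant output (Name, Length, EffectiveLength, TPM, NumReads); the data row is made up and `rename_count_column` is a name I chose.

```python
def rename_count_column(table_lines, sample_name, old_name="NumReads"):
    """Return the table with old_name replaced by sample_name in the header only."""
    lines = list(table_lines)
    header = lines[0].rstrip("\n").split("\t")
    if old_name not in header:
        raise ValueError(f"column {old_name!r} not found in header")
    header[header.index(old_name)] = sample_name
    return ["\t".join(header) + "\n"] + lines[1:]


quant = [
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n",
    "tx1\t1000\t850.0\t12.5\t42\n",  # made-up values
]
renamed = rename_count_column(quant, "sample_name")
print(renamed[0], end="")
```

Only the header is touched, so the counts themselves are left exactly as Salmon produced them.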
Next, the various sample read count tables need to be combined in a single file. In order to do so, you can use “Multi-join” under “Text Manipulation” in the “General text tools” menu. You need to indicate “1” as the common key column and “column: 5” as the column with the values you want to preserve. You will also need to indicate that the input files contain a header line and that you want to add a header line to the output file (Fig. 6). The other options can be left as per default.
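To show what this join does conceptually, here is a small Python sketch that merges several per-sample tables on the first column and keeps only the fifth. It is an illustration rather than the Galaxy tool's implementation. Filling missing values with "0" is my assumption, and the tables are made up.

```python
def multi_join(tables, key_col=0, value_col=4, filler="0"):
    """Join tables (lists of rows) on key_col, keeping value_col from each."""
    # Header: the key column name, then one value column name per table.
    header = [tables[0][0][key_col]] + [table[0][value_col] for table in tables]
    merged = {}
    for i, table in enumerate(tables):
        for row in table[1:]:
            # Create a row of fillers the first time a key is seen.
            values = merged.setdefault(row[key_col], [filler] * len(tables))
            values[i] = row[value_col]
    return [header] + [[key] + values for key, values in merged.items()]


# Made-up per-sample tables with the count column already renamed.
sample_a = [
    ["Name", "Length", "EffectiveLength", "TPM", "sampleA"],
    ["tx1", "100", "50", "1.0", "10"],
    ["tx2", "200", "150", "2.0", "20"],
]
sample_b = [
    ["Name", "Length", "EffectiveLength", "TPM", "sampleB"],
    ["tx1", "100", "50", "0.5", "5"],
    ["tx2", "200", "150", "0.7", "7"],
]
combined = multi_join([sample_a, sample_b])
for row in combined:
    print("\t".join(row))
```

Note that the code counts columns from 0, so `key_col=0` and `value_col=4` correspond to columns 1 and 5 in Galaxy.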
At this point you may be able to start analysing your data with an R pipeline. However, one of the (user-friendly) options available to you is to use Omics Playground, a platform we developed for transcriptomics data analysis.
If you opt to use Omics Playground for your analysis, note that it does not accept transcript IDs as input, so you will need to convert the Ensembl transcript IDs into gene symbols.
First of all, you will use “Replace Text in a specific column” under “Text Manipulation” to remove the version number after the full stop in each transcript ID, otherwise conversion will not work. To do so, select column 1, type “\.[0-9]+” in “Find pattern” and leave “Replace with” empty (see Fig. 7). This regular expression matches the full stop and the digits that follow it, so both are removed from the transcript ID.
You will now be able to convert the Ensembl transcript IDs to gene symbols (the input format for Omics Playground), using “annotateMyIDs” under “Text Manipulation”.
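If you want to see the effect of that pattern before running it on your table, you can try it in Python, whose regular expressions behave the same way for this simple case. The transcript ID below is just an example and `strip_version` is a name I chose.

```python
import re

# The same pattern as typed into Galaxy: a full stop followed by digits.
VERSION_SUFFIX = re.compile(r"\.[0-9]+")


def strip_version(transcript_id):
    """Remove the version suffix from a transcript ID."""
    return VERSION_SUFFIX.sub("", transcript_id)


print(strip_version("ENST00000456328.2"))  # ENST00000456328
```

IDs without a version suffix pass through unchanged.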
You will need to specify the organism you are working with, the ID type of the input file (“Ensembl Transcript”) and the output ID (Symbol) and make sure that the “File has header” option indicates “yes”, while other options can be left as default (Fig. 8). In particular, duplicates will be dealt with by Omics Playground, so you can leave them in the table.
This will produce a separate file containing the transcript IDs and the gene symbols (where available) side by side. You can now join the read counts table and the transcript ID conversion file with “Join two files” under “Text Manipulation”. Select column 1 as the shared column and output lines appearing in both files to be merged. Also, retain the first line as a header line. Use the file with the converted IDs as the 1st file and the file with the actual read counts as the 2nd file, as shown in Fig. 9.
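The logic of this join can be sketched in a few lines of Python. This is only an illustration of keeping the lines that appear in both files, with the conversion table as the first input. The header names and values are hypothetical, not the exact output of annotateMyIDs.

```python
def inner_join(mapping_rows, count_rows):
    """Join on column 1, keeping only IDs present in both tables."""
    counts = {row[0]: row[1:] for row in count_rows[1:]}
    joined = [mapping_rows[0] + count_rows[0][1:]]  # combined header line
    for row in mapping_rows[1:]:
        if row[0] in counts:
            joined.append(row + counts[row[0]])
    return joined


# Hypothetical conversion table and read count table.
mapping = [
    ["transcript_id", "symbol"],
    ["tx1", "GENE1"],
    ["tx3", "GENE3"],  # no counts for this ID, so it is dropped
]
read_counts = [
    ["Name", "sampleA", "sampleB"],
    ["tx1", "10", "5"],
    ["tx2", "20", "7"],  # no symbol for this ID, so it is dropped
]
joined = inner_join(mapping, read_counts)
for row in joined:
    print("\t".join(row))
```

Because the conversion table comes first, the gene symbol ends up in column 2, directly after the transcript ID and before the sample counts.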
At this point, only two things remain to be done.
The first is to remove the Ensembl transcript IDs, which are now redundant. You can use the operation “Discard” from the tool “Advanced Cut” to remove the first column (Column 1) containing the Ensembl IDs. For the other options, select “Tab” as the delimiter and “fields” for the option “cut by” (Fig. 10).
Unfortunately, Galaxy produces headers for the individual sample counts that contain a lot of gibberish characters before the actual sample name, as shown in Fig. 11.
To fix that, we can use the tool “Replace text in a specific column”. Unfortunately, we will have to repeat that individually for each column, starting with column 2. After selecting a column, type the following line “dataset_.+_” in the “Find Pattern” box and leave the “Replace with” box empty (Fig. 12). Repeat the process with the output moving to column 3 and continue until you reach the last sample column in your file.
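The sketch below applies the same pattern to a whole header line at once in Python, which may be handy if you decide to clean the file on your own computer instead. The header values are hypothetical and only meant to resemble the kind of prefix shown in Fig. 11.

```python
import re

# The same pattern as typed into Galaxy.
PREFIX = re.compile(r"dataset_.+_")


def clean_header(cells):
    """Strip the prefix from every sample column, leaving column 1 untouched."""
    return [cells[0]] + [PREFIX.sub("", cell) for cell in cells[1:]]


# Hypothetical header line after the join and cut steps.
header = ["symbol", "dataset_101.dat_liver1", "dataset_102.dat_liver_2"]
print(clean_header(header))  # ['symbol', 'liver1', '2']
```

Notice what happens to the second sample: in Python, “.+” matches as much as it can, so everything up to the last underscore is removed and “liver_2” is reduced to “2”. If your sample names contain underscores, check the result of this step carefully. A stricter pattern such as “dataset_[^_]+_” would keep “liver_2” in this example, assuming the unwanted prefix itself contains no further underscores.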
An alternative would be to download the file and edit the sample names with the help of a spreadsheet such as Excel. This will be quicker, but you need to be familiar with the “Find and replace” function in your spreadsheet software, as well as how regular expressions are implemented in it.
Assuming you stick to Galaxy, you can then download the output file after the final iteration of the “Replace Text in a specific column” tool and use that as the read count table input for Omics Playground. As a side note, if you plan to use the table as an input for Omics Playground, do ensure your sample names do not contain spaces or regex symbols (such as “/”, “.” or “+”). If you need to combine different names or IDs, use underscore (“_”) instead.
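If you have many sample names to check, a small helper like the one below can tidy them up before you rename your columns. It is a conservative sketch of my own that keeps only letters, digits and underscores, which is stricter than strictly necessary. The function name and the example names are made up.

```python
import re

# Anything other than letters, digits and underscores is treated as unsafe.
UNSAFE = re.compile(r"[^A-Za-z0-9_]+")


def sanitise(name):
    """Replace each run of unsafe characters with a single underscore."""
    return UNSAFE.sub("_", name.strip()).strip("_")


print(sanitise("liver + drug/24h"))  # liver_drug_24h
print(sanitise("sample.1"))          # sample_1
```

Make sure that two different samples do not end up with the same name after cleaning.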
Once you’ve uploaded the read count table, you can start visualizing your data easily. If you want to learn more about what your data would look like using Omics Playground, take a look at our case studies.
As a final note, you can also create simple workflows you can store for future use. This is not needed but can make your life easier.
Unfortunately, several of the steps I described here cannot easily be chained together within Galaxy, but you can create a workflow that passes the input FASTQ files for each sample through fastp, runs a FastQC analysis on the output and runs Salmon on the same output files (Fig. 13).
Before running it for an individual sample you will need to indicate parameters such as the input FASTQ files and the size of the UMI sequence. You will then repeat the procedure for each sample, which can be run in parallel.
Additionally, you can create a separate workflow for combining and editing the sample read count files, though iterative steps, such as editing the headers of the combined read count tables, need to be adapted for each experiment. Again, you will need to pre-define parameters such as the patterns that need replacing.
It usually requires some time and some coding knowledge to convert FASTQ files into read count tables. However, with Galaxy even users without a formal bioinformatic background can do so, following the steps I described above.