How to convert FASTQ files into read count tables using Galaxy

Published on August 2nd, 2022 by Axel Martinelli


This short guide is meant for biologists with little or no coding or bioinformatics experience and no access to a bioinformatician. It is not meant to be comprehensive, but rather to show the simplest way to convert FASTQ files into read counts using Galaxy for most standard NGS datasets.

Galaxy is a great website with many features, including the ability to design workflows via a visual interface that requires no programming skills. However, it does still require some familiarity with the tools to be used that many biologists may not have.

In this post, I will share my simple protocol for obtaining read count tables from FASTQ files. I should also clarify that Galaxy works well for small datasets, but can become quite inefficient as the number of samples increases. Hence, I suggest you use it only for experiments with no more than about 30 samples.

Before you start, you will need to upload your FASTQ files to Galaxy, together with a reference transcriptome for the species you are working with. GENCODE offers complete transcriptomes for human and mouse that include both coding and non-coding transcripts, and I recommend using them where possible.

Below is the list of steps to convert FASTQ files into read count tables that I’ll discuss in this guide:

  1. Concatenate FASTQ files
  2. Check FASTQ read quality with FastQC
  3. Use fastp to pre-process FASTQ files
  4. Estimate transcript abundance with Salmon
  5. Format the tables for downstream analysis
  6. Create Workflows (optional)

Let’s get started!


Step #1: Concatenate FASTQ files

Most likely you will have multiple FASTQ files for the same sample that need to be combined. You can do this with the tool “concatenate datasets”, which can be found under “General text Tools” under the “Text Manipulation” menu (Fig. 1).

The process is straightforward, but you need to be aware that you must combine R1 files with R1 files only, R2 files with R2 files only and so on.

Figure 1. Concatenating FASTQ files with “concatenate datasets”.
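For readers curious about what this step does under the hood, here is a minimal Python sketch of the same idea outside Galaxy. It is only an illustration: the function names are made up for this example, and it assumes uncompressed files whose names contain "_R1" or "_R2" (a hypothetical naming scheme, so adapt the matching to your own files). Files are sorted so that R1 and R2 parts are appended in the same order, which keeps paired reads aligned.

```python
import shutil
from pathlib import Path


def concatenate_fastq(parts, output_path):
    """Append the given FASTQ files, in order, into one output file."""
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def concatenate_by_read(files, sample, out_dir):
    """Concatenate R1 files with R1 files and R2 files with R2 files.

    Assumes (hypothetically) that file names contain "_R1" or "_R2".
    Returns a dict mapping the read label to the merged file path.
    """
    out_dir = Path(out_dir)
    merged = {}
    for read in ("R1", "R2"):
        # Sort so that both read directions are merged in the same order.
        parts = sorted(str(f) for f in files if f"_{read}" in Path(f).name)
        if parts:
            target = out_dir / f"{sample}_{read}.fastq"
            concatenate_fastq(parts, target)
            merged[read] = target
    return merged
```

For a UMI library you would add "R3" to the tuple of read labels in the same way.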

Step #2: Check FASTQ read quality with FastQC

Checking the FASTQ quality is a necessary step to properly prepare the data for analysis. For this purpose, I tend to use FastQC, which can be found under the “FASTQ Quality control” menu in “Genomic File Manipulation”.

FastQC generates read quality reports from your raw input FASTQ files, which you will need to check to decide whether to trim nucleotides off the ends of your reads. Usually, nucleotide quality deteriorates towards the end of a read, and if the average Phred score drops below 20 (the red area in Fig. 2), you may want to consider trimming nucleotides starting from the first position below that threshold.

Figure 2. FastQC base sequence quality plot.
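To make the threshold rule concrete, the sketch below computes the mean Phred score at each read position and reports the first position that falls below 20. It is a simplified illustration rather than a reimplementation of FastQC: the function names are hypothetical, and it assumes the Phred+33 quality encoding used by current Illumina instruments.

```python
def mean_quality_per_position(quality_lines, offset=33):
    """Return the mean Phred score at each position across all reads.

    quality_lines: the quality strings (4th line of each FASTQ record).
    offset: ASCII offset of the encoding (33 is assumed here).
    """
    totals = []
    counts = []
    for line in quality_lines:
        for i, char in enumerate(line.strip()):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


def first_position_below(means, threshold=20):
    """Return the 1-based position of the first mean below threshold, or None."""
    for position, mean in enumerate(means, start=1):
        if mean < threshold:
            return position
    return None
```

If the second function returns, say, position 5 for reads of length 5, trimming one base from the end of each read would be a reasonable starting point.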

FastQC will also tell you if there are other quality issues with your data (such as high levels of sequence duplication or abnormal adapter presence) that may affect your data analysis. Running FastQC with default settings will work in most cases, but I suggest editing the output file names.

Step #3: Use fastp to pre-process FASTQ files

Once you have generated your quality reports on the FASTQ files, you will need to pre-process them (e.g. by trimming low-quality bases at the ends of the reads). For that purpose, I selected “fastp”, which you can find under the “FASTA/FASTQ” menu in “Genomic File Manipulation”.

I chose fastp because it is an all-in-one pre-processing tool, which suits my goal of generating the simplest possible workflow. When setting up fastp for your samples, you will need to specify if you are working with single end (SE) or paired end (PE) samples. You can also specify how many bases you want to trim from the ends of the reads, based on the FastQC reports. 

If you have libraries with unique molecular identifiers (UMIs), which you can recognise by the presence of three FASTQ files per sample (R1, R2 and R3) instead of two, you will also need to enable UMI processing and provide the UMI length. If your sequencing service provider has not told you the UMI length, you should be able to infer it from the length of the sequences in the R2 file. As for the UMI location, in most cases the “per_read” option will work (Fig. 3). In that case you will only use the R1 and R3 files as input.

You can leave other options as per default, but you may want to edit the output file names.
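If you would rather not check the R2 sequence lengths by eye, the following sketch shows one way to do it in Python. It assumes a standard four-line FASTQ layout and takes the most common sequence length as the UMI length; the function names are illustrative and not part of fastp or Galaxy.

```python
from collections import Counter


def sequence_lengths(fastq_lines):
    """Count sequence lengths; the sequence is line 2 of each 4-line record."""
    lengths = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:
            lengths[len(line.strip())] += 1
    return lengths


def infer_umi_length(fastq_lines):
    """Return the most common sequence length found in the UMI (R2) file."""
    lengths = sequence_lengths(fastq_lines)
    if not lengths:
        raise ValueError("no FASTQ records found")
    return lengths.most_common(1)[0][0]
```

Any iterable of lines works as input, for example an open file handle (or one opened with gzip.open(path, "rt") for compressed files).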

Figure 3. UMI processing options in fastp. The easiest approach is to indicate a “per_read” UMI location, while UMI length can be determined based on the R2 FASTQ file. A UMI prefix is not required.

You should then run FastQC again on the pre-processed output FASTQ files to ensure that they pass muster.

Step #4: Estimate transcript abundance with Salmon

After cleaning the FASTQ files, we are now ready to generate the actual read counts per sample. While personally I tend to use Kallisto when developing my own pipelines, I opted for Salmon on Galaxy, which you can find as “Salmon quant” under the “RNA-seq” menu in “Genomic Analysis”. 

The main reason for this choice is that with Salmon I can infer strandedness of the reads (Fig. 4), which in Kallisto I have to define before running the software. This is useful in case you don’t have that information or are not sure about how to extract it.

You will still have to define whether you are working with a SE or PE library, but otherwise you can run the program with default settings. Also, don’t forget to indicate your reference transcriptome!

Figure 4. The Galaxy interface for running Salmon on FASTQ files. Strandedness can be inferred automatically, which makes it the most user-friendly option.

Step #5: Format the tables for downstream analysis

Before you are ready to download the final file from Galaxy, you will need to do some editing.

First of all, you will need to edit the name of the column containing the raw read counts, using “Text Transformation with sed” under “Text Manipulation”. By default, this column is named “NumReads”. To alter it, use the following sed expression:

s/NumReads/sample_name/g

You should substitute “sample_name” with the sample ID you want to use. See Fig. 5 for an example.

Figure 5. Renaming raw read counts column name with sed via the “Text Transformation” tool. Note the usage of sed. Advanced options can be ignored.
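For comparison, here is a rough Python equivalent of this renaming step. It is a sketch with a hypothetical function name, and it differs from the sed expression in one respect: it only touches the header line, whereas the sed expression replaces “NumReads” wherever it occurs in the file. For a Salmon output table the result should be the same, since the word normally appears only in the header.

```python
def rename_counts_column(quant_text, sample_name, old_name="NumReads"):
    """Rename the read count column in the header of a tab-separated table.

    quant_text: full content of one Salmon quantification table.
    sample_name: the sample ID to use as the new column name.
    """
    header, newline, body = quant_text.partition("\n")
    columns = [sample_name if col == old_name else col
               for col in header.split("\t")]
    return "\t".join(columns) + newline + body
```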

Next, the various sample read count tables need to be combined in a single file. To do so, you can use “Multi-join” under “Text Manipulation” in the “General text tools” menu. You need to indicate “1” as the common key column (the transcript IDs) and “column: 5” as the column with the values you want to preserve (the read counts you renamed in the previous step). You will also need to indicate that the input files contain a header line and that you want to add a header line to the output file (Fig. 6). The other options can be left as per default.

Figure 6. Parameters for joining multiple read count files with “Multi-Join”. The “Common key column” indicates the transcript IDs you want to retain (column 1). Only column 5 (the raw read counts) will be retained. Also activate both “Add header line to output file” and “Input file contains a header line” options.
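The logic of this join can be sketched in a few lines of Python. This is an illustration only, not the code behind the Galaxy tool: the function name is made up, column numbers are 0-based (so Galaxy's columns 1 and 5 become 0 and 4), and filling missing transcripts with "0" is a choice made for this example. When all samples were quantified against the same transcriptome, every table lists the same transcript IDs and no filling is needed.

```python
import csv
import io


def join_counts(tables, key_col=0, value_col=4, filler="0"):
    """Merge one value column from several tab-separated tables on a key column.

    tables: list of strings, each a table with a header line.
    Returns the merged table as a tab-separated string.
    """
    header = None
    per_table = []   # one {key: value} dict per input table
    keys = []        # keys in order of first appearance
    seen = set()
    for text in tables:
        rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
        head, body = rows[0], rows[1:]
        if header is None:
            header = [head[key_col]]
        header.append(head[value_col])
        values = {}
        for row in body:
            key = row[key_col]
            values[key] = row[value_col]
            if key not in seen:
                seen.add(key)
                keys.append(key)
        per_table.append(values)
    lines = ["\t".join(header)]
    for key in keys:
        lines.append("\t".join([key] + [v.get(key, filler) for v in per_table]))
    return "\n".join(lines) + "\n"
```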

At this point you may be able to start analysing your data with an R pipeline. However, one of the (user-friendly) options available to you is to use Omics Playground, a platform we developed for transcriptomics data analysis.

If you opt to use Omics Playground for your analysis, note that transcript IDs are not accepted as input, so you will need to convert Ensembl transcript IDs into gene symbols.

First of all, use “Replace Text in a specific column” under “Text manipulation” to remove the version number after the full stop in each transcript ID, otherwise the conversion will not work. To do so, select column 1, type “\.[0-9]+” in “find pattern” and leave “replace with” empty (see Fig. 7). This regular expression matches the full stop and the digits that follow it, so both are removed from the transcript ID.

Figure 7. Editing transcript IDs with “Replace Text” in preparation for conversion to gene symbols. The replacement takes place in column 1. Note the regular expression usage under “Find Pattern”. Leave “Replace with” empty.
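If you want to see what the regular expression does before running it in Galaxy, you can try it on a single ID in Python. The function name below is hypothetical; the pattern is the one used above.

```python
import re

# Same pattern as in the Galaxy tool: a full stop followed by one or more digits.
VERSION_SUFFIX = re.compile(r"\.[0-9]+")


def strip_version(transcript_id):
    """Remove the version suffix from an Ensembl transcript ID."""
    return VERSION_SUFFIX.sub("", transcript_id)


print(strip_version("ENST00000456328.2"))
```

Note that the pattern is not anchored to the end of the ID, so it removes every "full stop plus digits" run it finds. That is fine for plain Ensembl transcript IDs, which contain only one.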

You will now be able to convert the Ensembl transcript IDs to gene symbols (the input format for Omics Playground), using “annotateMyIDs” under “Text Manipulation”.

You will need to specify the organism you are working with, the ID type of the input file (“Ensembl Transcript”) and the output ID (Symbol) and make sure that the “File has header” option indicates “yes”, while other options can be left as default (Fig. 8). In particular, duplicates will be dealt with by Omics Playground, so you can leave them in the table.

Figure 8. Converting transcript IDs to gene symbols with “annotateMyIDs”. The “ID type” option is “Ensembl Transcripts”, while in the output columns only “SYMBOL” is checked. Also note that you need to specify the organism you are working with. In this case the choice is human, but if you work with different species, you will need to alter that.

This will produce a separate file containing the transcript IDs and the gene symbols (where available) side by side. You can now join the read counts table and the transcript ID conversion file with “Join two files” under “Text Manipulation”. Select column 1 as the shared column in both files and choose to output only the lines that appear in both files. Also, retain the first line as a header line. Use the file with the converted IDs as the 1st file and the file with the actual read counts as the 2nd file, as shown in Fig. 9.

Figure 9. Joining the combined read count table with the transcript ID to gene symbol conversion file with “Join”. Column 1 will be chosen for both files, with output lines appearing in both files as well. Remember to check “First line is a header line” too. The other options can be left as per default.
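In database terms, this is an inner join on the first column. The sketch below shows the idea in Python with a hypothetical function name; the column headers used in the usage example are placeholders, since the exact headers depend on the tools that produced your files.

```python
def inner_join(first_text, second_text):
    """Join two tab-separated tables on their first column.

    Only keys present in both tables are kept. The first line of each
    table is treated as a header. Rows follow the order of the first table.
    """
    def parse(text):
        return [line.split("\t") for line in text.splitlines() if line]

    first, second = parse(first_text), parse(second_text)
    lookup = {row[0]: row[1:] for row in second[1:]}
    joined = [first[0] + second[0][1:]]
    for row in first[1:]:
        if row[0] in lookup:
            joined.append(row + lookup[row[0]])
    return "\n".join("\t".join(row) for row in joined) + "\n"
```

With the ID conversion file as the first table and the read counts as the second, the output has the transcript ID in column 1, the gene symbol in column 2 and the sample counts after that.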

At this point, only two things remain to be done.

The first is to remove the Ensembl transcript IDs, which are now redundant. You can use the operation “Discard” from the tool “Advanced Cut” to remove the first column (Column 1) containing the Ensembl IDs. For the other options, select “Tab” as the delimiter and “fields” for the option “cut by” (Fig. 10).

Figure 10. Remove the first column with the Ensembl transcript IDs from the joined file with the read counts and IDs using the “Advanced Cut” tool from Galaxy. The operation to choose is “Discard”. Delimiter is by default tab and the file will be cut by fields. You will indicate Column 1 in the “List of fields”.
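The same "discard column 1, cut by tab-separated fields" operation can be sketched in Python as follows (the function name is illustrative):

```python
def discard_first_column(text, delimiter="\t"):
    """Drop the first field of every line in a delimited table."""
    lines = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        lines.append(delimiter.join(fields[1:]))
    return "\n".join(lines) + "\n"
```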

Unfortunately, Galaxy produces headers for the individual sample counts that contain a lot of gibberish characters before the actual sample name, as shown in Fig. 11.

Figure 11. Example of the header names in the final read count table file produced by Galaxy. A rather long prefix is prepended to the actual sample names (in this case WT1 and WT2).

To fix that, we can use the tool “Replace text in a specific column”. Unfortunately, we will have to repeat the step individually for each column, starting with column 2. After selecting a column, type the pattern “dataset_.+_” in the “Find Pattern” box and leave the “Replace with” box empty (Fig. 12). Then repeat the process on the output, moving to column 3, and continue until you reach the last sample column in your file.

Figure 12. Edit the header names for each of the sample read counts using “Replace text in a specific column”. This step needs to be reiterated for each sample individually, starting with the first sample in column 2.
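The sketch below applies the same pattern to every sample column of a header line in one pass, which is what the repeated Galaxy step achieves. The function name and the example headers are hypothetical (the real Galaxy prefixes look different, see Fig. 11). One thing worth knowing is that “.+” is greedy: it removes everything up to the last underscore in the column name, so a sample name that itself contains an underscore would be cut short.

```python
import re

# Same pattern as in the Galaxy step; ".+" is greedy.
PREFIX = re.compile(r"dataset_.+_")


def clean_header(header_line, delimiter="\t"):
    """Strip the prefix from every column name except the first one."""
    columns = header_line.rstrip("\n").split(delimiter)
    cleaned = [columns[0]] + [PREFIX.sub("", col) for col in columns[1:]]
    return delimiter.join(cleaned)
```

For example, a hypothetical header "dataset_0a1b_WT1" becomes "WT1", but "dataset_0a1b_WT_1" becomes "1", so check the cleaned names before moving on.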

An alternative would be to download the file and edit the sample names in a spreadsheet program such as Excel. This will be quicker, but you need to be familiar with the “Find and replace” function in your spreadsheet software, as well as with how it handles pattern matching (wildcards or regular expressions).

Assuming you stick to Galaxy, you can then download the output file after the final iteration of the “Replace Text in a specific column” tool and use that as the read count table input for Omics Playground. As a side note, if you plan to use the table as an input for Omics Playground, do ensure your sample names do not contain spaces or special characters (such as “/”, “.”, “+”, etc.). If you need to combine different names or IDs, use an underscore (“_”) instead.
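A quick way to check sample names before uploading is sketched below. Note that the rule it applies (letters, digits and underscores only) is an assumption made for this example and is stricter than the advice above; it is not an official Omics Playground specification.

```python
import re

# Assumed rule for this sketch: letters, digits and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def unsafe_sample_names(names):
    """Return the sample names that contain spaces or special characters."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


print(unsafe_sample_names(["WT1", "WT 2", "KO.1", "KO_1", "a/b"]))
```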

Once you’ve uploaded the read count table, you can start visualizing your data easily. If you want to learn more about what your data would look like using Omics Playground, take a look at our case studies.

Step #6: Create Workflows (optional)

As a final note, you can also create simple workflows you can store for future use. This is not needed but can make your life easier. 

Unfortunately, several of the steps I described here cannot easily be chained together within Galaxy, but you can create a workflow that passes the input FASTQ files for each sample through fastp, runs a FastQC analysis on the outcome and runs Salmon on the output files (Fig. 13).

Before running it for an individual sample you will need to indicate parameters such as the input FASTQ files and the size of the UMI sequence. You will then repeat the procedure for each sample, which can be run in parallel. 

Additionally, you can create a separate workflow for combining and editing the sample read count files, though iterative steps, such as editing the headers of the combined read count tables, need to be adapted for each experiment. Again, you will need to pre-define parameters such as the patterns that need replacing.

Figure 13. An example of a Galaxy workflow combining pre-processing via fastp, pre-processed files quality check via FastQC and transcript abundance estimation via Salmon.

Conclusion: How to convert FASTQ files into read count tables using Galaxy

It usually requires some time and some coding knowledge to convert FASTQ files into read count tables. However, with Galaxy even users without a formal bioinformatics background can do so by following the steps I described above.

Was this guide helpful? Would you like us to write more about a specific topic? Let us know in the comments below!

