Published on February 22nd, 2023 by BigOmics
Last updated on June 13, 2023
Omics Playground is a cloud-based platform designed to provide advanced tools for the analysis of RNA-Seq and proteomics data sets. To use the platform, the RNA-Seq and proteomics input data need to be in a structured format.
In this tutorial, we will guide you through preparing your input files: the counts, samples, and contrasts files, plus the most common formatting errors and how to avoid them.
Let’s get started!
You can upload your data through the platform interface from the “Upload data” panel located under the Home Panel module (Figure 1).
The platform requires the input files in CSV (comma-separated values) text format.
If you have FASTQ files, you need to convert them into read counts first. One way of doing this is with Galaxy, from which you can download the output file and use it as the read count table for Omics Playground. Read more on how to convert FASTQ files into read counts here.
Once done, you’ll need to upload the following input files:
– counts.csv: count/expression file with genes on rows, and samples as columns.
– samples.csv: samples file with samples on rows, and phenotypes as columns.
– contrasts.csv: contrast file with samples on rows, and conditions as columns.
Now, let’s look at an example for each input file.
The count/expression file can be prepared with any spreadsheet software (such as Excel) or through a script outputting csv format files.
The first column contains gene IDs, which can be in most formats (such as HGNC or Ensembl), but not in the Entrez number format. If you are using the latter, you will need to convert the IDs with a tool such as SynGO.
Also note that the platform will not accept transcript IDs. You will need to convert them to Gene IDs. This will result in multiple gene entries that the platform will merge.
The first row contains the sample names. The first cell of the first row should be left empty. Sample names should follow the naming rules discussed below.
The individual cells in the other columns contain the raw or normalised read counts for the dataset. The values should always be numerical, with the exception of “NA” in case of a lack of data.
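The layout described above can be sketched with a short script. This is a minimal illustration using only Python's standard library; the gene names, sample names, and count values are made up, and `counts_to_csv` is a hypothetical helper, not part of Omics Playground.

```python
import csv
import io

# Hypothetical sample names and counts, for illustration only.
samples = ["control_1", "control_2", "treated_1"]
counts = {
    "TP53": [120, 98, 310],
    "BRCA1": [15, None, 22],  # None marks a missing measurement
}

def counts_to_csv(samples, counts):
    """Return CSV text with an empty first header cell and 'NA' for missing values."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The first cell of the first row stays empty; sample names follow.
    writer.writerow([""] + samples)
    for gene, values in counts.items():
        # Gene ID in the first column, then one numeric value (or NA) per sample.
        writer.writerow([gene] + ["NA" if v is None else v for v in values])
    return buffer.getvalue()

counts_csv = counts_to_csv(samples, counts)
print(counts_csv)
```

To save the result for upload, write `counts_csv` to a file named counts.csv.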
The samples file contains the phenotypic information of each sample.
The first column contains the sample name, which must exactly match the name given in the read counts file. Note that the first cell is again kept empty.
The following columns will contain phenotypic groups. Note that the platform will not accept purely numerical values.
All phenotype values must contain at least one letter of the alphabet. This is done to avoid continuous values (such as weight), as the platform expects discrete groups. Having an excessive number of phenotypic groups may also result in errors.
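Before uploading, it can help to check these two rules with a script: sample names that match the counts file exactly, and phenotype values that contain at least one letter. The sketch below assumes the tables have already been read into plain Python structures; `check_samples` and the example values are hypothetical and only mirror the rules stated above, not the platform's own validation.

```python
import re

def check_samples(count_samples, sample_rows):
    """Return a list of problems found in a samples table.

    count_samples: sample names taken from the header row of counts.csv
    sample_rows:   dict mapping sample name -> dict of phenotype -> value
    """
    problems = []
    # Sample names must match the counts file exactly.
    for name in sorted(set(count_samples) - set(sample_rows)):
        problems.append(f"sample '{name}' is in counts.csv but not in samples.csv")
    for name in sorted(set(sample_rows) - set(count_samples)):
        problems.append(f"sample '{name}' is in samples.csv but not in counts.csv")
    # Every phenotype value needs at least one letter (no purely numerical values).
    for name, phenotypes in sample_rows.items():
        for phenotype, value in phenotypes.items():
            if not re.search(r"[A-Za-z]", str(value)):
                problems.append(f"{name}: {phenotype} value '{value}' has no letter")
    return problems

problems = check_samples(
    ["control_1", "treated_1"],
    {
        "control_1": {"treatment": "control", "dose": "0"},
        "treated_1": {"treatment": "piperaquine", "dose": "10mg"},
        "treated_2": {"treatment": "piperaquine", "dose": "10mg"},
    },
)
for problem in problems:
    print(problem)
```

In this made-up example the script reports one sample that is missing from the counts file and one purely numerical phenotype value.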
The contrasts file contains the contrasts for pairwise comparisons across phenotypic groups.
The first column contains the phenotypes, which must exactly match the names given in the “samples.csv” file. The first cell is again empty.
The first row contains the names of the pairwise comparisons. Each contrast name must consist of the two groups joined together by “_vs_” (e.g. piperaquine_vs_control).
Naming must follow the rules described below. This file can be generated with spreadsheet software, and while “-1” and “1” are the default choice for selecting sample groups, other identifiers (e.g. “control” and “treatment”) can also be used.
Note that you can skip generating this file and instead rely on the “contrasts” tab under the “Upload data” module. Check this video tutorial (min 1:30) to see how you can select your contrasts through the platform interface.
However, if you have a particularly complex dataset with multiple phenotypes and a large number of pairwise comparisons, you might want to generate this file through a script.
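As a rough sketch of such a script, the example below builds a contrasts table with samples on rows and one “_vs_” column per comparison. Several details are assumptions rather than documented behaviour: that “1” marks the first-named group and “-1” the reference group, and that “0” marks samples outside a comparison. The `build_contrasts` helper and the group names other than those used in this tutorial are hypothetical, so check the result against the platform documentation before relying on it.

```python
import csv
import io

# Hypothetical sample-to-group assignment, e.g. one phenotype column of samples.csv.
sample_groups = {
    "control_1": "control",
    "control_2": "control",
    "piperaquine_1": "piperaquine",
    "chloroquine_1": "chloroquine",
}

def build_contrasts(sample_groups, comparisons):
    """Return CSV text with samples on rows and one column per comparison.

    comparisons: list of (group, reference) pairs.
    Assumed coding: 1 = group, -1 = reference, 0 = sample not in the comparison.
    """
    names = [f"{group}_vs_{ref}" for group, ref in comparisons]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + names)  # first cell left empty
    for sample, sample_group in sample_groups.items():
        row = [sample]
        for group, ref in comparisons:
            if sample_group == group:
                row.append(1)
            elif sample_group == ref:
                row.append(-1)
            else:
                row.append(0)  # assumption: 0 excludes the sample
        writer.writerow(row)
    return buffer.getvalue()

contrasts_csv = build_contrasts(
    sample_groups,
    [("piperaquine", "control"), ("chloroquine", "control")],
)
print(contrasts_csv)
```

Listing the comparisons explicitly, rather than generating every possible pair, keeps the number of contrasts under your control when a dataset has many groups.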
One of the most common issues faced by users of Omics Playground is errors in the preparation of the input files. Omics Playground has a few rules that must be observed when naming samples and phenotypes. If they are not followed, the platform will reject the input files or generate errors during the data processing phase.
Here are some of the most common issues and how to avoid them.
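When preparing the input files, do not give the first column a header such as “sample” or “ID”; leave its first cell empty. For example, the header row of samples.csv should read “,treatment,dose” rather than “sample,treatment,dose”.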
The platform is based on the R programming language, which makes heavy use of a feature called “regular expressions”.
In a regular expression, certain symbols have a special function beyond their ordinary meaning. For instance, the full stop “.” matches any single character. Other symbols to avoid include “/”, “+”, “*”, etc.
As good practice, never use any of these symbols when naming a sample or a phenotype. If you need to connect multiple elements, use an underscore (“_”) instead. For example, write “liver_tumor” rather than “liver.tumor”.
Empty spaces in the names of samples or phenotypes will cause the platform to throw an error. If you need to create complex names, connect the parts with underscores. For example, write “control_rep_1” rather than “control rep 1”.
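Both naming rules can be applied in bulk with a small helper. The sketch below is one possible approach, not a feature of the platform: the hypothetical `clean_name` function replaces every run of characters other than ASCII letters, digits, and underscores with a single underscore. Note that cleaning can make two different names identical, so it is worth checking that the cleaned names are still unique.

```python
import re

def clean_name(name):
    """Replace spaces and symbols with underscores, keeping letters, digits and '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())
    # Drop underscores left over at the start or end of the name.
    return cleaned.strip("_")

# Hypothetical raw names, for illustration only.
raw_names = ["treated.24h", "high dose", "dose+10/mg"]
cleaned = [clean_name(n) for n in raw_names]
print(cleaned)

# Cleaning must not merge two distinct names into one.
assert len(set(cleaned)) == len(cleaned)
```

Apply the same function to the sample names in every input file so that they still match each other exactly after cleaning.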
The platform does not yet cope with continuous numeric variables for phenotypes. To avoid that, our coders added a filter that flags phenotypes named “Time” or “Age” as unacceptable. The same applies to other continuous variables, such as height, weight, length, etc.
Instead, you should group the numeric values into defined intervals and then name the phenotype accordingly (e.g. “Age_groups”, “Time_intervals”, etc.).
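Grouping a continuous variable into intervals can be done in a spreadsheet or with a few lines of code. The sketch below uses made-up ages and arbitrary cut-offs chosen only for illustration; `age_group` is a hypothetical helper. Each label contains at least one letter and uses underscores instead of spaces, in line with the naming rules above.

```python
def age_group(age):
    """Map a numeric age to a named interval (cut-offs are illustrative only)."""
    if age < 30:
        return "under_30"
    if age < 60:
        return "from_30_to_59"
    return "60_and_over"

# Hypothetical ages per sample.
ages = {"sample_1": 24, "sample_2": 47, "sample_3": 63}

# These values would go in a phenotype column named e.g. "Age_groups".
age_groups = {sample: age_group(age) for sample, age in ages.items()}
print(age_groups)
```

Choose cut-offs that make biological sense for your study and that leave enough samples in each group.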
Preparing input files for Omics Playground is essential to the success of your analysis.
Prepare the input files required and follow the checklist to avoid errors during the data upload. With well-prepared input files, you are now ready to take full advantage of the advanced tools provided by Omics Playground to analyze your RNA-Seq and proteomics data sets.
What’s next? If you need additional information about basic and advanced data preparation you can check the Omics Playground documentation.
You can also check this video tutorial on how to upload your dataset through the platform interface or start analyzing your RNA-Seq or proteomics data with Omics Playground here: get started with Omics Playground.