How to Prepare your RNA-Seq and Proteomics data for Omics Playground

Last updated on December 13th, 2024
First Published on February 22nd, 2023

Omics Playground is a cloud-based platform designed to provide advanced tools for the analysis of RNA-Seq and proteomics data sets. To use the platform, the omics input data needs to be in a structured format.

In this tutorial, we will guide you on how to prepare your input files. Here’s what you’ll find:

1. Input files required by Omics Playground:

Count/expression file: “counts.csv”.
Samples file: “samples.csv”
Comparisons file: “comparisons.csv”

2. Checklist for your omics data input files.

Rule #1. Leave the header of the first column empty.
Rule #2. Avoid regular expressions.
Rule #3. Don’t use spaces in your names.
Rule #4. Don’t use numeric phenotypes.

3. Where to upload your omics data in Omics Playground.

4. Conclusion.

Let’s get started!

Input files required by Omics Playground

The platform requires the input files in CSV (comma-separated-values) text format.

If you have FASTQ files, you need to convert them into read counts first. One way of doing this is with Galaxy: run the conversion there, download the output file, and use it as the read count table input for Omics Playground. Read more on how to convert FASTQ files into read counts here.

Once done, you’ll need to upload the following input files:

– counts.csv: count/expression file with genes/proteins on rows, and samples as columns.

– samples.csv: samples file with samples on rows, and phenotypes as columns.

– comparisons.csv: comparisons file with samples on rows, and conditions as columns (optional).

Now, let’s look at an example for each input file.

1. Count/expression file: counts.csv

The count/expression file can be prepared with any spreadsheet software (such as Excel) or through a script that outputs CSV files.

The first column contains gene IDs, which can be in most formats (such as HGNC or Ensembl), but not in the Entrez number format. If you are using the latter, it will need to be converted through tools such as Syngo.

Also note that the platform will not accept transcript IDs. You will need to convert them to gene IDs. After conversion, the same gene may appear on several rows (one per original transcript); the platform will merge these entries.
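As a rough sketch of this conversion, the Python snippet below swaps transcript IDs for gene IDs using a lookup table. The IDs, the counts, and the `tx_to_gene` mapping are all hypothetical; in practice the mapping would come from your annotation source. The duplicated gene rows are deliberately left in place, since the platform merges them.

```python
# Hypothetical transcript-to-gene lookup, for illustration only.
tx_to_gene = {"tx_A1": "GENE_A", "tx_A2": "GENE_A", "tx_B1": "GENE_B"}

# Rows keyed by transcript ID, with one made-up count per sample.
transcript_rows = [
    ("tx_A1", [10, 12]),
    ("tx_A2", [3, 5]),
    ("tx_B1", [40, 38]),
]

# Replace each transcript ID with its gene ID; duplicate genes are kept as-is.
gene_rows = [(tx_to_gene[tx], values) for tx, values in transcript_rows]
print(gene_rows)
```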

The first row contains the sample names. The first cell of the first row should be left empty. Naming of the samples should follow the rules discussed below.

The individual cells in the other columns contain the raw or normalised read counts for the dataset. The values should always be numerical, with the exception of “NA” in case of a lack of data (Figure 1).
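If you prefer a script to a spreadsheet, this layout can be sketched in a few lines of Python with the standard `csv` module. The gene symbols, sample names, and values below are made up for illustration, and the text is written to an in-memory buffer rather than to an actual file.

```python
import csv
import io

# Hypothetical sample names and counts, for illustration only.
samples = ["ctrl_1", "ctrl_2", "treated_1", "treated_2"]
counts = {
    "TP53": [120, 98, 310, 287],
    "BRCA1": [15, 22, "NA", 40],  # "NA" marks a missing value
    "GAPDH": [5021, 4870, 5102, 4995],
}

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([""] + samples)        # first header cell is left empty
for gene, values in counts.items():
    writer.writerow([gene] + values)   # gene ID first, then one value per sample

counts_csv = buffer.getvalue()
print(counts_csv)
```

To produce a real file, the same writer can be pointed at `open("counts.csv", "w", newline="")` instead of the buffer.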

2. Samples file: samples.csv

The samples file contains the phenotypic information of each sample (Figure 2).

The first column contains the sample name, which must exactly match the name given in the read counts file. Note that the first cell is again kept empty.
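Because the sample names must match exactly, it can help to compare the two files before uploading. The following Python sketch does this on two small, made-up file contents; it is only an illustration of the check, not part of the platform.

```python
import csv
import io

# Made-up file contents for illustration.
counts_text = ",s1,s2,s3\nGENE_A,10,12,9\nGENE_B,0,3,1\n"
samples_text = ",treatment\ns1,control\ns2,drug\ns4,drug\n"

# Sample names are the header cells of counts.csv after the empty first cell...
counts_header = next(csv.reader(io.StringIO(counts_text)))
count_samples = set(counts_header[1:])

# ...and the first-column values of samples.csv after its header row.
sample_rows = list(csv.reader(io.StringIO(samples_text)))[1:]
sheet_samples = {row[0] for row in sample_rows}

missing_in_samples = sorted(count_samples - sheet_samples)
missing_in_counts = sorted(sheet_samples - count_samples)
print(missing_in_samples, missing_in_counts)
```

Both lists should be empty for a consistent pair of files; here the made-up data intentionally contains one mismatch in each direction.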

The following columns will contain phenotypic groups. Note that the platform will not accept purely numerical values.

All phenotypes must contain at least one letter of the alphabet. This is done to avoid continuous values (as in the case of weight), as the platform expects discrete ranges. Having excessive numbers of phenotypic groups may also result in errors.
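A quick way to spot offending columns before uploading is to look for values without any letter. The Python sketch below assumes a small, hypothetical set of phenotype columns; the helper name `columns_without_letters` is ours and not part of Omics Playground.

```python
import re

# Hypothetical phenotype columns from a samples file.
phenotypes = {
    "treatment": ["control", "drug", "drug"],
    "dose": ["10", "20", "20"],            # purely numeric values
    "dose_group": ["10mg", "20mg", "20mg"],
}

def columns_without_letters(columns):
    """Return the names of columns in which at least one value has no letter."""
    bad = []
    for name, values in columns.items():
        if any(re.search(r"[A-Za-z]", value) is None for value in values):
            bad.append(name)
    return bad

print(columns_without_letters(phenotypes))
```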

3. Comparisons file: comparisons.csv

This file contains the contrasts for pairwise comparisons across phenotypic groups (Figure 3).

The first column contains the phenotypes, which must exactly match the names given in the “samples.csv” file. The first cell is again empty.

The first row contains the names of the pairwise comparisons. Each comparison name must join the two groups with “_vs_” (e.g. piperaquine_vs_control).

Naming must follow the rules described below. This file can be generated with spreadsheet software. While “-1” and “1” are the default choice for selecting sample groups, other identifiers (e.g. “control” and “treatment”) can also be used.

Note that you can skip generating this file and instead rely on the “Comparisons” tab under the “Upload data” module where you can select your contrasts through the platform interface. For more information about this step of the data upload, please refer to our guide on how to upload your data into Omics Playground.

However, if you have a particularly complex dataset with multiple phenotypes and a large number of pairwise comparisons, you might want to generate this file through a script.
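Such a script could look like the Python sketch below, which compares every group against a single reference group. The group names are examples, the layout (groups on rows, comparisons on columns) follows the description above, and the use of “0” for groups that take no part in a comparison is an assumption on our side that this guide does not spell out.

```python
import csv
import io

# Hypothetical phenotype groups; "control" is the reference for every contrast.
groups = ["control", "piperaquine", "chloroquine"]
reference = "control"

contrasts = [f"{g}_vs_{reference}" for g in groups if g != reference]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([""] + contrasts)  # first header cell stays empty
for g in groups:
    row = [g]
    for name in contrasts:
        test_group = name.split("_vs_")[0]
        if g == test_group:
            row.append(1)    # group of interest
        elif g == reference:
            row.append(-1)   # reference group
        else:
            row.append(0)    # assumption: 0 marks groups left out of the contrast
    writer.writerow(row)

comparisons_csv = buffer.getvalue()
print(comparisons_csv)
```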

Checklist for your omics data input files.

One of the most common issues faced by users of Omics Playground is errors in the preparation of the input files. Omics Playground has a few rules that must be observed when naming samples and phenotypes; if they are not followed, the platform will reject the input files or generate errors during the data processing phase.

Here are some of the most common issues and how to avoid them.

Rule 1. Leave the header of the first column empty.

When preparing the input files, do not name the first column with the sample names or IDs, but leave it empty. Here’s an example:
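Viewed as plain text, the header line of a correctly formatted file starts directly with a comma, because the first cell is empty. The short Python sketch below checks this on two made-up file contents; the helper name `first_header_is_empty` is hypothetical.

```python
import csv
import io

def first_header_is_empty(csv_text):
    """Return True when the first cell of the header row is empty."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header[0].strip() == ""

good = ",sample_1,sample_2\nGENE_A,10,12\n"      # first header cell left empty
bad = "gene,sample_1,sample_2\nGENE_A,10,12\n"   # first header cell is named
print(first_header_is_empty(good), first_header_is_empty(bad))
```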

Rule 2. Avoid regular expressions when naming your samples or phenotypes.

The platform is based on the R programming language, which makes heavy use of a feature called “regular expressions”.

These are symbols that have a programming function beyond their common usage. For instance, the full stop symbol “.” matches any single character when used in a regular expression. Other symbols with a special meaning include “/”, “+”, “*”, etc.

As good practice, never use any of these symbols when naming a sample or a phenotype. If you need to connect multiple elements, use an underscore (“_”) instead. Here’s an example:
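The Python sketch below illustrates both points. It uses Python's `re` module rather than R, so it only demonstrates the general idea: a dot inside a pattern acts as a wildcard, and a small helper (the name `sanitize_name` is ours) can replace every character other than letters, digits, and underscores.

```python
import re

# In a regular expression the dot is a wildcard: the pattern "s.1" also matches "sX1".
dot_is_wildcard = re.fullmatch(r"s.1", "sX1") is not None

def sanitize_name(name):
    """Replace every character other than letters, digits, and underscore with "_"."""
    return re.sub(r"[^A-Za-z0-9_]", "_", name)

print(dot_is_wildcard)
print(sanitize_name("day.1+drug/A"))
```

After renaming, it is worth checking that two different names have not become identical, since the names in counts.csv and samples.csv still have to match one to one.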

Rule 3. Don't use spaces in your sample or phenotype names.

Empty spaces in the names of samples or phenotypes will cause the platform to throw an error message. If you need to create complex names, connect them via an underscore. Here’s an example:
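As a small illustration in Python, the helper below (a hypothetical name, not a platform function) trims leading and trailing spaces and joins the remaining words with underscores.

```python
def join_with_underscores(name):
    """Trim the name and replace each run of whitespace with a single underscore."""
    return "_".join(name.split())

names = ["high dose", " control ", "time  point 2"]
print([join_with_underscores(n) for n in names])
```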

Rule 4. Define intervals instead of using numeric phenotypes.

The platform cannot yet handle continuous numeric variables as phenotypes. To prevent errors, our coders added a filter that flags phenotypes named “Time” or “Age” as unacceptable. The same applies to other continuous variables, such as height, weight, length, etc.

Instead, you should group the numeric values into defined intervals and then name the phenotype accordingly (e.g. “Age_groups”, “Time_intervals”, etc.).
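One possible way to do this grouping is sketched below in Python. The cut-off values and labels are arbitrary examples, not recommendations; the resulting labels would go into a column named something like “Age_groups”.

```python
def age_group(age, edges=(30, 50)):
    """Map a numeric age to a labelled interval; the edges are arbitrary examples."""
    low, high = edges
    if age < low:
        return f"under_{low}"
    if age < high:
        return f"from_{low}_to_{high - 1}"
    return f"over_{high - 1}"

ages = [24, 30, 49, 50, 71]
print([age_group(a) for a in ages])
```

Each label contains letters and uses only underscores as separators, so it also satisfies the other rules in this checklist.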

Where to upload your data in Omics Playground

You can upload your data through the interface of the platform by clicking on the “Upload new data” button located on the Welcome screen of your Omics Playground account (Figure 1).

You’ll then be prompted to select your input files (Figure 2), comparisons and desired computation options. You can learn more about the uploading process in our blog post on how to upload your data to Omics Playground.

Conclusion

Preparing input files for Omics Playground is essential to the success of your analysis.

Prepare the input files required and follow the checklist to avoid errors during the data upload. With well-prepared input files, you are now ready to take full advantage of the advanced tools provided by Omics Playground to analyze your RNA-Seq and proteomics data sets.

What’s next? If you need additional information about basic and advanced data preparation, you can check the Omics Playground documentation.


Frequently asked questions on data preparation

Does Omics Playground support the use of raw data as input?

No, to analyze your omics data with Omics Playground, you’ll need to provide one of the following:

For RNA-Seq data: read count files
For proteomics data: abundance tables
For metabolomics data: concentration files

What is the process for inputting data into Omics Playground?

To upload your data into Omics Playground, first sign in to your account or create a new one. Then, click on ‘Upload new data’ to begin the guided upload process, which consists of five steps:

Steps 1 and 2: Upload the expression table and sample files you prepared according to this guide.

Step 3: Select your comparisons either interactively or by uploading a comparisons file.

Step 4: Apply additional quality control or batch correction settings if needed.

Step 5: Name and describe your dataset, and select custom computation options if necessary.

Please note that if you’re unsure what to select, default options based on best practices will be applied for both steps 4 and 5, tailored to your data type. For more detailed information on each step, refer to our comprehensive data upload guide.

I need help pre-processing my data, can BigOmics help?

BigOmics has formed partnerships with industry leaders, including Almaden Genomics and DNAnexus, to assist with omics data pre-processing and secondary analysis. These collaborations streamline your omics data analysis process, allowing you to move from raw data to interpretation more quickly. For additional guidance, feel free to contact our team here.