Published on November 20th, 2023
In this blog article, we take a look at common mistakes made during bioinformatics analysis, and offer practical solutions to minimize issues caused by these mistakes. Even seasoned professionals may sometimes overlook critical aspects of their data. This article can be used as a comprehensive reminder and checklist to ensure successful and accurate bioinformatics analyses.
Bioinformaticians should take an active role and collaborate with biologists at an early stage of the experimental design process to ensure that the data gathered will serve the purpose of the experiment. Waiting until the data has already been generated can make analysis more challenging in the long run. In a previous post we discussed common mistakes in -omics experiments, and emphasized the importance of appropriate experimental design. For example, comparing one single sample with another single sample leaves no replicates, so the variability within each group cannot be estimated and a statistical comparison is impossible.
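A quick way to catch this problem at the design stage is to count the samples per group before any data is generated. The snippet below is a minimal sketch: the function name, the sample and group labels, and the minimum of two replicates are our own assumptions for illustration, and many designs will call for more.

```python
from collections import Counter

def groups_lacking_replicates(sample_to_group, min_replicates=2):
    """Return groups with fewer samples than min_replicates (an assumed threshold)."""
    counts = Counter(sample_to_group.values())
    return {group: n for group, n in counts.items() if n < min_replicates}

# Hypothetical design: two controls, but only one treated sample
design = {"s1": "ctrl", "s2": "ctrl", "s3": "treated"}
print(groups_lacking_replicates(design))  # {'treated': 1}
```

Any group reported here cannot contribute an estimate of within-group variability, which is a good reason to talk to the biologist before the experiment starts.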
Outliers can arise in many ways, e.g. due to pipetting or analytical errors, and dealing with them can be difficult. Data points that deviate strongly from the rest of the data can have a large impact on the analysis. To maintain data quality and interpret the results correctly, it is important to understand the source of the outliers and to determine whether it is appropriate to remove them from your analysis.
There is a tendency to somewhat hastily remove outliers instead of accepting that they may be part of the natural variation of the data. Incidentally, this approach is not unique to bioinformaticians: biologists are also guilty of this.
BigOmics’ Omics Playground data visualization and analysis platform allows users to use both unsupervised tools and statistical methods to detect and correct outliers. For unsupervised methods, samples are presented in tSNE or UMAP where the user can visually spot outliers and remove them from the input files if needed. For the genes, if certain extreme outlier values are present in the counts, the platform detects and imputes them back to a comparable range.
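If you want to screen for outliers yourself, it helps to flag candidates for inspection rather than delete them automatically. The sketch below uses a modified z-score based on the median absolute deviation (MAD). This is a generic statistical approach chosen for illustration, not the method used by Omics Playground, and the function name, the example values and the cutoff of 3.5 are assumptions.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return indices of values whose MAD-based modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median: the score is undefined, so flag nothing
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

# Hypothetical per-sample summary values (e.g. a QC metric per sample)
qc_metric = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0]
print(flag_outliers(qc_metric))  # [5]
```

The flagged sample should then be traced back to its source (pipetting, processing, or genuine biology) before deciding whether removal is justified.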
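Batch correction is an important step in maintaining data accuracy, and it needs to be carefully considered, especially when working with data sets from different experimental runs. For example, if batch is completely confounded with cell line, meaning that each cell line was processed in its own batch, the batch effect cannot be separated from the biological difference and a meaningful comparison is not possible.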
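The cell-line example can be checked from the sample sheet alone, before any normalization is attempted. The following is a minimal sketch that tests whether every batch contains only a single group, in which case batch and group cannot be told apart. The function name and the batch and cell-line labels are hypothetical.

```python
from collections import defaultdict

def batch_determines_group(samples):
    """samples: iterable of (batch, group) pairs.

    Returns True when there is more than one group and every batch
    contains exactly one group, i.e. batch and group are confounded.
    """
    groups_in_batch = defaultdict(set)
    all_groups = set()
    for batch, group in samples:
        groups_in_batch[batch].add(group)
        all_groups.add(group)
    return len(all_groups) > 1 and all(
        len(groups) == 1 for groups in groups_in_batch.values()
    )

# Each cell line was run in its own batch: no comparison is possible
confounded = [("run1", "HeLa"), ("run1", "HeLa"), ("run2", "A549"), ("run2", "A549")]
# Both cell lines are present in both batches: batch can be modelled
balanced = [("run1", "HeLa"), ("run1", "A549"), ("run2", "HeLa"), ("run2", "A549")]

print(batch_determines_group(confounded))  # True
print(batch_determines_group(balanced))    # False
```

A result of True does not mean the data is useless, but it does mean that no correction method can recover the biological difference from the batch difference in that design.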
Certain batch effects can be avoided by discussing the experimental design with the biologist beforehand. However, if you want to use additional data, external data sets, or if you want to reanalyze a public data set and compare it to your own data, then you must pay attention to batch correction and perform normalizations.
With Omics Playground, several datasets can be normalized together as long as each dataset has a distinct batch identifier. More information on the differences between data sets (or batches) and details on the methods available for batch correction is available here.
Neglecting batch correction can affect your analysis and lead to incorrect conclusions.
In multi-omics data analysis, different types of data are merged. This process comes with challenges. When integrating data sets, it is crucial to harmonize the data and account for differences in naming conventions and data structures to avoid compatibility issues and biased results.
Errors in gene names are common in the scientific literature. For example, Microsoft Excel is known to convert gene names to dates and floating-point numbers when used with default settings. A programmatic review of leading genomics journals found that ~20% of publications with supplemental Excel gene lists contain gene name conversion errors. [1]
To address this issue, Microsoft recently published a blog post introducing new Excel updates that allow users to disable automatic data conversion. [2]
Therefore, be careful with gene names that resemble dates, such as Sept1, Sept9 or March1, so that they are not changed automatically by spreadsheet programs. Where possible, read your data files with a programming language such as R or Python instead of opening them in a spreadsheet program, and import gene identifiers explicitly as text.
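Gene lists that have already passed through a spreadsheet can be screened for typical conversion artifacts. The sketch below is a heuristic under our own assumptions: the two patterns cover date-like entries such as "1-Sep" and scientific-notation entries such as "2.31E+13", and the function name is hypothetical. It will not catch every possible conversion.

```python
import re

# Entries such as "1-Sep" or "2-Mar" that used to be gene symbols
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)
# Entries such as "2.31E+13" that used to be alphanumeric identifiers
SCI_NOTATION = re.compile(r"^\d+(\.\d+)?E\+?\d+$", re.IGNORECASE)

def suspicious_gene_names(names):
    """Return the entries that look like spreadsheet-converted gene names."""
    return [n for n in names if DATE_LIKE.match(n) or SCI_NOTATION.match(n)]

gene_list = ["TP53", "1-Sep", "2.31E+13", "Sept9", "1-Mar"]
print(suspicious_gene_names(gene_list))  # ['1-Sep', '2.31E+13', '1-Mar']
```

When reading files programmatically, Python's built-in csv module returns every field as a string, so gene symbols are left untouched unless you convert them yourself.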
As a bioinformatician, you need to be aware that biologists often use spaces and special characters, such as “-”, “*”, “+/-” and “/”, when naming samples. These characters can cause errors in parsing and computation if they are not handled properly. Given the constant flood of data, regular and careful data cleaning is essential to maintain accuracy by eliminating inconsistencies and errors.
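Sample names can be normalized in one place at the start of a pipeline. The following sketch replaces every run of characters other than letters and digits with an underscore and keeps the names unique. The function name and the replacement scheme are assumptions for illustration. Note that stripping characters can erase real information (for example "+/-" versus "-/-"), so keep the original names alongside the cleaned ones.

```python
import re

def clean_sample_names(names):
    """Return parser-safe, unique versions of the given sample names, in order."""
    cleaned = []
    used = set()
    for name in names:
        # Collapse every run of non-alphanumeric characters into one underscore
        safe = re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_") or "sample"
        candidate, n = safe, 1
        # Add a numeric suffix if two original names collapse to the same string
        while candidate in used:
            n += 1
            candidate = f"{safe}_{n}"
        used.add(candidate)
        cleaned.append(candidate)
    return cleaned

raw = ["WT rep 1", "KO +/- rep1", "KO -/- rep1", "ctrl*"]
print(clean_sample_names(raw))
# ['WT_rep_1', 'KO_rep1', 'KO_rep1_2', 'ctrl']
```

The second and third names show why the mapping should be reviewed: both genotypes collapse to the same base name and are only distinguished by the suffix.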
Genes (or features) are traditionally expected in rows, with samples (observations) in columns. However, this orientation may differ in other fields and in individual tools. Being aware of this distinction is important for running an algorithm as intended. Therefore, make sure to check the input specifications of your algorithms, and ensure your data is correctly oriented. For instance, the imputation algorithms from the R packages pcaMethods and missForest expect variables in columns and observations in rows.
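One way to guard against a silently transposed matrix is to compare the axis labels with the sample IDs you expect, and to transpose only when needed. This is a minimal sketch using plain Python lists. The function name and the labels are hypothetical, and in practice you would apply the same check to whatever matrix type your pipeline uses.

```python
def ensure_samples_in_rows(matrix, row_labels, col_labels, sample_ids):
    """Return (matrix, row_labels, col_labels) with samples along the rows."""
    expected = set(sample_ids)
    if set(row_labels) == expected:
        # Already observations in rows, variables in columns
        return matrix, row_labels, col_labels
    if set(col_labels) == expected:
        # Genes are in rows: transpose so that each row is one sample
        transposed = [list(column) for column in zip(*matrix)]
        return transposed, col_labels, row_labels
    raise ValueError("neither axis matches the expected sample IDs")

# Genes in rows, samples in columns (the usual -omics layout)
counts = [[1, 2], [3, 4], [5, 6]]
genes = ["g1", "g2", "g3"]
samples = ["s1", "s2"]

oriented, rows, cols = ensure_samples_in_rows(counts, genes, samples, samples)
print(oriented)  # [[1, 3, 5], [2, 4, 6]]
print(rows)      # ['s1', 's2']
```

Raising an error when neither axis matches is deliberate: a mismatch usually points to renamed or missing samples, which should be resolved rather than guessed.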
Keep organism-specific gene names fixed when performing the analysis. For example, do not convert mouse genes to human genes in your analysis. In mouse gene names, the first letter is capitalized and the other letters are lowercase, whereas in human gene names, all letters are capitalized.
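The capitalization convention described above can serve as a rough sanity check on a gene list. The sketch below classifies symbols by letter case only. It is a heuristic with hypothetical function names, it does not look anything up in a gene database, and symbols with mixed case simply fall into the "unknown" category.

```python
from collections import Counter

def symbol_style(symbol):
    """Classify a gene symbol as 'human-like', 'mouse-like' or 'unknown' by case."""
    letters = [c for c in symbol if c.isalpha()]
    if not letters:
        return "unknown"
    if all(c.isupper() for c in letters):
        return "human-like"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "mouse-like"
    return "unknown"

def style_counts(symbols):
    """Count how many symbols fall into each style."""
    return Counter(symbol_style(s) for s in symbols)

print(symbol_style("TP53"))   # human-like
print(symbol_style("Trp53"))  # mouse-like
print(dict(style_counts(["TP53", "Trp53", "Gapdh"])))
# {'human-like': 1, 'mouse-like': 2}
```

A list that is supposed to come from one organism but shows a mix of both styles is worth a closer look before the analysis continues.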
Since gene naming is usually settled at the start of an analysis and handled automatically, such an error is fairly unlikely. A more important problem is comparing data sets that were processed with different versions of a reference genome, for example different versions of the human or mouse genome, which can lead to inconsistencies between the data sets.
To carry out a complete and correct data analysis and interpretation, be sure that you have all the information on the data and confounding factors. For example, individual samples may have been processed using different methods, or at different times. If such information is not passed on to the bioinformatician by the physicians or biologists, this can lead to inconsistencies and incorrect results or batch effects, especially when it comes to patient data.
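A simple completeness check on the sample sheet makes missing information visible before the analysis starts. In the sketch below, the function name and the required fields ("batch", "processing_date", "protocol") are assumptions chosen for illustration. Replace them with whatever your study actually needs to record.

```python
def missing_metadata(samples, required=("batch", "processing_date", "protocol")):
    """Return {sample_id: [missing fields]} for samples with incomplete metadata."""
    problems = {}
    for sample_id, fields in samples.items():
        missing = []
        for key in required:
            value = fields.get(key)
            # Treat absent, None and blank values as missing
            if value is None or not str(value).strip():
                missing.append(key)
        if missing:
            problems[sample_id] = missing
    return problems

# Hypothetical sample sheet with one incomplete record
sheet = {
    "p1": {"batch": "A", "processing_date": "2023-01-10", "protocol": "v2"},
    "p2": {"batch": "", "protocol": "v2"},
}
print(missing_metadata(sheet))  # {'p2': ['batch', 'processing_date']}
```

The resulting list is a concrete set of questions to send back to the physicians or biologists who collected the samples.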
Many common errors can be avoided by recognizing the above-mentioned sources of error and by close cooperation between biologists and bioinformaticians. BigOmics’ platform, Omics Playground, serves as a bridge between biologists and bioinformaticians, helping them to communicate better with each other. This helps ensure that experimental results are clearly communicated, reducing misunderstandings and improving mutual understanding.
Equally important are clean data, robustness of results and their standardization. One way to achieve all this is to use Omics Playground.
For more insights and support, try Omics Playground for yourself! Don’t miss this opportunity to learn from these common mistakes and optimize your approach to biological research.
REFERENCES
[1] Ziemann M, Eren Y, El-Osta A. Gene name errors are widespread in the scientific literature. Genome Biol. 2016 Aug 23;17(1):177. doi: 10.1186/s13059-016-1044-7. PMID: 27552985; PMCID: PMC4994289. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
[2] Microsoft Fixes Excel Feature That Forced Scientists to Rename Human Genes. https://finance.yahoo.com/news/microsoft-fixes-excel-feature-forced-151000728.html?guccounter=2