Published on October 18th, 2023
Scientists frequently make mistakes in experimental design and data analysis. The good news is that many of these issues can be prevented through awareness of the pitfalls. In this article, we pinpoint the most common mistakes and offer practical solutions to help avoid them or minimize their impact.
One of the most common mistakes is attempting statistical analysis by comparing a single sample against another single sample. This error is often based not on a lack of statistical knowledge but on the belief that “I’ll just take one sample from the disease group and one from the control group, and work some statistical magic.” While some exploratory analysis without replicates is possible (e.g. using edgeR), it is no substitute for true replicates and is strongly discouraged for the analysis and interpretation of datasets. To get meaningful results, a minimum of three samples per group is needed. The more samples you have, the higher the statistical power of the experiment.
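To make the link between sample size and power concrete, here is a small Monte Carlo sketch in Python (our own illustration, not an analysis from this article): it estimates the power of a two-sample t-test for a hypothetical effect of one standard deviation between a disease and a control group. The effect size and significance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

# Monte Carlo sketch: estimate the power of a two-sample t-test for a
# hypothetical effect of one standard deviation between groups. The effect
# size and alpha below are illustrative assumptions, not published values.
rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect_size=1.0, alpha=0.05, n_sim=2000):
    """Fraction of simulated experiments in which the t-test reaches p < alpha."""
    hits = 0
    for _ in range(n_sim):
        control = rng.normal(0.0, 1.0, n_per_group)
        disease = rng.normal(effect_size, 1.0, n_per_group)
        if ttest_ind(control, disease).pvalue < alpha:
            hits += 1
    return hits / n_sim

for n in (3, 5, 10, 20):
    print(f"n = {n:2d} per group -> estimated power = {estimated_power(n):.2f}")
```

With an effect of this size, three samples per group detect it only a small fraction of the time, while twenty per group do so reliably, which is exactly why replicates matter.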
It’s important to note that this limitation isn’t just about cost savings but also about the practical challenges of sample acquisition. For example, in clinical scenarios involving the collection of brain tissue from patients, more than one sample might simply not be an option. If you have a single sample, be aware of the limited information it provides and do not draw unwarranted conclusions.
When investigating specific phenotypes such as the likelihood of developing lung cancer, numerous factors come into play, including smoking habits, age, BMI and even unrelated factors, such as the time and place where samples were collected and sequenced. Some of these factors cannot be quantified, making it impossible to account for all variables, so there is always going to be some element of noise.
Understanding the influence of factors such as genetic variation, environment, and technical conditions is therefore important to avoid drawing unwarranted conclusions, especially when dealing with human patient data, which is inherently noisy.
The use of appropriate controls can reduce the impact of variables, but proper sample preparation is also critical, as contamination problems can occur unnoticed.
Biological research spans various stages, from cell lines to organoids, congenic mice, and human patients. Each stage introduces new sources of variability.
As individuals age, they accumulate unique mutations. Even among monozygotic twins, this can lead to significant differences: in 62% of cases, for example, only one twin will develop cancer. Genetic diversity profoundly impacts gene expression.
Furthermore, environmental factors, such as diet or living conditions, greatly influence gene expression. A good example is the association between DNA methylation and schizophrenia in monozygotic twins, which highlights how environmental factors can lead to different outcomes even with identical genetic backgrounds.
To avoid such errors, biologists should aim for larger sample numbers whenever possible. All relevant variables affecting patients should be controlled where possible, and if this is not possible, noted. In clinical cases with limited sample availability, it is important to realize that low sample numbers will limit the power of statistical analyses.
When working with both in vitro and animal models, controlling for environmental variables such as temperature and diet in experiments is critical to reducing noise. This becomes even more challenging in patient studies, where such variables cannot be avoided.
As a general rule, work with animals of the same sex (males only or females only) to avoid sex-related differences. Females are often less aggressive, making them easier to cage together and handle, but some studies will inevitably require male mice, where the potential for cage aggression increases. Also note that housing mice together can cause stress, which may affect immune responses. On the other hand, mice are social animals, and isolating them can also cause stress. It is therefore usually best to balance the number of animals per cage, avoiding overcrowding while also preventing “lonely mouse syndrome”.
Generally, the fewer environmental variables you can control, the more variability there will be in your data: cell line experiments may require a minimum of three samples per group, while mouse experiments may require five or even ten due to greater inter-animal variability.
In patient studies, a larger sample size is desirable. For clinical trials, several hundred participants is a common minimum and can serve as a starting point for omics experiments, but more complex studies should aim for thousands of patients to identify outliers and maintain statistical significance. As previously mentioned, however, collecting that many samples is not always feasible, and studies are often published on a handful of samples; these should always come with strong caveats about the interpretability of the results. Be aware, too, that large-sample experiments face their own pitfalls, particularly spurious correlations. Ironically, too much data can be as bad as too little, unless all possible variables can be accounted for, which is still rarely, if ever, the case. Careful data selection and measurement, among other aspects, are therefore crucial for obtaining meaningful correlations.
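The spurious-correlation problem is easy to demonstrate with a short simulation (our own illustration, using purely synthetic random data): screening thousands of unrelated "measurements" against a random "phenotype" will always surface some strong correlations by chance alone.

```python
import numpy as np

# Illustrative sketch with synthetic data (not from any real study): none of
# these features is truly related to the outcome, yet with enough of them,
# some will correlate strongly purely by chance.
rng = np.random.default_rng(1)
n_samples, n_features = 100, 5000

outcome = rng.normal(size=n_samples)                  # random "phenotype"
features = rng.normal(size=(n_samples, n_features))   # random "measurements"

# Pearson correlation of every feature with the outcome.
z_feat = (features - features.mean(axis=0)) / features.std(axis=0)
z_out = (outcome - outcome.mean()) / outcome.std()
r = z_feat.T @ z_out / n_samples

print(f"strongest spurious correlation: |r| = {np.abs(r).max():.2f}")
print(f"features with |r| > 0.2: {(np.abs(r) > 0.2).sum()}")
```

With only 100 samples, typically a couple of hundred of the 5,000 random features exceed |r| > 0.2, which is why multiple-testing correction and careful variable selection are essential in omics-scale screens.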
Scientists naturally distinguish between causality and correlation when interpreting data. Biologists formulate hypotheses rooted in their biological knowledge and must rigorously test them, rather than merely confirming preconceptions. Bioinformaticians, in contrast, adopt a data-driven approach. Collaborative discussions balancing these perspectives enhance experimental design and analysis. A single individual rarely excels in both lab work and bioinformatics, as these fields require distinct skill sets and extensive training.
Often trends are observed without certainty about causation or correlation, prompting the need for further experiments. Here, biologists and bioinformaticians collaborate effectively, combining hypothesis-driven and data-driven approaches to test causal relationships. Conflicts between these methods can be productive, as data may contradict initial hypotheses or reveal noise in datasets.
Thus, constructive discussions on experimental design and replication yield valuable insights, enhancing research quality and benefiting from diverse expertise.
By recognizing the importance of sample size, understanding variability in different research models, and balancing hypothesis-driven and data-driven approaches through collaboration between biologists and bioinformaticians, many common mistakes can be avoided.
BigOmics’ platform, Omics Playground, serves as a bridge between biologists and bioinformaticians and helps them collaborate more effectively. When a bioinformatician can show a biologist compelling visualizations and analyses, and can translate complex bioinformatic findings into terms the biologist understands, effective communication follows. This ensures that experimental results are communicated clearly, reducing misunderstandings and improving mutual understanding.
Don’t miss the opportunity to learn from these common mistakes and optimize your approach to biological research. Get more insights into your experimental results, try Omics Playground for yourself!
Harvard T. H. Chan School of Public Health. “Twin study estimates familial risks of 23 different cancers.”
Castellani, Christina A., et al. “DNA methylation differences in monozygotic twin pairs discordant for schizophrenia identifies psychosis related genes and networks.” BMC Medical Genomics 8:17 (2015). doi:10.1186/s12920-015-0093-1
Ye, Xiaobu, et al. “Suspected Lonely Mouse Syndrome as a Cage Effect in a Drug Safety Study.” Journal of Veterinary Medicine 2018:9562803 (2018). doi:10.1155/2018/9562803
Sakpal, T. V. “Sample size estimation in clinical trial.” Perspectives in Clinical Research 1(2):67–69 (2010). PMID: 21829786; PMCID: PMC3148614
Calude, C. S., and Longo, G. “The Deluge of Spurious Correlations in Big Data.” Foundations of Science 22, 595–612 (2017). https://doi.org/10.1007/s10699-016-9489-4