Current Tools and Methods in Genomic Analysis

Working in genomic data science or research? This survey is still open, so let us know your thoughts on current tools and methods in genomic analysis!

Introduction

In today’s genomic era, comprehensive analysis of genomic data is becoming increasingly common in academic and clinical research contexts [Conesa and Mortazavi, 2014]. This development increases the need for more sophisticated tools and methods for acquiring, distributing and analysing genomic data [de Brevern et al., 2015].

Within this scope, the comprehensive annotation and analysis of nucleotide polymorphisms has become a distinct discipline within genomics. Efforts not only to associate risk variants with disease, but also to identify causal variants for pathological conditions and simply to better understand the genomic landscape of large datasets, already exist and will likely increase in the future.

At the same time, the range and number of software tools to help overcome these challenges are increasing at a fast pace, and scientists find themselves in the fortunate position of being able to choose from highly specialised methods tailored to their use cases. In the field of variant analysis, too, a solid ecosystem of tools has evolved, each tool having its own strengths and weaknesses [de Brevern et al., 2015].

Besides the right choice of tools, how to find and choose relevant genomic data is also a common question among scientists. Acquisition of data relevant to research questions is important and can have a large impact on the validity of any analysis. It is acknowledged that data reuse and reanalysis between genomics studies should reduce false positive results, increase reliability and improve the chances of making novel discoveries. However, the processes researchers use to find data to power their research are often ad hoc and inefficient, and furthermore, many researchers still do not use external data sources to power their research at all [van Schaik et al., 2014].

“Searching for relevant data is haphazard and usually involves a general web search, visiting one or more databases, searching through a journal database and/or tracing the data referenced in a published article, or a word-of-mouth search” [van Schaik et al., 2014]

Aim of the survey

In this survey we aimed to get a better understanding of the software tools currently used by bioinformaticians and data scientists working in the field of genomics, as well as the scientific questions asked when analysing variant data. Additionally, we were interested in the survey participants’ genomic data search and access habits, and whether our respondents behave similarly to or differently from those surveyed by van Schaik et al. [2014].

We sent out a short web questionnaire, created with Typeform and comprising nine questions in total, via e-mail to a selected user base.

The results presented below are derived from business professionals and researchers working in genomics, with fields of work ranging from bioinformatics and biology to data science and software development.


Results

Life scientists use a wide range of different web and desktop applications to analyse their genomic data.

When asked which type of software they use in the analysis of their data, both desktop applications and web tools proved popular among the target group. The vast majority of scientists (73%) use both types of software tool to tackle genomic data analysis, with only a small portion relying mostly on web-based tools (20%) and even fewer mostly on desktop applications (7%).

Pie chart: Which type of tools do you use for genomic data analysis? Answers: 73% - a mix of both; 20% - mostly web-based tools; 7% - mostly desktop applications

The list of tools is astonishingly broad and comprises 39 different tools and web services used by the participants to handle genomic data. Most popular are highly multifunctional tools such as the UCSC Genome Browser, Ensembl and the Integrative Genomics Viewer (IGV), but also the statistical programming environment R for creating and reusing data analysis packages. Further down the list we find a rich set of software tools built for different levels of data visualisation and manipulation, ranging from simple command line tools for data parsing to complex web applications orchestrating full data analysis workflows.
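To make the simpler end of this spectrum concrete, the sketch below is our own minimal example (not taken from any participant's workflow) of the kind of lightweight data parsing that such command line scripts perform: it tallies SNVs and indels per chromosome from an uncompressed VCF file. The file path and the SNV/indel classification rule are assumptions made purely for illustration.

```python
# Minimal sketch, not from the survey: lightweight VCF parsing of the kind
# often done with small command line scripts before reaching for heavier
# tools. The input path is a hypothetical uncompressed VCF file.
import sys
from collections import Counter

def summarise_vcf(path):
    """Count SNVs and indels per chromosome in an uncompressed VCF file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):  # skip meta-information and header lines
                continue
            chrom, _pos, _vid, ref, alt = line.rstrip("\n").split("\t")[:5]
            # Crude classification: both REF and the first ALT allele are
            # single bases -> SNV; everything else is counted as an indel.
            is_snv = len(ref) == 1 and len(alt.split(",")[0]) == 1
            counts[(chrom, "SNV" if is_snv else "indel")] += 1
    return counts

if __name__ == "__main__":
    for (chrom, kind), n in sorted(summarise_vcf(sys.argv[1]).items()):
        print(f"{chrom}\t{kind}\t{n}")
```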

In sum, the ecosystem of tools present and in use in the life sciences community can be described as very diverse. We attribute this to the multitude of different research needs that survey participants are confronted with, but also to the fact that many of the tools mentioned are products of rather closed academic project communities and are sometimes known only to a small user base.

Reference data for variant analysis is still kept simple

Looking at the scientific questions that drive survey participants’ research - specifically the analysis of variants - we were interested both in which kind of reference data set is preferred and in which parameters drive variant comparison.

Most researchers (63%) investigate variants in comparison to a reference genome build, and only rarely in contrast to a control group (19%). Variant data derived from paired samples, as in e.g. cancer research, is preferred by only 13% of survey participants. It is still unclear whether these answers mostly reflect researchers’ interests or whether obstacles in obtaining more specific reference data sets influence the scientists’ choices.

Which type of reference dataset do you usually use (or would you like to use if the data was available) to compare your variant data to? Answers: 63% - reference genome; 19% - variant data of a control group; 13% - control samples in paired tumour/normal samples; 5% - other
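To illustrate the difference between these options, the following sketch is our own example (not part of the survey) of the simplest possible control-group comparison: it keeps only those variants seen in a case VCF that never appear in a control-cohort VCF. The file names are hypothetical, and a real pipeline would also normalise variant representation and consider genotypes and allele frequencies rather than mere presence.

```python
# Illustrative sketch only: naive filtering of case variants against a
# control group. "case.vcf" and "controls.vcf" are hypothetical uncompressed
# VCF files; real pipelines would also normalise indel representation.
def variant_keys(path):
    """Return the set of (CHROM, POS, REF, ALT) keys found in a VCF file."""
    keys = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alt = line.split("\t")[:5]
            keys.add((chrom, pos, ref, alt))
    return keys

case_only = variant_keys("case.vcf") - variant_keys("controls.vcf")
print(f"{len(case_only)} variants are private to the case sample")
```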

Regarding the analysis of variant frequencies, we found that frequencies are derived either from the overall population (37.5%) or from ethnic subpopulations (25%). The survey also offered the option of analysing variant frequencies in subpopulations defined by parameters other than ethnicity (e.g. blood type or genotype), but interestingly this was not chosen by any of the survey participants, indicating that research interest in such specific subpopulations is not yet common. We also found that a fair proportion of participants (37.5%) do not investigate variant frequencies in populations at all, which implies that current genomic research does not focus on population studies alone.

When obtaining information on the frequency of a variant of interest, which population subset do you refer to? Answers: 37.5% - frequency of variants in the overall population; 37.5% - my research interest does not involve population studies; 25% - frequency of variants in an ethnic subpopulation; 0% - frequency of variants in a subpopulation defined by a parameter other than ethnicity
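For readers less familiar with what such a frequency estimate involves, the short sketch below is our own minimal illustration (not survey material) of computing an alternate allele frequency from diploid genotype counts within an arbitrarily defined subpopulation; the counts themselves are invented.

```python
# Minimal illustration, not from the survey: alternate allele frequency from
# diploid genotype counts in a chosen subpopulation. The counts are made up;
# in practice they would come from a VCF or a cohort database.
def allele_frequency(hom_ref, het, hom_alt):
    """Frequency of the alternate allele among 2 * N sampled chromosomes."""
    n_individuals = hom_ref + het + hom_alt
    alt_alleles = 2 * hom_alt + het
    return alt_alleles / (2 * n_individuals)

# Example: 60 hom-ref, 30 het and 10 hom-alt individuals gives AF = 0.25
print(allele_frequency(hom_ref=60, het=30, hom_alt=10))
```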

Most researchers do, via various means, use external sources of genomic data to power their research

When asked if they used externally sourced data for their research, the majority of respondents (87%) said they did. This is an increase over the 74% of researchers surveyed by van Schaik et al. [2014] who said they accessed data from repositories. However, it is important to note that the difference in the wording of these questions means that our survey would also include researchers who used externally sourced data that did not come from repositories (i.e. from collaborators).

Do you use externally sourced data for your research? Answers: 87% - Yes; 13% - No

The predominant way respondents find this data is via repositories and databases, although searching for data through publications (using the NCBI literature search engine PubMed) and via Google is also popular. The repositories mentioned by name are EBI, GEO, ArrayExpress and GenBank. Additionally, some respondents source external data by asking collaborators or forming collaborations. The steps in the process of finding external data that researchers struggle with most include: finding the right kind of data; sifting through large amounts of data or publications; dealing with poor data descriptions; associating phenotypic/functional data with the raw data; procuring the data; and not having enough time.

How do you find this data? 33% - Databases; 24% - Papers (PubMed); 24% - Google; 14% - Collaboration; 5% - Word of Mouth
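As an aside, the sketch below shows what the repository/database route can look like when done programmatically. It is our own example rather than a workflow described by any respondent: it queries NCBI's GEO DataSets index through the public E-utilities search endpoint, and the search term is invented for illustration.

```python
# Hedged example (not a tool named by respondents): searching NCBI's GEO
# DataSets index via the public E-utilities "esearch" endpoint. Only the
# Python standard library is used; the query string is a made-up example.
import json
import urllib.parse
import urllib.request

def search_geo(term, retmax=20):
    """Return (total hit count, list of UIDs) for a GEO DataSets query."""
    params = urllib.parse.urlencode({
        "db": "gds",          # the GEO DataSets database
        "term": term,
        "retmode": "json",
        "retmax": retmax,
    })
    url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"
    with urllib.request.urlopen(url) as response:
        result = json.load(response)["esearchresult"]
    return int(result["count"]), result["idlist"]

count, ids = search_geo('"breast cancer" AND "Homo sapiens"[Organism]')
print(f"{count} matching records; first UIDs: {ids}")
```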

Of the 13% of respondents who said they did not use externally sourced data for their research, the only cited reason for not doing so was that they “did not know how to access it”. Furthermore, all of the ‘No’ responders said that if they could find external data more easily, they would consider using it in their research. This suggests that one of the major blockers to these individuals using externally sourced data is that it is too hard to find.

These results reinforce our understanding that it can be very difficult for researchers to find and gain access to external sources of genomic data. There is a lot of data out there, but researchers either do not know where to look, struggle to wade through masses of inconsistently formatted data to find what they are looking for, or struggle to gain access to the data once they have found it. There is therefore a clear need to consolidate the metadata associated with genomic data into one format that can be searched via a single, internationally known portal. To address this pressing problem of poor data discoverability, Repositive has built an online platform (repositive.io) that provides a single point of entry for searching public genomic data repositories [Kovalevskaya et al., 2016].

Conclusion

This survey has been a great opportunity to gain deeper insight into the most popular tools and methods used in current genomics research. We found that, besides a couple of tools with greater popularity such as Ensembl or the UCSC Genome Browser, there is a long list of utilities with small and distinct user bases, making the landscape of currently used software tools large and diverse. This may reflect a diverse user base in the bioinformatics community with a wide range of needs regarding software functionality.

In the specific scope of variant analysis, reference genomes are still the most commonly used type of reference data set, although other types of reference data (e.g. control group data in genome-wide association studies or paired tissue samples in cancer research) are receiving ever more attention in genomics research [Wu et al., 2010; Beroukhim et al., 2006]. Whether the choice of reference genomes in variant analysis is made mostly by preference or by necessity, due to a lack of more insightful data sets, is a question still left unanswered and one that should be addressed in a future user survey.

Population studies are an aspect of the majority of survey participants’ research, but only half of those involve ethnic subpopulations. An interesting direction for another survey would be to ask participants about their motivation for doing more specific subpopulation studies, involving parameters other than ethnicity to create data subsets, under the assumption that the required data were available for their research.

Additionally, we gained greater insight into genomics researchers’ use of external data resources in their everyday work. The respondents predominantly use external data sources to power their research, and in greater numbers than found by van Schaik et al. in 2014. They mainly find this data by searching databases and repositories, via Google, or by reading publications sourced through PubMed. However, there are many steps in the workflow of finding external data that researchers struggle with. All of the researchers who do not use external data to power their research said they would do so if the data were easier to find.

In the future it would be interesting to look in greater detail at the perceived or experienced blockers researchers face when trying to source external data to power their research. Before we can overcome the hurdle of helping researchers find data, we first have to help them understand why it is important to use external data in their research. Furthermore, it will be important to gain a greater understanding of the struggles and blockers users face when trying to find data, and how Repositive might be able to solve those problems.

The survey is still open - if you want to participate and have your say, fill out our short 5-10 minute survey here. Your responses will give crucial insight into the requirements of a web-based tool that helps professionals like yourself analyse genomic data and supports their decision processes in a research context.

After completing the survey, enter your email address to get live updates on this survey and be entered into a draw to win a Repositive T-shirt!

About the authors

Jessica is a web developer at the Earlham Institute, Norwich, UK and a contributor to the BioJS project - the leading JavaScript-based component library for manipulating and visualising biological data on the web. She wants to understand which needs drive life scientists in their choice of genomic data analysis tools, in order to create better, more intuitive and user-friendly web experiences.

Charlotte is the Product Manager for Repositive and is interested in understanding why researchers currently use external sources of genomic data to power their research and how they go about finding this data. Repositive is a social enterprise that is building an online platform that indexes genomic data stored in repositories and thus enables researchers to search for and access a range of human genomic data sources through a single portal.

Sources

[1] Conesa A, Mortazavi A. The common ground of genomics and systems biology. BMC Systems Biology. 2014;8(Suppl 2):S1. doi:10.1186/1752-0509-8-S2-S1.

[2] De Brevern AG, Meyniel J-P, Fairhead C, Neuvéglise C, Malpertuy A. Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BioMed Research International. 2015;2015:904541. doi:10.1155/2015/904541.

[3] Wu MC, Kraft P, Epstein MP, et al. Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. American Journal of Human Genetics. 2010;86(6):929-942. doi:10.1016/j.ajhg.2010.05.002.

[4] Beroukhim R, Lin M, Park Y, Hao K, Zhao X, Garraway LA, Fox EA, Hochberg EP, Mellinghoff IK, Hofer MD, et al. Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays. PLoS Comput Biol. 2006;2:e41. doi: 10.1371/journal.pcbi.0020041.

[5] van Schaik TA, Kovalevskaya NV, Protopapas E, Wahid H, Nielsen FGG. The need to redefine genomic data sharing: A focus on data accessibility. Applied and Translational Genomics. 2014. doi:10.1016/j.atg.2014.09.013.

[6] Kovalevskaya NV, Whicher C, Richardson TD, Smith C, Grajciarova J, Cardama X, et al. DNAdigest and Repositive: Connecting the World of Genomic Data. PLoS Biology. 2016;14(3):e1002418. doi:10.1371/journal.pbio.1002418.