
Bioinformatics basics: where to get data
What is bioinformatics?
Bioinformatics is a branch of biology that uses computation to study patterns in the DNA, RNA, and protein sequences of animals, bacteria, and viruses. It asks how genomic codes produce functional output in cells, how the different layers of genetic code relate to each other, and how these codes contribute to development, aging, and disease. If these are new concepts to you, I suggest reading the NIH Human Genome Introduction to learn how these molecules are built from repeating structures that form a genetic code.
Genomic information is handy for understanding the inner workings of cells, and also for identifying organisms in a mixed population. This is especially relevant to describing microbiomes, which can be composed of hundreds of different species of microbes.
Where does genomic data come from?
Simply put, genomic data comes from labs. It is derived either from experiments, like measuring mRNA after a chemical treatment, or directly from patient samples, which can be used to understand the genetic contribution to pathology.
What’s so exciting about bioinformatics is that you don’t need a lab to access genomic data. Many datasets are publicly available in online repositories, such as:
-NCBI Gene Expression Omnibus (GEO)
-International Cancer Genome Consortium
-Cancer Cell Line Encyclopedia
-ENCODE, Encyclopedia of DNA Elements
These repositories are either government funded or privately funded, but all of them offer data that is free to download. Where data is stored depends on its size, on whether it has been published, and on whether it contains sensitive patient information. It can be kept locally, in the cloud, or deposited in large data storage initiatives like those in the list above.
How to store and open genomic data
Data is stored either in a file, in a database, or behind an online application programming interface (API). A few typical file formats are plain text and tab-delimited text (.txt), binary alignment files (.bam), and XML markup (.xml), among many others.
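To make the tab-delimited format concrete, here is a minimal sketch of reading such a table. It is written in Python for illustration, and the same steps map onto R's read.delim. The column names and expression values are invented for the example and do not come from any real dataset.

```python
import csv
import io

# Hypothetical tab-delimited expression table. The gene symbols are real,
# but the sample columns and values are made up for illustration.
EXAMPLE_TSV = (
    "gene\tcontrol\ttreated\n"
    "TP53\t12.5\t30.1\n"
    "BRCA1\t8.0\t7.6\n"
    "MYC\t20.3\t45.9\n"
)


def read_expression_table(handle):
    """Parse a tab-delimited table into {gene: {sample: value}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    table = {}
    for row in reader:
        # The first column identifies the gene; the rest are measurements.
        gene = row.pop("gene")
        table[gene] = {sample: float(value) for sample, value in row.items()}
    return table


# io.StringIO lets the example run without a file on disk.
expression = read_expression_table(io.StringIO(EXAMPLE_TSV))
print(expression["TP53"])
```

To read a real file instead, pass an open file handle, for example open("my_table.txt", newline=""), where my_table.txt is a placeholder name. Binary formats like .bam need a dedicated library and cannot be read this way.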
Sequence files are imported into a platform for visualization and analysis.
If you want to visualize sequences, you can use a genome browser like Ensembl, the UCSC Genome Browser, the NCBI Genome Data Viewer, or the 3D Genome Browser, which is built for exploring chromatin interactions. Genome browsers are very useful, but limited in analytical power. For deeper bioinformatic analysis, you need programming tools like R, Python, and the Unix shell. In this blog, I’ll be describing operations in R.
The future of genomic data storage
The sequencing data for a single human genome takes up about 100 GB of storage space, and typically more once necessary associated files, like experimental metadata and sequence corrections, are included. Researchers predict that between 100 million and 2 billion human genomes will have been sequenced by 2025. The future of genomic data storage is an active field of development, with many technology companies placing their bets on cloud computing.
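To get a feel for the scale, here is a rough back-of-the-envelope calculation in Python. It takes the figures above at face value (about 100 GB per genome, 100 million to 2 billion genomes) and assumes decimal units, where 1 exabyte is 10^9 GB. It ignores compression and the extra associated files, so treat the result as an order-of-magnitude estimate only.

```python
# Figures taken from the text above; both are approximations.
GB_PER_GENOME = 100
GENOMES_LOW = 100_000_000
GENOMES_HIGH = 2_000_000_000

# Assumption: decimal units, so 1 exabyte (EB) = 10**9 gigabytes (GB).
GB_PER_EXABYTE = 10**9


def total_exabytes(n_genomes, gb_per_genome=GB_PER_GENOME):
    """Estimate total storage in exabytes for a given number of genomes."""
    return n_genomes * gb_per_genome / GB_PER_EXABYTE


low = total_exabytes(GENOMES_LOW)
high = total_exabytes(GENOMES_HIGH)
print(f"{low:.0f} to {high:.0f} exabytes")  # prints "10 to 200 exabytes"
```

Even the low end of that range is far more than a single lab could hold locally, which helps explain the interest in cloud storage.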