FASTA Files: The Birth of a New Pattern

FASTA Files: The Birth of a New Pattern

ยท

7 min read

In the 1980s, the technological landscape was vastly different from what we see today. The idea of computers having terabytes of storage or gigabytes of memory was, perhaps, beyond even the wildest dreams of most.

However, even then, the ever-present DNA was already subject to study. By 1977, we had the Sanger sequencing method, a revolutionary tool for decoding this complex molecule.

As DNA sequences started becoming accessible, the need for computational algorithms grew. Researchers required tools to interpret the biological sequences represented by these newly discovered letters. One common question - that still is important today - is to find among our known sequences which one is the closest to one newly identified.

In this context sequence aligners have emerged. Here I present a brief timeline:

  1. Early Days (Late 70s - Early 80s): The story begins with dynamic programming-based methods like the Needleman-Wunsch algorithm for global alignment and Smith-Waterman for local alignment, introduced in 1970 and 1981 respectively. These were computationally intensive and worked well for small sequences.

  2. Heuristic Methods (Late 80s - Early 90s): With the rise of larger sequence databases, heuristic methods like FASTA (1985) and BLAST (1990) emerged. These offered quicker, albeit approximate, alignments.

  3. Next-Generation Sequencing Era (2000s - Present): The advent of next-generation sequencing led to billions of short reads needing alignment. This prompted the development of efficient aligners like Bowtie (2009) and BWA (2010), using Burrows-Wheeler Transform and FM-index to handle the data deluge.

The FASTA software, developed in the late 80s, is faster than the ones from the late 70s. The FASTA format, which remains in use today, was specifically created for this tool.

I would like to state that the FASTA format was the pioneering standard adopted worldwide in the realm of bioinformatics. However, I'm still on the hunt for concrete references to corroborate this claim ๐Ÿ˜”

Your insights are welcome! Please feel free to leave a comment if you stumble upon relevant information!

Format Anatomy

A FASTA file is a text file used for storing either nucleotide or peptide sequences. The structure of a FASTA file is simple. It starts with a single-line description, followed by lines of sequence data.

The description line is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It's often used to hold a sequence identifier and any additional information about the sequence.

The sequence data is composed of nucleotides (A, C, G, T/U for DNA/RNA sequences) or peptides (single-letter coded amino acids for protein sequences). The sequence can be spread over multiple lines, and line breaks can be inserted at any point. However, each line typically consists of 60 or 80 characters for readability.

For instance, a simple DNA sequence in FASTA format might look like this:

>contig1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
GACTGCAACG

The FASTA file format can accommodate both a single sequence or multiple sequences within the same file. When multiple sequences are included, each sequence is introduced by a distinct description line. The sequences themselves are then listed consecutively, allowing for efficient organization and retrieval of information.

How to represent base quality?

When working with DNA sequencing it's common to have another file together. While the FASTA file contains the actual sequence of nucleotides (i.e., the DNA sequence), the corresponding .qual file provides a quality score for each base in that sequence.

The quality scores in a .qual file are calculated during the base calling process. Here's an example:

>contig1
30 31 29 30 31 29 30 31 29 30 31
29 30 31 29 30 31 29 30 31 29 30
...

In this example, the number "30" correspond to the first base in the contig1 in the FASTA file, "31" to the second base, and so on. The actual numbers represent the quality scores of the bases, typically in Phred quality score format, which is a logarithmic scale.

Originating from the mid-90s, the Phred score also carries a rich historical significance. It remains a valuable tool in modern sequencing practices, as it provides an estimate of the quality for each base within our sequence files.

For this article, we can establish that the Phred score indicates how much can you trust the nucleotide base it is representing. Please check the definition present in Wikipedia to know more about this topic, I think it's very interesting the table that maps the phred score to the base call accuracy. It's essential when comparing different sequencing technologies (like Illumina and ONT).

"Base calling" is the process where raw signals collected by the sequencing machine are converted into a text representation of DNA. It is those A, C, T, and G's we can see in FASTA files.

We'll dive back into this topic when I write about FASTQ files.

FASTA file companions

There are several files you can see together with the FASTA file. I'll cite only those that are part of my routine:

.dict: usually related to Broad Institute tools, like Picard and GATK.

.fai: required when you want to query for specific regions in your sequence. Essentially, it serves as an index that enables programs to quickly access particular sections of the file, eliminating the need to read it in its entirety.

.bwt and others: used by the BWA tool, a software package for mapping short reads against a large reference genome.

.mmi: used by the minimap2 program, a sequence aligner for long reads.

Playing around

To be less abstract, here we can manipulate a FASTA file to get familiarized with the format.

For this example you'll need:

  • wget (or curl if you prefer)

  • samtools ( it can be installed using apt-get install samtools, if using Debian-like distro)

Obtaining FASTA files from the web

Firstly, we will download chromosomes of the human genome from Google Genome References. It's very nice how Google has organized some important resources for bioinformatics in this documentation, check it out!

# Enter a directory of your choice. I'll use ~/fasta-example
mkdir ~/fasta-example && cd ~/fasta-example

# Download chromosome 22 and the mitochondria. Only two for the sake of your disk
wget https://storage.googleapis.com/genomics-public-data/references/GRCh38/chr22.fa.gz
wget https://storage.googleapis.com/genomics-public-data/references/GRCh38/chrMT.fa.gz

It's common for sequence files to be compressed (gzip) before distributing. For this example, we will decompress and concatenate in a single file.

gunzip -c chr22.fa.gz > chromosomes.fasta
gunzip -c chrMT.fa.gz >> chromosomes.fasta
# you can now remove chr22.fa.gz and chrMT.fa.gz
rm chr22.fa.gz chrMT.fa.gz

At this point, you'll see only the chromosomes.fasta file in your directory.

Getting your sequence identifiers

Now we can start checking the identifiers of the sequences you have:

grep "^>" chromosomes.fasta 
>gi|568336002|gb|CM000684.2| Homo sapiens chromosome 22, GRCh38 reference primary assembly
>gi|113200490|gb|J01415.2|HUMMTCG Homo sapiens mitochondrion, complete genome

The program grep comes with all main Linux distributions and its general use is to check if a text file has a pattern you want. Here we are using it to print all lines in the fasta file that starts ("^") with ">" symbol. If you don't know why we are looking for this symbol please go back to Format Anatomy section of this article.

Subsetting a multi-sequence fasta

Now that we know that we have two sequences in our FASTA file and also know their identifiers we can select one, for whatever purpose you need.

samtools faidx chromosomes.fasta "gi|113200490|gb|J01415.2|HUMMTCG" > mitochondria.fasta

By doing this you'll have only the sequence representing the mitochondria in the new file called mitochondria.fasta.

You will also note a new file in your directory: the .fai that we've already presented in the section FASTA file companions.

Using samtools you can also extract a region from your FASTA file, here is an example of getting a subset of the mitochondrial sequence, from the first base until the 100th.

โฏ samtools faidx chromosomes.fasta "gi|113200490|gb|J01415.2|HUMMTCG:1-100"
>gi|113200490|gb|J01415.2|HUMMTCG:1-100
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT
CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTG

In programming languages

There are various libraries available that simplify the manipulation of biological sequences, such as those in the FASTA format.

If you are interested, check out these two libraries:

[Python] Biopython: This freely available Python library provides tools for biological computation. It includes the Bio.SeqIO module, which can be used for parsing FASTA files. You can find examples of how to use this here.

[R] SeqinR: This package, which is part of the larger Bioconductor project, can be used to manipulate FASTA files. You can find an example of how to use this package here.

There are also libraries available for a range of other programming languages including Go, Rust, and more. Feel free to explore these options based on your specific requirements and preferences.

Acknowledgment: The distinctive icon featured in the cover image was sourced from:

https://nf-co.re/developers/design_guidelines#components

Their design aesthetics greatly enhance the visual appeal of bioinformatics pipelines.

Did you find this article valuable?

Support Lucas Taniguti by becoming a sponsor. Any amount is appreciated!

ย