The Codon Adaptation Index Analyser Package - Introduction

- Author: Matteo Ramazzotti
- e-mail: matteo.ramazzotti@unifi.it
- download: At the moment only a Linux version is available for download here
- source code: all the programs are Perl scripts...


The CAIAP Project (Linux version) is a collection of command-line utilities devoted to codon usage analysis. Only a few graphical interfaces are available so far, but all the programs can be used interactively. Here you can download an abstract of the preliminary work presented at the congress "Proteine 2004" held in Viterbo (Rome, Italy).
The evaluation measure used by the programs is the Codon Adaptation Index (CAI), a value that compares the codon frequencies of a given gene with those of a reference dataset of genes (which can be the whole-genome CDS or a restricted set of highly expressed genes, obviously with a different meaning...). The index ranges from 0 to 1, according to the degree of adaptation of a coding sequence to the dataset. We define two kinds of CAI: the gCAI (genomic CAI), based on the whole-genome Codon Usage Table (CUT), and the CAI proper, based on a reference dataset of highly expressed genes.
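
For reference, the index in the form introduced by Sharp & Li can be written as below. The symbols are notation chosen here for illustration, not taken from the package: x_ij is the count of the j-th synonymous codon of amino acid i in the reference dataset, w_l is the weight of the l-th codon of the gene being scored, and L is the number of codons scored.

```latex
w_{ij} = \frac{x_{ij}}{\max_{k} x_{ik}}, \qquad
\mathrm{CAI} = \left( \prod_{l=1}^{L} w_{l} \right)^{1/L}
             = \exp\!\left( \frac{1}{L} \sum_{l=1}^{L} \ln w_{l} \right)
```

A gene that uses only the most frequent synonymous codon of the reference set for every amino acid therefore scores 1, and the score decreases as rarer codons are used.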

The scheme below illustrates the steps we propose for a correct use of our tools, a sort of "How we do it"...



Here follows a brief description of what the package does, uses and contains:

DATA SOURCE

FTP NCBI database: the database collects the genomic data of all sequenced organisms and contains many formatted and unformatted files. Our programs use only part of it, namely three kinds of files: .ptt (annotation tables), .ffn (coding sequences in fasta format) and .faa (protein sequences in fasta format). For completeness, note that one file of each kind exists for each chromosome (or plasmid) of each organism. Keeping them separate makes sense from the NCBI point of view (and is also natural, given base numbering and gene retrieval), but we offer a tool to fuse all the information into a single file (or, better, into three single files).
This tool is called
ChromoFusion and basically requires a listing of the .ptt files: the .ffn and .faa files must have the same names as the .ptt files for the program to find them. They are fused into new files with an indexing procedure that preserves the identity of each chromosome and its order of appearance. The fusion occurs in the order of the input. Please do not read this as a biological mistake: we keep all the sequences together only to speed up further analysis. We are not treating all the chromosomes and plasmids as a single entity (we know well that they are subject to different rates of evolution and different codon constraints); we just assume that they are contained in the same organism.
At this point the information is gathered together and ready for examination, but there are still some points to take into account. First of all, comparing the .ptt and .ffn files shows that they do not use the same notation: in the .ffn files the fasta headers give the chromosome start and end of the sequence, together with a 'c' if the sequence lies on the complementary strand (referred to here as the lagging strand) and multiple position pairs if the sequence spans introns and exons. In the .ptt files all this supplementary information is tab separated and only one start-end pair is used to locate a sequence. The tool called
NCBIreformat reformats the .ffn files to make them fully compatible with the .ptt files, which speeds up retrieval. It basically modifies the fasta headers in the .ffn files so that they match the entries in the .ptt file.
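
The header conversion can be pictured with the following Python sketch. It is only an illustration, not the NCBIreformat code (which is a Perl script): the header layout, the "start..end" location style and the function name ffn_header_to_ptt_key are all assumptions made here.

```python
import re

# Assumed .ffn header layout (not confirmed by the package documentation):
#   >gi|...|ref|NC_xxxxxx|:c5020-3562 description
# A leading "c" marks the opposite strand; several comma-separated
# start-end pairs mark a sequence made of more than one segment.
SPAN = re.compile(r"(c?)(\d+)-(\d+)")

def ffn_header_to_ptt_key(header):
    """Return (location, strand) in an assumed .ptt style, e.g. ('3562..5020', '-')."""
    # Text after the first ':' and before the first space holds the coordinates.
    coords = header.partition(":")[2].split(" ", 1)[0]
    spans = SPAN.findall(coords)
    if not spans:
        raise ValueError(f"no coordinates found in header: {header!r}")
    strand = "-" if any(flag for flag, _, _ in spans) else "+"
    positions = [int(p) for _, start, end in spans for p in (start, end)]
    # A single start-end pair is kept, as in the .ptt files.
    return f"{min(positions)}..{max(positions)}", strand
```

With such a key, a sequence in the .ffn file can be looked up directly from the corresponding row of the annotation table.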
The NCBI curators revise their data when sequencing projects are refined: it sometimes happens that new ORFs are found in the middle of a chromosome or elsewhere. The new protein sequence and coding sequence are then appended at the end of the .faa and .ffn files, respectively, while the annotation in the .ptt file is placed at the correct position in the chromosome. This misaligns the files and produces an annotation shift that compromises the work of our programs. For this we created
NCBIresorter, which reads the entries in the .ptt file and sorts the .faa and .ffn files so that they follow the annotation table. Now everything seems annotated...
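
A minimal sketch of the re-sorting idea, in Python for illustration only: the function name resort_records and the use of a shared key (for example a location string) to pair table rows with fasta records are assumptions, not a description of the actual script.

```python
def resort_records(ptt_keys, records):
    """Order fasta records so that they follow the annotation table.

    ptt_keys: keys (e.g. location strings) in the order of the .ptt file.
    records:  dict mapping the same kind of key to a (header, sequence) pair.
    Returns (ordered, missing): the records in table order, and the keys
    for which no sequence was found.
    """
    ordered, missing = [], []
    for key in ptt_keys:
        if key in records:
            ordered.append(records[key])
        else:
            missing.append(key)
    return ordered, missing
```

Entries appended at the end of the sequence files are moved back to their place in the table, and keys without a sequence are reported instead of silently shifting the alignment.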
At this point we need to ensure that all the information in the files is correct in terms of DNA characters and free of errors in the coding frames. The need for perfect coding sequences arises because DNA triplets (codons) containing ambiguous characters, e.g. TTN or ATR, are not compatible with a codon usage analysis: we prefer to discard imperfect sequences (though biologically important) rather than introduce such confusion. A tool called
NCBIcheck is required to make all the corrections. It compares the .ffn, .faa and .ptt files to check that the information in the three files agrees (non-DNA characters, sequence length, chromosome position, and so on) and produces perfect versions (from the CAIAP point of view) of the files.
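
Two of the checks mentioned above can be sketched in Python as follows. The function names are hypothetical, and the length test assumes that the coding sequence includes its stop codon while the protein sequence does not; the real NCBIcheck may apply different or additional rules.

```python
def is_clean_cds(seq):
    """True if seq contains only A/C/G/T and is a whole number of codons long."""
    seq = seq.upper()
    return len(seq) > 0 and len(seq) % 3 == 0 and set(seq) <= set("ACGT")

def cds_matches_protein(seq, protein):
    """True if the CDS length agrees with the protein length.

    Assumption: the CDS carries one stop codon that is absent from the protein.
    """
    return len(seq) == 3 * (len(protein) + 1)
```

A sequence such as ATGTTN fails the first test because of the ambiguous character, and a sequence whose length is not a multiple of three fails it because of the broken frame.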

OBTAINING A REFERENCE SET OF GENES

There are two programs dedicated to HIGHLY BIASED GENES retrieval, an automatic one and a "manual" one.

AutoHighXP is based on an iterative process consisting of 4 phases:
  1. Generation of a global Codon Usage Table
  2. Calculation of CAI for all the genes in analysis
  3. Sorting of the genes according to CAI values
  4. Purging of the lowest CAI genes
The four steps are repeated until a subset containing 1% of the genes of the original set is produced: this is called the reference set and it is composed of the most biased genes of the genome under analysis. The user only has to choose the shortening factor, which sets the sensitivity of the algorithm and, consequently, its processing time.
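
The loop can be sketched in Python as below. This is one reading of the four phases, not the AutoHighXP code: the score argument is a hypothetical callable standing in for "build the Codon Usage Table from the current set, then compute the CAI", the shrink parameter plays the role of the shortening factor, and scoring only the genes still in the set (rather than the whole genome) at each round is an assumption.

```python
def auto_reference_set(genes, score, shrink=0.5, target_fraction=0.01):
    """Iteratively keep the highest-scoring genes until about 1% remain.

    genes:  dict mapping gene name to coding sequence.
    score:  hypothetical callable (sequence, reference_sequences) -> float.
    shrink: fraction of the genes kept at each round (between 0 and 1).
    """
    if not 0 < shrink < 1:
        raise ValueError("shrink must be between 0 and 1")
    target = max(1, round(len(genes) * target_fraction))
    current = dict(genes)
    while len(current) > target:
        reference = list(current.values())                     # phase 1: table source
        scores = {name: score(seq, reference)                  # phase 2: score genes
                  for name, seq in current.items()}
        ranked = sorted(current, key=scores.get, reverse=True) # phase 3: sort
        keep = max(target, int(len(ranked) * shrink))          # phase 4: purge
        current = {name: current[name] for name in ranked[:keep]}
    return current
```

A shrink value close to 1 removes few genes per round, so the table is recomputed more often and the run takes longer; a small value converges quickly but more coarsely.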

ManuHighXP is meant to give the user complete control over gene inclusion in the reference dataset. Once the files are loaded, the program produces a global Codon Usage Table and calculates the CAI of all the genes, as in the automatic version. The user has to choose a CAI inclusion threshold (e.g. 0.7 or 0.8): once the choice is made, the genes with CAI values over that threshold are displayed and added to the reference set. Initially this results in a coarse screening of the genomic data, but this preliminary reference set can be used to calculate a new and more stringent Codon Usage Table. Once this is done, the new CUT is applied to the genome (the CAI values are calculated again) in order to improve the significance of the reference set: some genes are added, some others disappear. This procedure must be repeated until a set containing 1% of the initial set of genes (often, the whole genome) stabilizes. This is the true reference set.


READY TO START THE ANALYSIS

The work essentially consists in producing Codon Usage Tables and calculating CAI values. Two programs are devoted to that:
CUTabler accepts a fasta-formatted gene dataset (the reference one or the whole genome) and returns a Codon Usage Table in which the codon weights are also reported (unlike the other CUT-producing programs available). The weights are used mainly by
CAIculator, which calculates the CAI of a set of fasta-formatted gene sequences. Both programs are capable of many other functions, but essentially they do what is described here.
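
The two calculations can be sketched in Python as follows. This is an illustration under stated assumptions, not the CUTabler or CAIculator code: the standard genetic code is used, Met, Trp and stop codons are left out of the score, and codons absent from the reference table are skipped rather than given a small pseudo-weight. The function names are chosen here.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons(seq):
    """Split a sequence into complete codons, dropping any trailing bases."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def codon_weights(reference_seqs):
    """Weight of a codon = its count / count of the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in codons(s) if c in CODON_TO_AA)
    best = Counter()
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {c: n / best[CODON_TO_AA[c]] for c, n in counts.items()}

def cai(seq, weights):
    """Geometric mean of the weights of the scored codons of seq."""
    # Assumption: Met, Trp, stop codons and codons missing from the table are skipped.
    logs = [math.log(weights[c]) for c in codons(seq)
            if c in weights and CODON_TO_AA[c] not in "MW*"]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0
```

Passing the whole genome to codon_weights gives weights for a gCAI, while passing the reference set of highly biased genes gives weights for the CAI proper.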
Randomizer is a script that randomizes fasta-formatted DNA sequences: this makes it possible to check the consistency of the results given by the CAI calculation. Several options are available when randomizing.
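
Two possible randomization schemes are sketched below in Python. They are examples chosen here, not the documented options of Randomizer, and the function names are hypothetical.

```python
import random

def shuffle_nucleotides(seq, rng=random):
    """Shuffle single bases: base composition is kept, codon structure is not."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def shuffle_codons(seq, rng=random):
    """Shuffle whole codons: codon usage is kept, only their order changes."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    rng.shuffle(triplets)
    return "".join(triplets)
```

Since the CAI is a geometric mean over codons, a codon-level shuffle leaves it unchanged, whereas a base-level shuffle gives a background value expected from base composition alone. Passing a seeded random.Random instance as rng makes a run reproducible.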

WORKING WITH RESULTS

We propose some methods to easily understand and organize the mass of data that can be generated by our programs: we have developed a results sorter called
CAIsorter, which can re-sort the output of the CAI-calculating tools (naturally ordered like the fasta-formatted input sequences on which the CAIs were calculated, e.g. by chromosome position if a raw .ffn file is used) according to a number of criteria, such as chromosome position, CAI value, annotation, COG code, COG class and so on, in either ascending or descending order.
geneCLAST is meant to cluster gene results according to a preliminary functional classification. It basically automates BLAST calls on all the coding sequences present in a .ffn file. The coding sequences are automatically translated and searched with BLASTp against a database (built from the sequences of the COG, Clusters of Orthologous Groups, database) of sequences with verified annotations and functional pre-clustering. Each BLAST result is parsed and the information of the first hit is taken as the reference for the coding sequence under analysis.
CAInorm is useful when inter-genomic comparison is the goal of the analysis. Since CAI values are based on the set of highly biased genes, which may (and almost certainly does) change from organism to organism, the CAIs of different genomes are not directly comparable. CAInorm calculates the average CAI value and the standard deviation over all the results and expresses each CAI value as a number of standard deviations from the average. This makes clear the proportion of genes that deviate sufficiently from the average and should be considered of particular interest.
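
The normalization amounts to a standard score, sketched here in Python. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import mean, pstdev

def normalize_cai(values):
    """Express each CAI as its distance from the genome mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("all CAI values are identical")
    return [(v - mu) / sigma for v in values]
```

After this step a value of 2 means "two standard deviations above the average of its own genome", which can be compared across organisms even though the raw CAIs cannot.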
CAIdist is used to characterize the population of CAI values of a whole genome. The program asks for an interval (e.g. 0.01) and creates a new table containing the frequencies of the CAI values from 0 to 1 in steps of that interval. The program also asks for a sampling number, from which it derives, for each interval, the frequency expected under an ideal normal distribution. This is used to evaluate (with an R^2 value) how well the genomic CAI distribution fits a normal distribution.
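
A Python sketch of the binning and of the comparison with a normal distribution follows. It is an illustration only: the function names are chosen here, the normal curve is fitted with the mean and standard deviation of the data, and the number of genes is used in place of the sampling number asked for by the program.

```python
from statistics import NormalDist, mean, pstdev

def cai_histogram(values, step=0.01):
    """Count CAI values in consecutive bins of width `step` covering 0 to 1."""
    n_bins = round(1 / step)
    counts = [0] * n_bins
    for v in values:
        # Rounding guards against values such as 0.15 / 0.1 = 1.4999999999999998;
        # a CAI of exactly 1 is placed in the last bin.
        index = int(round(v / step, 9))
        counts[min(index, n_bins - 1)] += 1
    return counts

def normal_fit_r2(values, step=0.01):
    """R^2 between observed bin counts and counts expected from a fitted normal."""
    observed = cai_histogram(values, step)
    dist = NormalDist(mean(values), pstdev(values))
    expected = [len(values) * (dist.cdf((i + 1) * step) - dist.cdf(i * step))
                for i in range(len(observed))]
    average = mean(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, expected))
    ss_tot = sum((o - average) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

Values of R^2 close to 1 indicate that the genomic CAI distribution is well described by a normal curve; the sketch assumes the CAI values are not all identical.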
Chromoscan is used to find islands of common expression among contiguous genes: once CAI values have been established, and if the data are sorted by chromosome position, it is possible to make inferences about topological aspects of gene expression. For example, it is known that in fast-growing organisms many highly expressed genes, such as ribosomal ones, are located near the replication origin (due to the multiple chromosome duplication strategy); this increases the gene copy number and, in the end, the protein amount. With this tool it is possible to define a gene window over which to perform operations (e.g. sum, product, scaling, etc.) on CAI values, in order to highlight local characteristics of codon bias and possibly of gene expression. By varying the window size it is possible to identify regions of common codon usage and to determine the number of genes they contain: we may locate, for example, ribosomal protein operons (possible operons!) whose centered value (after filtering by Chromoscan) is much higher than the values of the preceding and following windows.
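
The windowing idea can be sketched in Python as below. The function name window_scan and its parameters are assumptions made for illustration; the input is assumed to be a list of CAI values already sorted by chromosome position, and the chromosome is treated as linear.

```python
def window_scan(cai_values, size, op=sum):
    """Apply `op` to every run of `size` consecutive genes."""
    if not 1 <= size <= len(cai_values):
        raise ValueError("window size must be between 1 and the number of genes")
    return [op(cai_values[i:i + size])
            for i in range(len(cai_values) - size + 1)]
```

For example, window_scan(values, 5) returns the sum of the CAI over each group of five neighbouring genes, and passing op=math.prod (after importing math) gives the product instead. A window whose value stands well above its neighbours points to a cluster of genes with a similar, strong codon bias.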
GeneLocator is still in progress and is meant to plot the spatial distribution of a set of genes (e.g. the reference set or the genes over a defined threshold) along the chromosome, in order to locate islands of high expression or other chromosome peculiarities. It should also make clear whether a gene of interest is on the leading (+) or on the lagging (-) strand.


GATHERING ALL TOGETHER

At the moment we are developing a Graphical User Interface (GUI) that should be able to perform all the operations described above by launching the specific tools in an ordered and intuitive way.
A snapshot of the GUI is here below




VALIDATION

The Codon Adaptation Index is a consistent measure of how well a sequence fits the reference dataset used and, ultimately, of gene expression. The major problem is the iterative method by which highly expressed genes are extracted from whole-genome data using only the coding sequences and their CAI.
To validate the system we compared our results with the reference dataset produced by A. Carbone et al. with a similar system (which was validated with microarray data).
The optimal number of genes in the reference dataset should be 1% of the whole genome (e.g. 40 genes from a 4000-gene genome), as proposed by Sharp & Li when the Codon Adaptation Index was introduced. Below are some graphs in which the expression values (CAI) of all the genes of some genomes are evaluated with the different reference datasets and with reference sets of various sizes.

Bacillus subtilis (4109 genes)

Haemophilus influenzae (1608 genes)

Escherichia coli (4308 genes)


As can easily be seen from the colours, the A. Carbone et al. dataset and the one generated with our programs are very similar, and therefore they give very similar expression (CAI) values when the dataset contains 1% of the genes of the genome. This means that our method is sufficiently reliable.
The next image shows the gene expression (CAI) distribution for all the genomes indicated above. It shows how different genome expression can be, and how useful such an approach can be, especially when microarray data and other experimental evidence are missing or are too expensive or time consuming to obtain.


