The genome is a complex system far from equilibrium. In such systems, properties often emerge spontaneously through the interactions of their units. Examples of this self-organisation are seen in many systems and are widely reflected in power-law distributions of unit sizes. In linguistics, Zipf's law is the most common example of a power law, and it also holds in a variety of natural and anthropogenic systems. The ubiquity and diversity of the phenomena that exhibit Zipf's law point to underlying universal properties that are yet to be discovered. The Menzerath-Altmann law is another power law first observed in human languages; it states that the larger the whole, the smaller its parts. It was initially applied to human languages at the level of words and syllables, and was recently applied to the genome at the level of genes and exons. In the human genome, the more exons a gene contains, the shorter its exons tend to be in nucleotides. Further investigation showed that accordance with the Menzerath-Altmann law weakens with increasing transcriptional complexity, with the appearance of alternative exons, and in genes with high sequence conservation. We extended this analysis by assessing conformity with the Menzerath-Altmann law in human gene families.
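For reference, the law can be written in the power-law form that is common in the quantitative-linguistics literature. The symbols below are introduced here for illustration and are not taken from the abstract: x is the number of exons in a gene or transcript, y is the mean exon size in nucleotides, and a and b are fitted constants.

```latex
y = a\,x^{b}
\qquad\Longleftrightarrow\qquad
\log y = \log a + b \log x
```

Under this reading, conformity with the law corresponds to a negative exponent (b < 0), which appears as a descending straight line on a log-log plot, while b > 0 corresponds to the reversed trend.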
Human exons were downloaded from Ensembl under the Ensembl Genes 79 annotation scheme. The data consisted of exon coordinates, transcript ID and family ID for all human exons. We performed the analysis in R and Perl at the transcript level, because a transcript contains the basic information for the production of one protein product. Each point in the plots represents the mean exon size of all transcripts that contain the same number of exons, and error bars represent the standard error of the mean. We performed these calculations on a logarithmic scale for each family that consisted of more than 100 exons and 9 transcripts.
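The aggregation described above could be sketched roughly as follows. This is an illustration in Python, not the R and Perl code used for the analysis. The function name, the input tuple layout and the parameter names are hypothetical. It also makes three assumptions: exon lengths are pooled across all transcripts that share an exon count, the logarithm is base 10 and is taken before averaging, and "more than 100 exons and 9 transcripts" is read as two strict inequalities.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def mean_exon_size_by_exon_count(exons, min_exons=100, min_transcripts=9):
    """Summarise exon sizes per family, grouped by exons per transcript.

    exons: iterable of (family_id, transcript_id, start, end) tuples with
    1-based inclusive coordinates (hypothetical input layout).
    Returns {family_id: {exon_count: (mean_log10_size, standard_error)}}.
    """
    # Collect exon lengths per transcript within each family.
    families = defaultdict(lambda: defaultdict(list))
    for family_id, transcript_id, start, end in exons:
        families[family_id][transcript_id].append(end - start + 1)

    result = {}
    for family_id, transcripts in families.items():
        total_exons = sum(len(sizes) for sizes in transcripts.values())
        # Assumption: both thresholds are strict ("more than").
        if total_exons <= min_exons or len(transcripts) <= min_transcripts:
            continue

        # Pool log10 exon lengths of all transcripts with the same exon count.
        pooled = defaultdict(list)
        for sizes in transcripts.values():
            pooled[len(sizes)].extend(math.log10(s) for s in sizes)

        points = {}
        for exon_count, values in pooled.items():
            # The standard error is undefined for one value; 0.0 is reported here.
            sem = stdev(values) / math.sqrt(len(values)) if len(values) > 1 else 0.0
            points[exon_count] = (mean(values), sem)
        result[family_id] = points
    return result
```

Each returned family maps an exon count to one plot point and its error bar. Averaging untransformed lengths and taking the logarithm afterwards is an equally plausible reading of the text and would give slightly different values.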
The correlation of mean exon size with the number of exons per transcript varied widely between gene families. Some families followed distributions compatible with those expected under the Menzerath-Altmann law, others followed the reverse distribution, and a significant number showed no strong correlation between mean exon size and exon number. This probably reflects the different evolutionary events that occurred in these families. We noticed that the Menzerath-Altmann law applies to families shaped by both expansion, mostly through duplication, and conservation.
In cases where conservation prevails over expansion, accordance with the Menzerath-Altmann law is weak. In contrast, gene families whose mean exon size increases with the number of exons per transcript were found to have a wide range of functions and a strong influence of duplication accompanied by low conservation. Some families also had their transcripts grouped in distinct clusters based on exon number. These families consist of genes that encode protein domains from different ancestors, so the clustered distribution is due to the differential expression of these domains through alternative splicing.
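The three patterns described above (conforming, reversed, and no strong correlation) could be separated with a simple log-log fit, as in the minimal sketch below. The abstract does not say which correlation measure or cut-off was used, so the Pearson coefficient, the `threshold` value of 0.5 and the function name `loglog_trend` are all assumptions made for illustration.

```python
import math


def loglog_trend(points, threshold=0.5):
    """Classify one family's trend of mean exon size against exon number.

    points: list of (exon_count, mean_exon_size) pairs, both positive.
    threshold: hypothetical cut-off on |r|, not a value from the source.
    Returns (slope, r, label) from a least-squares fit on log10 values.
    """
    xs = [math.log10(count) for count, _ in points]
    ys = [math.log10(size) for _, size in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # With no variation in either variable, no trend can be estimated.
    if sxx == 0 or syy == 0:
        return 0.0, 0.0, "no clear trend"

    slope = sxy / sxx                 # exponent b in y = a * x**b
    r = sxy / math.sqrt(sxx * syy)    # Pearson correlation on the log-log scale

    if r <= -threshold:
        label = "conforming"          # exon size falls as exon number grows
    elif r >= threshold:
        label = "reversed"            # exon size rises with exon number
    else:
        label = "no clear trend"
    return slope, r, label
```

For example, `loglog_trend([(1, 1000), (10, 100), (100, 10)])` describes a family whose mean exon size falls tenfold for each tenfold increase in exon number, so it is labeled "conforming". A single linear fit would not detect the clustered families mentioned above, which would need a separate check.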
Overall, our results suggest multiple modes of conformity with the Menzerath-Altmann law among protein families, which may be seen as a strong reflection of their different evolutionary trajectories.
See here for the poster.