Friday, March 20, 2009

A call for Hippo MC1R

So this is a strange connection between two recently published papers. I so wish we had hippo MC1R. If you don't know, MC1R is involved in pigmentation, but I am not so much interested in the pigmentation of hippos as in their phylogenetic relationship to cetaceans. In a recent paper, or rather a comment on a previous paper, Jonathan H. Geisler and Jessica M. Theodor weigh two phylogenies, ((Pig,Hippo),Whale) or (Pig,(Hippo,Whale)), using a number of data types (fossil, morphological, and genetic) to get at the question. What's at stake is whether the aquatic characters were derived separately in whales and hippos or are shared derived characters inherited from a common ancestor.
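To make the two competing topologies concrete, here is a minimal sketch that encodes each as nested tuples and asks which taxon ends up as the hippo's sister group. The tuple encoding and the helper names (`leaves`, `sister_of`) are my own illustration, not anything taken from either paper.

```python
def leaves(node):
    """Collect the taxon names under a node (a string is a single leaf)."""
    if isinstance(node, str):
        return {node}
    names = set()
    for child in node:
        names |= leaves(child)
    return names


def sister_of(tree, taxon):
    """Return the leaves of the sister group of `taxon`, or None if absent."""
    if isinstance(tree, str):
        return None
    for i, child in enumerate(tree):
        if child == taxon:
            # Everything else hanging off the same node is the sister group.
            sisters = set()
            for j, other in enumerate(tree):
                if j != i:
                    sisters |= leaves(other)
            return sisters
        found = sister_of(child, taxon)
        if found is not None:
            return found
    return None


# The two hypotheses under debate, written as nested tuples.
PIG_HIPPO = (("Pig", "Hippo"), "Whale")
HIPPO_WHALE = ("Pig", ("Hippo", "Whale"))

print(sister_of(PIG_HIPPO, "Hippo"))
print(sister_of(HIPPO_WHALE, "Hippo"))
```

Under the first tree the hippo's closest relative is the pig, so aquatic traits would have arisen twice; under the second it is the whale, so they could be inherited from a shared ancestor.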

They conclude (in line with the previous understanding, as mentioned in Jerry Coyne's new book) that (Pig,(Hippo,Whale)) is the most parsimonious. This is challenged in a commentary (a response from those who originally published the (Pig,Hippo) connection). Both groups conclude that the extinct raoellids are the closest relatives of cetaceans, but they disagree on the relationship between pigs, hippos, and whales. They are dealing with a problem of homoplasy, or convergence, in the data.

Later this week I came across another paper examining ungulate/cetacean phylogenies using MC1R, and I was shocked to see no hippo in the data set or on NCBI. Someone should totally sequence it, just to satisfy my curiosity, especially because the Nature paper is complaining of homoplasy while the MC1R paper concludes that MC1R is great for mammalian phylogenies because it has such good resolution at so many taxonomic levels, not to mention interesting information about pigmentation evolution. I know there are other studies out there, but this seems like a good candidate gene to get at the pig, whale, hippo question. It's unfortunate that hippos were not sequenced along with all the other artiodactyls and whales.

Monday, February 23, 2009

Lost dog genes found

Dogs can learn new tricks, or at least we can find those they already had. In a paper published recently, Derrien et al. (including Elaine Ostrander) look at the genes missing from the dog genome compared to the other high-quality published genomes.

When the dog genome was published, the existing methods of orthology detection (i.e. 1-to-1 gene detection from other genomes) and other gene prediction methods produced an annotation with 412 fewer genes than in the rat, mouse, chimp, and human genomes (the high-quality genomes). Three general hypotheses can explain this disparity: 1) the genes could be Euarchontoglires-specific gains, 2) they could be dog-specific (or Laurasiatheria-specific) gene losses, or 3) they could reflect an inadequacy of the annotation algorithm. Obviously these are not mutually exclusive, but Derrien et al. set out with new methods in hand to test these hypotheses for these specific genes.

Their methods: they used a very clever synteny method built on genes that do have annotated orthology and are found in the same orientation in the genome in all five species. As an example of how they used this information, let's say genes A, B, and C are all on the same chromosome and in the same order in the high-quality genomes (i.e. no inversions between these genes). Genes A and C are also next to each other in the dog genome, but B is one of the missing genes. Their method uses A and C as margins and searches the space between them for B's homolog in dog. This gives them more statistical power, because the search space is so much smaller and their a priori hypothesis of B's location gives a greater chance of finding a true positive, or at least the remnants of the missing gene.
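The A, B, C example can be sketched in a few lines of Python. This is only my reading of the idea, not the authors' code: the function names, the `(chromosome, start, end)` coordinate tuples, and the toy positions are all assumptions made for illustration, and the actual homology search inside the window is left out.

```python
def flanking_anchors(reference_order, missing_gene, dog_annotated):
    """Find the nearest annotated dog genes on each side of the missing gene,
    following the gene order shared by the high-quality genomes."""
    idx = reference_order.index(missing_gene)
    left = next((g for g in reversed(reference_order[:idx])
                 if g in dog_annotated), None)
    right = next((g for g in reference_order[idx + 1:]
                  if g in dog_annotated), None)
    return left, right


def search_window(reference_order, missing_gene, dog_annotated):
    """Return (chromosome, start, end) of the dog interval between the two
    anchors, or None if the anchors do not bracket a usable interval."""
    left, right = flanking_anchors(reference_order, missing_gene, dog_annotated)
    if left is None or right is None:
        return None
    chrom_l, start_l, end_l = dog_annotated[left]
    chrom_r, start_r, end_r = dog_annotated[right]
    if chrom_l != chrom_r:
        return None  # anchors are not syntenic in dog
    if end_l <= start_r:
        return (chrom_l, end_l, start_r)
    if end_r <= start_l:
        return (chrom_l, end_r, start_l)  # whole block inverted in dog
    return None  # anchors overlap, nothing to search


# Toy data (hypothetical coordinates): B is missing from the dog annotation.
reference_order = ["A", "B", "C"]
dog_annotated = {"A": ("chr1", 1000, 2000), "C": ("chr1", 9000, 12000)}

print(search_window(reference_order, "B", dog_annotated))
```

The point of the window is the one made above: searching a few kilobases between two anchors is a far smaller problem than searching the whole genome.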

The results were very encouraging for future genome annotation endeavors. The method identified or annotated 268 dog genes that were missing (36 of these were already in Ensembl, but their orthology was not known) (Hypothesis 3). In addition, they found some evidence for pseudogenized genes (34 with low and 21 with high support) (Hypothesis 2) and 37 undetected genes (possibly Hypothesis 1). 29 were not identified using their synteny methods (possibly Hypothesis 1). But in the end, a majority of the unclassified genes are now classified.

The shortfall of such an approach comes from the specific nature of the analysis. Most annotation procedures are highly automated (we are working with about 20,000 genes), but following up after the primary run is still useful. I personally am excited about this method, as it makes the dog genome annotation better.

The other great thing about this finding is that applying it to other genomes, including a second look at the high-quality genomes themselves, may reveal more genes and give a better understanding of homology and orthology for comparative genomics.

Tuesday, January 13, 2009

Which came first, the chicken or its developmental pattern?

Interestingly enough, a paper addressed this question without even looking at chickens. Roux and Robinson-Rechavi examined the level of genotypic constraint on certain genes involved in development in both zebrafish and mouse.

They were testing between two hypotheses. One, presented as an extension of an idea from an early embryologist, von Baer, was that constraint is strongest early in development, with more species-specific traits coming later. The second hypothesis is described as an hourglass, where early and late stages of development have low constraint and a middle period, known as the phylotypic stage, has a highly constrained genotype.


One problem with this type of question is how you measure constraint. Evolutionary rates like dN/dS had not worked in previous studies, and other methods (like the number of gene duplications) had not been used in vertebrates. So the authors used two separate types of data, ESTs and microarrays. ESTs (expressed sequence tags) are small bits of genes expressed at a given time in a certain tissue. Microarrays, at least here, measure whether or not a gene is expressed. They then compared the presence of essential genes (those that are lethal when removed from the animal before development) to that of other genes, to measure the relative level of tolerance each stage had. Both sets of data in both species gave a consistent result: the ratio of essential to other genes decreases linearly over developmental time in both mouse and zebrafish.
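As a rough sketch of that measurement, assuming each stage is summarized as the set of genes detected as expressed, the ratio and its trend could be computed like this. The gene names, the three stages, and the function names are invented toy values for illustration only; none of the numbers come from the paper.

```python
def essential_ratio(expressed, essential):
    """Ratio of essential to other genes among those expressed at a stage."""
    n_essential = len(expressed & essential)
    n_other = len(expressed - essential)
    if n_other == 0:
        raise ValueError("no non-essential genes expressed at this stage")
    return n_essential / n_other


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Toy data (made up): genes whose knockout is lethal, and genes seen per stage.
essential = {"g1", "g2", "g3"}
stages = [
    {"g1", "g2", "g3", "g4", "g5"},  # early
    {"g1", "g2", "g4", "g5"},        # middle
    {"g1", "g4", "g5"},              # late
]

ratios = [essential_ratio(stage, essential) for stage in stages]
trend = slope(list(range(len(stages))), ratios)
print(ratios, trend)
```

A steadily negative slope is what the early conservation model predicts; an hourglass would instead show the ratio peaking at the middle stage.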


This supports the early conservation model over the hourglass model, and so the chicken most likely came after its genotypic pattern of development.

Monday, November 3, 2008

HyPhy Vs PAML

This is my first post and, as such, I apologize in advance. This blog is for me, by me. I want to improve my science writing skills and my ability to think skeptically. I am not trying to be negative about the articles I review; I am just making an attempt to dig a little deeper and come up with scientific concerns I have. I will probably be wrong, and you are free to comment and correct me. This is my learning exercise, and I hope others will find it enjoyable as well. This first try, I realize, is not written for the lay person. I will attempt to do that in the future; mostly I just wanted to try my hand at it with a fairly low-key paper. Thanks for your patience, and if you find something here, great. If not, I hope I will :)

In a recent study, Cavatorta et al. (2008) attempt to compare site models of positive molecular selection, using the FEL model of HyPhy (Kosakovsky Pond and Frost, 2005) and the M2 and M8 models of PAML (Yang, 2007). The main question they ask is which method identifies resistance polymorphisms in the plant eIF4E gene more accurately.

This gene provides a great opportunity to answer the accuracy question because it avoids two usual problems: 1) simulated data, the normal first approach, can work but usually lacks biological realism, and 2) empirical data is difficult because the correct answer is not known. eIF4E is one of the most well-studied genes for resistance and so offers a background literature of functional data identifying resistance alleles in the gene. This offers a scenario in which the difficulties of empirical studies are overcome for the most part, at least in the opinion of the authors.

They do not put forth a hypothesis as to which method will do better but they do claim that the models should work.

So what are the results? After running both procedures with a 0.9 posterior probability cutoff, M8 identifies 2 sites and FEL identifies 3, with both identifying site 76. All three sites identified by FEL are known to confer resistance, while only site 76 among those identified by M8 is known to. They then conclude that FEL has higher power and greater precision.

What I find most unappealing in this conclusion is the low level of stringency used. While arbitrary, 0.95 is the conventional cutoff in statistics (the counterpart of a 0.05 p-value). If 0.9 was selected a priori, I guess it would weaken my argument, but no mention of this was made in the methods and the problem was not addressed. Both models identify site 76 with significant support (M8: 0.99 and FEL: 0.95), and no other site is detected by either method with the normal cutoff. The conclusion of precision and power is unfounded either way. The distinction between the two rests on a single case (one of the problems of empirical studies) that may favor one method over the other anyway. When normal statistical stringency is used, the identified sites are identical between the two methods.

I think this is a great study because it is getting at the problems of model comparison and accuracy. I am just not so sure about their conclusions, or whether it actually answers the question they were asking. Both methods did identify a site under selection that confers resistance, but when so many other sites (16 total) should also have given a signal, it is concerning that both methods have such low power.