bug in gff bulk loader (+ fix?)

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

bug in gff bulk loader (+ fix?)

Andrew Farmer
Hi all (mostly Scott, I guess)-
we've been using the gmod_bulk_load_gff3.pl script to load gene
annotations and have
noticed that for some of our genomes, a small number of duplicate
polypeptides are created;
that is, we end up with a chado in which some mRNA features have >1
autogenerated polypeptide
in a derives_from relationship with them. For a while, we took the
simple expedient and just did some
ad hoc post-process deletion of the redundant polypeptides (which were
causing trouble for some of
our downstream work). But I think we've managed to isolate what is
causing the problem and
have a possible fix (although someone who understands the logic of the
CDS handling better than I do
should probably verify that it isn't causing other problems I haven't
noticed yet).

In a nutshell, the problem seems to occur when features from different
genes have the same fmax
value (even when located on different source features). For example, I
found I was able to reproduce
the duplication problem with genes having these two components, which
both have an fmax of 18272:
Phavu|Chr05     phytozomev10    five_prime_UTR  18020   18272 .      
+       .
ID=Phavu|Phvul.005G000300.1.v1.0.five_prime_UTR.1;Parent=Phavu|Phvul.005G000300.1.v1.0;pacid=27148751
Phavu|scaffold_31       phytozomev10    CDS     18008   18272 .      
+       0
ID=Phavu|Phvul.L002600.1.v1.0.CDS.1;Parent=Phavu|Phvul.L002600.1.v1.0;pacid=27164917
(will try to attach the full gene models for these as a test case if you
want to try to reproduce it)

I believe the problem is in the process_CDS method of the
Bio::GMOD::DB::Adapter module; it is using
a query based on min(fmax) to get the next feature set to process, and
although it seems to process
the features that it receives from multiple genes correctly (ie creating
a polypeptide for each, in the final step it only deletes the data for
one of the genes from the tmp table, leaving the other one to be
processed again. I think the following
patch solves it, although there are likely other solutions that may be
preferable:

4100c4100
<     my $grandparent;
---
 >     my %unique_grandparents;
4103c4103,4104
<         $grandparent  = $$feat_row{ grandparent_id };
---
 >         my $grandparent  = $$feat_row{ grandparent_id };
 >         $unique_grandparents{$grandparent} = 1;
4225,4226c4226,4229
<     $sth->execute($grandparent);
<     $dbh->commit;
---
 >     foreach my $grandparent (keys %unique_grandparents) {
 >         $sth->execute($grandparent);
 >         $dbh->commit;
 >     }


let me know if you need more info than I've provided, or if you think
I've gotten something muddled in
assessing the situation.

regards

Andrew Farmer
(Legume Information System)




--
...all concepts in which an entire process is semiotically concentrated
elude definition; only that which has no history is definable.

Friedrich Nietzsche


------------------------------------------------------------------------------
Infragistics Professional
Build stunning WinForms apps today!
Reboot your WinForms applications with our WinForms controls.
Build a bridge from your legacy apps to the future.
http://pubads.g.doubleclick.net/gampad/clk?id=153845071&iu=/4140/ostg.clktrk
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Pvulgaris_218_v1.0.gene_exons.gff3.mini (2K) Download Attachment