bug in gff bulk loader (+ fix?)

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

bug in gff bulk loader (+ fix?)

Andrew Farmer
Hi all (mostly Scott, I guess)-
we've been using the gmod_bulk_load_gff3.pl script to load gene
annotations and have
noticed that for some of our genomes, a small number of duplicate
polypeptides are created;
that is, we end up with a chado in which some mRNA features have >1
autogenerated polypeptide
in a derives_from relationship with them. For a while, we took the
simple expedient and just did some
ad hoc post-process deletion of the redundant polypeptides (which were
causing trouble for some of
our downstream work). But I think we've managed to isolate what is
causing the problem and
have a possible fix (although someone who understands the logic of the
CDS handling better than I do
should probably verify that it isn't causing other problems I haven't
noticed yet).

In a nutshell, the problem seems to occur when features from different
genes have the same fmax
value (even when located on different source features). For example, I
found I was able to reproduce
the duplication problem with genes having these two components, which
both have an fmax of 18272:
Phavu|Chr05     phytozomev10    five_prime_UTR  18020   18272 .      
+       .
Phavu|scaffold_31       phytozomev10    CDS     18008   18272 .      
+       0
(will try to attach the full gene models for these as a test case if you
want to try to reproduce it)

I believe the problem is in the process_CDS method of the
Bio::GMOD::DB::Adapter module; it is using
a query based on min(fmax) to get the next feature set to process, and
although it seems to process
the features that it receives from multiple genes correctly (ie creating
a polypeptide for each, in the final step it only deletes the data for
one of the genes from the tmp table, leaving the other one to be
processed again. I think the following
patch solves it, although there are likely other solutions that may be

<     my $grandparent;
 >     my %unique_grandparents;
<         $grandparent  = $$feat_row{ grandparent_id };
 >         my $grandparent  = $$feat_row{ grandparent_id };
 >         $unique_grandparents{$grandparent} = 1;
<     $sth->execute($grandparent);
<     $dbh->commit;
 >     foreach my $grandparent (keys %unique_grandparents) {
 >         $sth->execute($grandparent);
 >         $dbh->commit;
 >     }

let me know if you need more info than I've provided, or if you think
I've gotten something muddled in
assessing the situation.


Andrew Farmer
(Legume Information System)

...all concepts in which an entire process is semiotically concentrated
elude definition; only that which has no history is definable.

Friedrich Nietzsche

Infragistics Professional
Build stunning WinForms apps today!
Reboot your WinForms applications with our WinForms controls.
Build a bridge from your legacy apps to the future.
Gmod-schema mailing list
[hidden email]

Pvulgaris_218_v1.0.gene_exons.gff3.mini (2K) Download Attachment