How to model split genes

How to model split genes

Chris Childers
Hi all,

We are currently working on ways to store and handle split genes.  The simplest case is a gene that is split across scaffolds, though the group has found models that are significantly more complicated than that (a gene split across multiple scaffolds, including internal exons found on other scaffolds).

An earlier query to the listserv from 2011 gave me a lot to work with.

I put together some simple tests, and modeling a gene across two scaffolds in GFF3 works: the loader completes without errors.  The main problem appears to be how to model the ordering information so that the gene loads properly and we can then retrieve the information in a meaningful way.

I was thinking of using the feature_relationship rank column to manage subfeature sorting, based on the guide, but have not found a tag that I can use to specify the relative rank of the subfeatures as part of the entire gene.

Is there something I can do to set the rank at load time, such as reorganizing the gff, or would it be a matter of post-hoc updating the ranks after loading?
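
One way to explore the "reorganizing the gff" idea is to treat the order in which subfeatures appear in the file as the candidate rank. The sketch below is only an illustration under that assumption: the function name children_in_file_order, the scaffold names, and the feature IDs are hypothetical, and nothing here establishes that the loader keeps file order as rank.

```python
from collections import defaultdict


def children_in_file_order(gff3_text):
    """Group GFF3 subfeatures by Parent and number them in file order.

    Returns {parent_id: [(rank, (child_id, seqid, start, end)), ...]}.
    The rank is simply the 0-based order of appearance in the file.
    """
    order = defaultdict(list)
    for line in gff3_text.splitlines():
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        # A feature may list several comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                order[parent].append(
                    (attrs.get("ID"), cols[0], int(cols[3]), int(cols[4]))
                )
    return {parent: list(enumerate(kids)) for parent, kids in order.items()}


# Hypothetical two-exon mRNA whose exons sit on different scaffolds.
rows = [
    ("scaffold_87", "exon", 1200, 1530, "+", "ID=exon1;Parent=mRNA1"),
    ("scaffold_12", "exon", 48200, 48420, "+", "ID=exon2;Parent=mRNA1"),
]
gff3 = "##gff-version 3\n" + "\n".join(
    "\t".join([seqid, ".", ftype, str(start), str(end), ".", strand, ".", attrs])
    for seqid, ftype, start, end, strand, attrs in rows
)
ranked = children_in_file_order(gff3)
```

Here the biological order is carried by the order of the lines rather than by coordinates, since coordinates on different scaffolds cannot be compared.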


Thanks,
Chris


------------------------------------------------------------------------------

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: How to model split genes

Peter Cock
On Wed, Nov 5, 2014 at 8:58 PM, Chris Childers <[hidden email]> wrote:

> Hi all,
>
> We are currently working on ways to store and handle split genes.  The
> simplest case is a gene that is split across scaffolds, though the group has
> found models that are significantly more complicated than that (a gene split
> across multiple scaffolds, including internal exons found on other
> scaffolds).
>
> An earlier query to the listserv from 2011 gave me a lot to work with.
>
> I put together some simple tests, and modeling a gene across two scaffolds
> in GFF3 works: the loader completes without errors.  The main problem
> appears to be how to model the ordering information so that the gene loads
> properly and we can then retrieve the information in a meaningful way.

This is what the proposed new GFF3 'Part' tag would allow - use cases
like trans-splicing or genes split between contigs do need this ordering
information.

> I was thinking of using the feature_relationship rank column to manage
> subfeature sorting, based on the guide, but have not found a tag that I can
> use to specify the relative rank of the subfeatures as part of the entire
> gene.
>
> Is there something I can do to set the rank at load time, such as
> reorganizing the gff, or would it be a matter of post-hoc updating the ranks
> after loading?

Right now I fear you will be stuck with post-hoc ad-hoc code :(
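
As a rough illustration of what such post-hoc code might look like, the sketch below writes a curated subfeature order into the rank column of feature_relationship. It runs against a toy in-memory SQLite stand-in that has only the columns it needs, not a real Chado instance. The helper names, the feature uniquenames, and the assumption that a uniquename identifies a single feature are all hypothetical.

```python
import sqlite3


def assign_ranks(ordered_part_ids):
    """Map each subfeature ID to its 0-based position in the curated order."""
    return {part_id: rank for rank, part_id in enumerate(ordered_part_ids)}


def update_ranks(conn, parent_uniquename, ordered_part_ids):
    """Store the curated order in feature_relationship.rank for one parent.

    In Chado the child is the subject and the parent is the object
    of the relationship, which is what the WHERE clause relies on.
    """
    sql = (
        "UPDATE feature_relationship SET rank = ? "
        "WHERE subject_id = (SELECT feature_id FROM feature WHERE uniquename = ?) "
        "AND object_id = (SELECT feature_id FROM feature WHERE uniquename = ?)"
    )
    ranks = assign_ranks(ordered_part_ids)
    with conn:  # commit on success, roll back on error
        for part_id, rank in ranks.items():
            conn.execute(sql, (rank, part_id, parent_uniquename))
    return ranks


# Toy stand-in for the two Chado tables involved (assumed subset of columns).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT);
    CREATE TABLE feature_relationship (
        subject_id INTEGER, object_id INTEGER, rank INTEGER DEFAULT 0);
    """
)
conn.executemany(
    "INSERT INTO feature VALUES (?, ?)",
    [(1, "mRNA1"), (2, "exonA"), (3, "exonB"), (4, "exonC")],
)
conn.executemany(
    "INSERT INTO feature_relationship (subject_id, object_id) VALUES (?, ?)",
    [(2, 1), (3, 1), (4, 1)],
)

# Curated order: exonB first, then exonC, then exonA.
update_ranks(conn, "mRNA1", ["exonB", "exonC", "exonA"])
rows = conn.execute(
    "SELECT f.uniquename, fr.rank FROM feature_relationship fr "
    "JOIN feature f ON f.feature_id = fr.subject_id ORDER BY fr.rank"
).fetchall()
```

The curated order has to come from somewhere outside the GFF3 file (for example curator notes), because the file itself has no standard tag for it.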

Maybe this will get the ball rolling again on extending the GFF3 specification?

Regards,

Peter


Re: How to model split genes

Chris Childers
Hi Peter,


I thought your thread back in 2011 was useful to help frame the problem and some potential solutions, and I like the idea of reserving a Part tag to optionally specify the order of subfeatures within a feature. 

Has the Part tag gained any traction with the gff3 specification committee?

Best,
Chris



Re: How to model split genes

Siddhartha Basu
In reply to this post by Chris Childers
Hi Chris,
How do you represent a split gene in GFF3 whose start and end are spread
across two scaffolds/contigs? Could you share an example?

thanks,
-siddhartha


Re: How to model split genes

Chris Mungall
In reply to this post by Peter Cock
As a general approach, rather than patching on top of existing formats, why
not first figure out the normative representation in FALDO, and then,
once this specification is clear, work out how it maps to existing
formats (patched or unpatched) and database schemas?

In this particular case, the FALDO representation should be clear, as
each position is modeled as a distinct object, with no constraints that
two positions in the same interval need be on the same reference.
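
To make the idea concrete, here is a minimal sketch in plain Python rather than RDF. The field names loosely mirror the FALDO terms begin, end, position and reference, but the classes, the method spans_references, and the scaffold names and coordinates are hypothetical illustrations, not part of FALDO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """A single point that carries its own reference sequence."""
    reference: str  # e.g. a scaffold or contig identifier
    position: int   # coordinate on that reference


@dataclass(frozen=True)
class Region:
    """An interval whose two ends are independent Position objects."""
    begin: Position
    end: Position

    def spans_references(self) -> bool:
        # Nothing forces begin and end onto the same reference,
        # so a split gene is just a region where they differ.
        return self.begin.reference != self.end.reference


# Hypothetical gene that starts on one scaffold and ends on another.
split_gene = Region(Position("scaffold_12", 48200), Position("scaffold_87", 1530))
# Ordinary gene contained within a single scaffold.
whole_gene = Region(Position("scaffold_12", 100), Position("scaffold_12", 900))
```

Because each end names its own reference, the split case needs no special handling in the data model; the open question is only how to map it onto GFF3 and Chado.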


Re: How to model split genes

Siddhartha Basu
I had a quick glance at FALDO; however, IMHO it is not clear how to represent
this concept with it. In fact, it raises more questions than answers. Is it
supposed to replace or complement GFF3? To me, it looks like it needs
background knowledge about semantic representation (RDF/triples/Turtle etc.).
And I don't see any software that interfaces with FALDO. Is it
supposed to be written by hand?

-siddhartha


Re: How to model split genes

Chris Childers
Hi all,

I was not familiar with FALDO, but have had an opportunity to briefly look at the FALDO specifications and the preprint on BioRxiv.

Has anyone involved with the FALDO project heard of the generation of a model such as what Chris M. suggested? Were there any models generated during the creation of the ontology that could be developed into a normative, or at least generalized model?

Hi Siddhartha, in terms of modeling a feature split across scaffolds, our current method is to ask curators to add Notes explaining how the feature is split, providing IDs for the scaffolds involved and as much additional information as possible.  Additional information may include exon numbers, sizes and locations for the model in each mRNA feature, or even a reconstructed peptide sequence if one could be generated.  I'm just testing different approaches, with the goal of not losing this valuable data.  Chado can model these features, but the trick becomes getting the data into Chado correctly, then updating requests to make sure the data is coming out right. I'm still looking into getting the data into Chado so that the relative positions are not lost or incorrect.

Peter's suggestion of a new reserved tag "Part" would go a long way to support split models.  Peter, have you heard any news about the possibility of an update to the GFF3 specification to include a Part tag?

Thanks,
Chris




Re: How to model split genes

Peter Cock
On Wed, Nov 12, 2014 at 11:32 PM, Chris Childers <[hidden email]> wrote:
> Hi all,
>
> I was not familiar with FALDO, but have had an opportunity to briefly look
> at the FALDO specifications and the preprint on BioRxiv.

There is ongoing work on turning the GenBank/DDBJ/EMBL style INSDC
feature tables into RDF using FALDO, last discussed at an RDF summit
this May - and there is some FALDO discussion going on this week at
the BioHackathon 2014:

https://github.com/dbcls/rdfsummit/wiki
http://2014.biohackathon.org

One of the specific things we hope to do this week is clarify the
representation of multi-part locations in the FALDO paper (which
is on GitHub in addition to the BioRxiv preprint):

https://github.com/JervenBolleman/FALDO-paper
http://dx.doi.org/10.1101/002121
http://biorxiv.org/content/early/2014/01/31/002121

> Has anyone involved with the FALDO project heard of the generation of a
> model such as what Chris M. suggested? Were there any models generated
> during the creation of the ontology that could be developed into a
> normative, or at least generalized model?

No, but I'm not sure you actually need this if you already have an
explicit mental model of the gene structure (even if GFF3 cannot
yet capture the order information).

> Hi Siddhartha, in terms of modeling a feature split across scaffolds, our
> current method is to ask curators to add Notes explaining how the
> feature is split, providing IDs for the scaffolds involved and as much
> additional information as possible.  Additional information may include exon
> numbers, sizes and locations for the model in each mRNA feature, or even a
> reconstructed peptide sequence if one could be generated.  I'm just testing
> different approaches, with the goal of not losing this valuable data.  Chado
> can model these features, but the trick becomes getting the data into Chado
> correctly, then updating requests to make sure the data is coming out
> right. I'm still looking into getting the data into Chado so that the
> relative positions are not lost or incorrect.
>
> Peter's suggestion of a new reserved tag "Part" would go a long way to
> support split models.  Peter, have you heard any news about the possibility
> of an update to the GFF3 specification to include a Part tag?
>
> Thanks,
> Chris

I've not heard any recent discussion on the "Part" tag proposal for
GFF3, and am unsure how to push this further (other than forwarding
this kind of email to the song-devel mailing list which is where GFF3
discussion normally happens?).

Shall we continue this discussion there (song-devel)?

Regards,

Peter


Re: How to model split genes

Chris Childers
Hi Peter,

Thanks for the updates.  I'm glad that there is active movement on this, and I am happy to move over to song-devel.  I've just now subscribed, so it might be a while before I'm able to post over there.

Best,
Chris
