Guidance on loading multiple assemblies, feature naming, etc.


Re: Guidance on loading multiple assemblies, feature naming, etc.

Stephen Ficklin-3
Hi Olen,

I was thinking you could group, say, sequences using the featureprop table, with the term 'contained_in' from the relationship ontology as the property type. The value of the record could be the group or "collection" name.  Now that I think about it more, though, there is no index on the 'value' field of the property table, so it would not be a quick thing to query.
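
A minimal sketch of that idea, assuming a standard Chado schema on PostgreSQL. The cv name 'relationship', the collection name 'assembly_v2', and the index name are illustrative assumptions, not values from this thread. The expression index on md5(value) is one possible way around the missing index; it keeps index entries small even when value holds long text.

```sql
-- Hypothetical index so lookups by property type and value avoid a full scan.
CREATE INDEX featureprop_type_md5value_idx
    ON featureprop (type_id, md5(value));

-- List the features tagged as members of one (hypothetical) collection.
SELECT f.feature_id, f.uniquename
FROM feature f
JOIN featureprop fp ON fp.feature_id = f.feature_id
JOIN cvterm t       ON t.cvterm_id = fp.type_id
JOIN cv c           ON c.cv_id = t.cv_id
WHERE c.name = 'relationship'                  -- assumed cv name of the relationship ontology
  AND t.name = 'contained_in'
  AND md5(fp.value) = md5('assembly_v2')       -- matches the expression index
  AND fp.value = 'assembly_v2';                -- exact recheck on the raw value
```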

Stephen

-----Original Message-----
From: Olen Vance Sluder Jr [mailto:[hidden email]]
Sent: Thursday, April 21, 2011 12:47 PM
To: Stephen Ficklin
Cc: Naama Menda; GMOD Schema List
Subject: Re: [Gmod-schema] Guidance on loading multiple assemblies, feature naming, etc.

On Thu, Apr 21, 2011 at 10:50 AM, Stephen Ficklin wrote:
> I would consider each assembly an “analysis” so would it be best to use the
> analysisfeature table to group features together by their respective
> analysis (i.e. assembly)?

Thanks, Stephen. I had not considered using the Companalysis module. I
will look more closely at it as well as Andy's recommendation of the
Library module.
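
As a rough sketch of the analysisfeature approach, assuming the standard Chado analysis, analysisfeature, and feature tables: every literal below (assembly name, program, version, feature name) is a made-up placeholder.

```sql
-- Register the assembly as an analysis (all values are hypothetical).
INSERT INTO analysis (name, program, programversion, sourcename)
VALUES ('Assembly 2.0', 'assembler', '2.0', 'assembly_v2');

-- Attach an existing feature to that assembly.
INSERT INTO analysisfeature (analysis_id, feature_id)
SELECT a.analysis_id, f.feature_id
FROM analysis a
CROSS JOIN feature f
WHERE a.sourcename = 'assembly_v2'
  AND f.uniquename = 'chr1_v2';

-- List every feature grouped under that assembly.
SELECT a.name AS assembly, f.feature_id, f.uniquename
FROM analysis a
JOIN analysisfeature af ON af.analysis_id = a.analysis_id
JOIN feature f          ON f.feature_id = af.feature_id
WHERE a.sourcename = 'assembly_v2'
ORDER BY f.uniquename;
```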

> But in general… I’ve not been in on this topic until now, so forgive me if
> I’m stating something already brought up.  But for organizing items (e.g.
> properties, sequences organisms) into groups would it be best to use the
> relationship ontology?  Specifically the ‘contained_in’ term?

It hadn't come up yet, so a good point to me. Is that not the ontology
currently used to define the type of relationship in existing "linking
tables", e.g., feature_relationship?
--
Olen


------------------------------------------------------------------------------
Benefiting from Server Virtualization: Beyond Initial Workload
Consolidation -- Increasing the use of server virtualization is a top
priority.Virtualization can reduce costs, simplify management, and improve
application availability and disaster protection. Learn more about boosting
the value of server virtualization. http://p.sf.net/sfu/vmware-sfdev2dev
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Guidance on loading multiple assemblies, feature naming, etc.

Olen Vance Sluder Jr
In reply to this post by Scott Cain
On Thu, Apr 21, 2011 at 11:46 AM, Scott Cain wrote:
> The Chado adaptor will always be slower than SeqFeature::Store, unless
> I write a sophisticated view that creates and stores BioPerl
> SeqFeature objects in the Chado database (that is, completely recreate
> the SeqFeature::Store schema in Chado).  The question for any given
> user is whether they want to trade that slowness for running off the
> same database that is used for other tasks.  For some people, that
> answer is yes (and sometimes because they can throw enough computing
> power at the problem that it doesn't matter), and for others it's no.

I'm in a situation where I can throw hardware at it. I want to
minimize data redundancy and tighten integration among the tools,
i.e., Apollo and GBrowse (initially). I have not explored the
internals of either data adaptor, but wouldn't the speed largely be a
factor of how big a region one was trying to view? For example, a few
thousand or tens of thousands of base pairs should be relatively
snappy, while a whole chromosome would likely keep the database
server busy for a while.
--
Olen


Re: Guidance on loading multiple assemblies, feature naming, etc.

Scott Cain
The response time for the Chado adaptor scales linearly with the
number of features it has to return, so it scales roughly with the
size of the region as well.  SeqFeature::Store's response time also
grows with region size, but the curve is nearly flat.  That's because
the bulk of the time in the response is building BioPerl objects, which is why
SeqFeature::Store stores serialized BioPerl objects.  Also, both the
Chado adaptor and SeqFeature::Store support a "summary view" option
that kicks in when zoomed out above an admin-defined threshold, where
the display switches from showing individual glyphs to a feature
density plot.
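
To illustrate what a feature density summary amounts to, here is a small sketch against Chado's featureloc table. It is not how either adaptor actually computes its summary view; the srcfeature_id value and the 100 kb bin width are arbitrary assumptions.

```sql
-- Count features per fixed-width bin along one reference sequence.
-- Integer division truncates, so (fmin / 100000) * 100000 is the bin start.
SELECT (fl.fmin / 100000) * 100000 AS bin_start,
       count(*)                    AS feature_count
FROM featureloc fl
WHERE fl.srcfeature_id = 42        -- hypothetical chromosome feature_id
GROUP BY 1
ORDER BY 1;
```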

Scott





--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Guidance on loading multiple assemblies, feature naming, etc.

Yuri Bendana-3
In reply to this post by Scott Cain
Hi Scott,

This might help.  I made a simple nd_collection table, not officially part of NatDiv, that links to stock, experiment, and project.  It could be generalized to be part of a collection module.  The module should also include a collection_relationship table to relate one collection to another.  Here's an excerpt of the DDL:

CREATE TABLE nd_collection (
    nd_collection_id serial PRIMARY KEY,
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED 
);

CREATE TABLE nd_collectionprop (
    nd_collectionprop_id serial PRIMARY KEY,
    nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED,
    value text,
    units_id integer REFERENCES cvterm (cvterm_id) ON DELETE SET NULL INITIALLY DEFERRED,
    rank integer NOT NULL DEFAULT 0,
    CONSTRAINT nd_collectionprop_c1 UNIQUE (nd_collection_id,type_id,rank)
);

CREATE TABLE nd_collection_stock (
    nd_collection_stock_id serial PRIMARY KEY,
    nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
    stock_id integer NOT NULL REFERENCES stock ON DELETE CASCADE INITIALLY DEFERRED,
    CONSTRAINT nd_collection_stock_c1 UNIQUE (nd_collection_id,stock_id)
);
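
The collection_relationship table mentioned above is not part of the excerpt. A possible shape for it, following the subject/object/type pattern that Chado uses in tables such as feature_relationship, might look like this (table, column, and constraint names are assumptions):

```sql
-- Hypothetical: relates one collection (subject) to another (object).
CREATE TABLE nd_collection_relationship (
    nd_collection_relationship_id serial PRIMARY KEY,
    subject_id integer NOT NULL REFERENCES nd_collection (nd_collection_id) ON DELETE CASCADE INITIALLY DEFERRED,
    object_id integer NOT NULL REFERENCES nd_collection (nd_collection_id) ON DELETE CASCADE INITIALLY DEFERRED,
    -- type of relationship, e.g. a term from the relationship ontology
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED,
    CONSTRAINT nd_collection_relationship_c1 UNIQUE (subject_id, object_id, type_id)
);
```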

yuri

On Thu, Apr 21, 2011 at 9:51 AM, Scott Cain <[hidden email]> wrote:
Hi Naama,

Interesting idea--how would you see that working?  Would there be a
"collections" module that would consist of any "thingies" from
somewhere else in the database (identified by table and table_id) and
a relationship term tying them to the collection_id that groups them
together?
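
One way to picture such a generic design, purely as a sketch: the collection and collection_member tables and all of their columns are hypothetical. Because a member is identified by table name plus row id, the database cannot enforce the reference with a foreign key, which is the main trade-off against per-table linking tables.

```sql
-- Hypothetical generic collection.
CREATE TABLE collection (
    collection_id serial PRIMARY KEY,
    name text NOT NULL UNIQUE,
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED
);

-- Hypothetical membership: any row from any table, tied in by a relationship term.
CREATE TABLE collection_member (
    collection_member_id serial PRIMARY KEY,
    collection_id integer NOT NULL REFERENCES collection (collection_id) ON DELETE CASCADE INITIALLY DEFERRED,
    table_name text NOT NULL,      -- e.g. 'feature', 'organism', 'stock'
    table_id integer NOT NULL,     -- primary key of the row in table_name (not FK-enforced)
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED,
    CONSTRAINT collection_member_c1 UNIQUE (collection_id, table_name, table_id, type_id)
);
```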

Scott


On Thu, Apr 21, 2011 at 11:43 AM, Naama Menda <[hidden email]> wrote:
> There's been a lot of talk on this list about how to group different
> objects:
> projects, evidence, gene families, organisms, properties, and now sequences.
>
> Probably a good idea to start a new chado module for grouping objects.
>
> -Naama
>
>
> On Thu, Apr 21, 2011 at 11:17 AM, Olen Vance Sluder Jr <[hidden email]> wrote:
>>
>> Well, after a little more study of the Sequence Ontology (SO) and the
>> Chado schema, I decided I can't precisely represent what I want in
>> Chado currently, i.e., a SO sequence_collection or reference_genome.
>> Though Don's recommendation has its attractions, I will go the route
>> of encoding the reference genome in the uniquename of the feature per
>> Rob as it gets me around the bug in Apollo too.
>>
>> This does lead me to a question about extending the schema to store
>> information about a SO sequence_collection. Would such an extension
>> make sense as part of the existing Sequence module or a separate
>> module, e.g., Sequence_Collection? I'm leaning towards the latter and,
>> if it has utility beyond me, I can put together some DDL or an ERD for
>> comment.
>> --
>> Olen
>>
>>
>> On Mon, Apr 18, 2011 at 4:30 PM, Olen Vance Sluder Jr wrote:
>> > On Mon, Apr 18, 2011 at 1:50 PM, Robert Buels wrote:
>> >> On 04/18/2011 11:10 AM, Don Gilbert wrote:
>> >>> I've not done this, but think it would work with chado to handle
>> >>> multiple assemblies,
>> >>> of one species:  give each assembly a new organism/species ID/name.
>> >>>  Since all the
>> >>> chado features are tagged by organism id, that would let you go a
>> >>> simple route of
>> >>> not changing chromosome, feature IDs, but only the organism id.
>> >>
>> >> However, the downside of this is that you then have fake species in
>> >> your
>> >> organism table.  If you wanted to, for example, query for a summary of
>> >> what features you have for a given organism, you would have to figure
>> >> out what those organism rows would be.
>> >>
>> >> Perhaps you would need some sort of way of grouping organisms in that
>> >> case.  I seem to remember that idea being brought up before.
>> >
>> > Don,
>> >
>> > I considered that approach based upon a comment in the GMOD wiki on
>> > the Organism module <http://gmod.org/wiki/Chado_Organism_Module>:
>> >
>> > "If a particular strain or subspecies is to be represented, this is
>> > appended onto the species name."
>> >
>> > I could consider each assembly a new "strain", but then I reconsidered
>> > for exactly Rob's reasoning. That's when I started to dig into the SO
>> > and came across CV terms for assemblies, golden_paths, etc., got
>> > really confused, and wrote my original email.
>> >
>> > The feature naming issue is also related to a bug in Apollo accessing
>> > Chado that was discussed at the recent GMOD spring training and
>> > subsequent email with Ed Lee where Apollo does not properly handle
>> > multiple organisms (or assemblies) with non-unique feature names.
>> >
>> > I'm unsure how this will all work as far as configuring GBrowse2, but
>> > I haven't dug into that yet.
>> > --
>> > Olen
>> >



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Guidance on loading multiple assemblies, feature naming, etc.

Siddhartha Basu
In reply to this post by Scott Cain
Hi Scott,
In summary, is there any one particular reason that Chado range queries are expected
to be slower than the SeqFeature::Store model?
* Database design of Chado: is the slowness at the db level?
* Application level: does storing serialized BioPerl objects in the db
  itself make it fast?
* Or is it the nature of feature ranges themselves? Since they are
  non-contiguous, you need an algorithm like an overlap (interval) tree or NCList
  to get faster range queries. Does SeqFeature::Store employ any range
  tree?
* Can the Chado range queries be made faster, e.g., with materialized
  views or geospatial indexes?
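
For reference, a range query of the kind being discussed could be written against Chado's feature and featureloc tables roughly as follows. This is a sketch only, not the query the GBrowse adaptor issues; the srcfeature_id and coordinates are made up.

```sql
-- Features overlapping the interbase interval [100000, 150000) on one reference.
-- Two intervals overlap when each one starts before the other ends.
SELECT f.feature_id, f.uniquename, fl.fmin, fl.fmax, fl.strand
FROM featureloc fl
JOIN feature f ON f.feature_id = fl.feature_id
WHERE fl.srcfeature_id = 42     -- hypothetical chromosome feature_id
  AND fl.fmin < 150000
  AND fl.fmax > 100000
ORDER BY fl.fmin;
```

A B-tree index on (srcfeature_id, fmin, fmax), where one exists, can narrow only one of the two inequalities efficiently; the other has to be filtered row by row, which is where interval-style indexes come in.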

thanks,
-siddhartha




Re: Guidance on loading multiple assemblies, feature naming, etc.

Scott Cain
Hi Siddhartha,

All of the above, really.  Several things slow it down, and some could be addressed:

1. The design of Chado as a normalized data warehouse is good for safely storing data, but it means that the data adaptor needs to do queries that join multiple tables and/or make return trips to the database for more information.  A set of materialized views that mimic the Bio::DB::GFF schema would go a long way towards speeding it up.

2. As I mentioned before, a significant amount of processing time is devoted to creating BioPerl feature objects. This would be difficult to overcome short of writing a tool that would take all of the features in the feature table and create and store those objects. If such a tool existed (essentially a sophisticated materialized view tool) it would make the Chado adaptor blindingly fast (and might allow it to be subclassed off the existing SeqFeature::Store adaptor--a major win).

3. Range queries using standard B-tree indexes are not as fast as what can be had with GIS indexes. With older versions of Postgres, Chado had a set of functions to make use of R-tree indexes, but I don't think those are supported anymore. Updating those functions would help too.  SeqFeature::Store does not make use of these types of indexes, but I would be interested in implementing them in the Pg adaptor to see if it got noticeably faster.
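
Points 1 and 3 can be sketched together in SQL. This assumes a PostgreSQL release that provides CREATE MATERIALIZED VIEW and the int4range type (both arrived after this thread was written), and that fmin <= fmax holds for every featureloc row. The view name, column aliases, index name, and the 'chr1' sequence name are hypothetical, and nothing here addresses point 2, the cost of building BioPerl objects.

```sql
-- Flatten the joins the adaptor would otherwise repeat on every request.
CREATE MATERIALIZED VIEW feature_flat AS
SELECT f.feature_id,
       f.uniquename,
       t.name         AS type_name,
       src.uniquename AS seqid,
       fl.fmin,
       fl.fmax,
       fl.strand,
       fl.phase
FROM feature f
JOIN featureloc fl ON fl.feature_id = f.feature_id
JOIN feature src   ON src.feature_id = fl.srcfeature_id
JOIN cvterm t      ON t.cvterm_id = f.type_id;

-- GiST index over the feature span, standing in for the older R-tree functions.
CREATE INDEX feature_flat_span_idx
    ON feature_flat USING gist (int4range(fmin, fmax));

-- Overlap query: && is the range "overlaps" operator.
SELECT feature_id, uniquename, type_name, fmin, fmax, strand
FROM feature_flat
WHERE seqid = 'chr1'
  AND int4range(fmin, fmax) && int4range(100000, 150000);

-- The view is a snapshot; rebuild it after features are loaded or edited.
REFRESH MATERIALIZED VIEW feature_flat;
```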

It's a shame that I didn't write this out a month ago; I could have used it as an idea for a Google Summer of Code student, and it probably would have been better than the other Chado-related ideas I wrote.

Scott



--
Scott Cain, Ph. D.
scott at scottcain dot net
Ontario Institute for Cancer Research
http://gmod.org/
216 392 3087

Sent from my phone.



Re: Guidance on loading multiple assemblies, feature naming, etc.

Scott Cain
In reply to this post by Yuri Bendana-3
Hi Yuri,

That is pretty much along the lines of what I was thinking too, maybe with a collection_pub table too. 
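
A collection_pub table along those lines could follow the same linking-table pattern as Yuri's nd_collection_stock, pointing at Chado's pub table instead. The table and constraint names below are assumptions:

```sql
-- Hypothetical: links a collection to the publications that describe it.
CREATE TABLE nd_collection_pub (
    nd_collection_pub_id serial PRIMARY KEY,
    nd_collection_id integer NOT NULL REFERENCES nd_collection (nd_collection_id) ON DELETE CASCADE INITIALLY DEFERRED,
    pub_id integer NOT NULL REFERENCES pub (pub_id) ON DELETE CASCADE INITIALLY DEFERRED,
    CONSTRAINT nd_collection_pub_c1 UNIQUE (nd_collection_id, pub_id)
);
```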

Scott

--
Scott Cain, Ph. D.
scott at scottcain dot net
Ontario Institute for Cancer Research
216 392 3087 

Sent from my phone. 

On Apr 21, 2011, at 7:45 PM, Yuri Bendana <[hidden email]> wrote:

Hi Scott,

This might help.  I made a simple nd_collection table that is not officially part of NatDiv that links to stock,experiment and project.  It could be generalized to be part of a collection module.  The module should also include a collection_relationship table to relate one collection to another.  Here's an excerpt of the DDL:

CREATE TABLE nd_collection (
    nd_collection_id serial PRIMARY KEY,
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED 
);

CREATE TABLE nd_collectionprop (
    nd_collectionprop_id serial PRIMARY KEY,
    nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
    type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED,
    value text,
    units_id integer REFERENCES cvterm (cvterm_id) ON DELETE SET NULL INITIALLY DEFERRED,
    rank integer NOT NULL DEFAULT 0,
    CONSTRAINT nd_collectionprop_c1 UNIQUE (nd_collection_id,type_id,rank)
);

CREATE TABLE nd_collection_stock (
    nd_collection_stock_id serial PRIMARY KEY,
    nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
    stock_id integer NOT NULL REFERENCES stock ON DELETE CASCADE INITIALLY DEFERRED,
    CONSTRAINT nd_collection_stock_c1 UNIQUE (nd_collection_id,stock_id)
);

yuri

On Thu, Apr 21, 2011 at 9:51 AM, Scott Cain <[hidden email]> wrote:
Hi Naama,

Interesting idea--how would you see that working?  Would there be a
"collections" module that would consist of any "thingies" from
somewhere else in the database (identified by table and table_id) and
a relationship term tying them to the collection_id that groups them
together?

Scott


On Thu, Apr 21, 2011 at 11:43 AM, Naama Menda <[hidden email]> wrote:
> there's been a lot of talking on this list on how to  group different
> objects.
> Projects, evidences, gene families, organisms, properties, and now sequence.
>
> Probably a good idea to start a new chado module for grouping objects.
>
> -Naama
>
>
> On Thu, Apr 21, 2011 at 11:17 AM, Olen Vance Sluder Jr <[hidden email]> wrote:
>>
>> Well, after a little more study of the Sequence Ontology (SO) and the
>> Chado schema, I decided I can't precisely represent what I want in
>> Chado currently, i.e., a SO sequence_collection or reference_genome.
>> Though Don's recommendation has its attractions, I will go the route
>> of encoding the reference genome in the uniquename of the feature per
>> Rob as it gets me around the bug in Apollo too.
>>
>> This does lead me to a question about extending the schema to store
>> information about a SO sequence_collection. Would such an extension
>> make sense as part of the existing Sequence module or a separate
>> module, e.g., Sequence_Collection? I'm leaning towards the latter and,
>> if it has utility beyond me, I can put together some DDL or an ERD for
>> comment.
>> --
>> Olen
>>
>>
>> On Mon, Apr 18, 2011 at 4:30 PM, Olen Vance Sluder Jr wrote:
>> > On Mon, Apr 18, 2011 at 1:50 PM, Robert Buels wrote:
>> >> On 04/18/2011 11:10 AM, Don Gilbert wrote:
>> >>> I've not done this, but think it would work with chado to handle
>> >>> multiple assemblies,
>> >>> of one species:  give each assembly a new organism/species ID/name.
>> >>>  Since all the
>> >>> chado features are tagged by organism id, that would let you go a
>> >>> simple route of
>> >>> not changing chromosome, feature IDs, but only the organism id.
>> >>
>> >> However, the downside of this is that you then have fake species in
>> >> your
>> >> organism table.  If you wanted to, for example, query for a summary of
>> >> what features you have for a given organism, you would have to figure
>> >> out what those organism rows would be.
>> >>
>> >> Perhaps you would need some sort of way of grouping organisms in that
>> >> case.  I seem to remember that idea being brought up before.
>> >
>> > Don,
>> >
>> > I considered that approach based upon a comment in the GMOD wiki on
>> > the Organism module <http://gmod.org/wiki/Chado_Organism_Module>:
>> >
>> > "If a particular strain or subspecies is to be represented, this is
>> > appended onto the species name."
>> >
>> > I could consider each assembly a new "strain", but then I reconsidered
>> > for exactly Rob's reasoning. That's when I started to dig into the SO
>> > and came across CV terms for assemblies, golden_paths, etc., got
>> > really confused, and wrote my original email.
>> >
>> > The feature naming issue is also related to a bug in Apollo accessing
>> > Chado that was discussed at the recent GMOD spring training and
>> > subsequent email with Ed Lee where Apollo does not properly handle
>> > multiple organisms (or assemblies) with non-unique feature names.
>> >
>> > I'm unsure how this will all work as far as configuring GBrowse2, but
>> > I haven't dug into that yet.
>> > --
>> > Olen
>> >
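A rough sketch of the uniquename approach Olen settles on above, assuming a cut-down stand-in for the feature table that keeps only a unique constraint on (organism_id, uniquename, type_id), which to my understanding is what Chado's feature table enforces. The `assembly:name` convention, the `assembly_uniquename` and `load_chromosome` helpers, and the labels `asm_v1`/`asm_v2` are made up for illustration; SQLite is used only so the example runs anywhere.

```python
import sqlite3

# Simplified stand-in for Chado's feature table; only the columns needed
# to show the uniqueness constraint are kept.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature (
        feature_id  INTEGER PRIMARY KEY,
        organism_id INTEGER NOT NULL,
        type_id     INTEGER NOT NULL,
        name        TEXT,
        uniquename  TEXT NOT NULL,
        UNIQUE (organism_id, uniquename, type_id)
    )
""")


def assembly_uniquename(assembly, name):
    """Hypothetical convention: prefix the feature name with the assembly label."""
    return f"{assembly}:{name}"


def load_chromosome(assembly, name, organism_id=1, type_id=1, prefix=True):
    """Insert one chromosome row, optionally encoding the assembly in uniquename."""
    uniquename = assembly_uniquename(assembly, name) if prefix else name
    conn.execute(
        "INSERT INTO feature (organism_id, type_id, name, uniquename) "
        "VALUES (?, ?, ?, ?)",
        (organism_id, type_id, name, uniquename),
    )
    return uniquename


# Two assemblies of the same organism, both with a chromosome named chr1:
# the prefixed uniquenames differ, so both rows load.
load_chromosome("asm_v1", "chr1")
load_chromosome("asm_v2", "chr1")

# Without the prefix, the second assembly's chr2 violates the constraint.
load_chromosome("asm_v1", "chr2", prefix=False)
try:
    load_chromosome("asm_v2", "chr2", prefix=False)
    collided = False
except sqlite3.IntegrityError:
    collided = True

names = [row[0] for row in
         conn.execute("SELECT uniquename FROM feature ORDER BY feature_id")]
```

The organism table stays free of "fake species" this way, at the cost of every client having to know the naming convention in order to recover the assembly from a uniquename.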



--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


Re: Guidance on loading multiple assemblies, feature naming, etc.

Robert Buels
Since it is not part of ND, its name should not be prefixed with nd_.

Though table prefixes are a really good habit for Chado to start getting
into, I think.

So, what prefix should be put on that table, think y'all?

Rob

P.S.  Think y'all is a wonderful expression; I shall endeavor to
popularize its use.


On 04/22/2011 08:54 AM, Scott Cain wrote:

> Hi Yuri,
>
> That is pretty much along the lines of what I was thinking too, maybe
> with a collection_pub table too.
>
> Scott
>
> --
> Scott Cain, Ph. D.
> scott at scottcain dot net
> Ontario Institute for Cancer Research
> http://gmod.org/
> 216 392 3087
>
> Sent from my phone.
>
> On Apr 21, 2011, at 7:45 PM, Yuri Bendana <[hidden email]> wrote:
>
>> Hi Scott,
>>
>> This might help. I made a simple nd_collection table that is not
>> officially part of NatDiv and that links to stock, experiment, and
>> project. It could be generalized to be part of a collection module.
>> The module should also include a collection_relationship table to
>> relate one collection to another. Here's an excerpt of the DDL:
>>
>> CREATE TABLE nd_collection (
>>     nd_collection_id serial PRIMARY KEY,
>>     type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED
>> );
>>
>> CREATE TABLE nd_collectionprop (
>>     nd_collectionprop_id serial PRIMARY KEY,
>>     nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
>>     type_id integer NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE INITIALLY DEFERRED,
>>     value text,
>>     units_id integer REFERENCES cvterm (cvterm_id) ON DELETE SET NULL INITIALLY DEFERRED,
>>     rank integer NOT NULL DEFAULT 0,
>>     CONSTRAINT nd_collectionprop_c1 UNIQUE (nd_collection_id, type_id, rank)
>> );
>>
>> CREATE TABLE nd_collection_stock (
>>     nd_collection_stock_id serial PRIMARY KEY,
>>     nd_collection_id integer NOT NULL REFERENCES nd_collection ON DELETE CASCADE INITIALLY DEFERRED,
>>     stock_id integer NOT NULL REFERENCES stock ON DELETE CASCADE INITIALLY DEFERRED,
>>     CONSTRAINT nd_collection_stock_c1 UNIQUE (nd_collection_id, stock_id)
>> );
>>
>> yuri
>>
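Yuri mentions a collection_relationship table but the excerpt does not include one. A minimal sketch of how it might look, assuming it follows the subject_id / object_id / type_id pattern of Chado's existing *_relationship tables. The unprefixed table names, the `name` column on `collection`, the stripped-down `cvterm`, and the `sub_collections` helper are all assumptions for illustration; SQLite stands in for PostgreSQL so the example is runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Stripped-down cvterm: the real table has more columns.
    CREATE TABLE cvterm (
        cvterm_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    -- Hypothetical generalized collection table (name column is an assumption).
    CREATE TABLE collection (
        collection_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL UNIQUE,
        type_id       INTEGER NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE
    );
    -- Hypothetical relationship table: subject <type> object.
    CREATE TABLE collection_relationship (
        collection_relationship_id INTEGER PRIMARY KEY,
        subject_id INTEGER NOT NULL REFERENCES collection (collection_id) ON DELETE CASCADE,
        object_id  INTEGER NOT NULL REFERENCES collection (collection_id) ON DELETE CASCADE,
        type_id    INTEGER NOT NULL REFERENCES cvterm (cvterm_id) ON DELETE CASCADE,
        UNIQUE (subject_id, object_id, type_id)
    );
""")

conn.executemany(
    "INSERT INTO cvterm (cvterm_id, name) VALUES (?, ?)",
    [(1, "sequence_collection"), (2, "contained_in")],
)
conn.executemany(
    "INSERT INTO collection (collection_id, name, type_id) VALUES (?, ?, 1)",
    [(1, "all_assemblies"), (2, "asm_v1"), (3, "asm_v2")],
)
# asm_v1 contained_in all_assemblies, asm_v2 contained_in all_assemblies.
conn.executemany(
    "INSERT INTO collection_relationship (subject_id, object_id, type_id) "
    "VALUES (?, ?, 2)",
    [(2, 1), (3, 1)],
)


def sub_collections(parent_name):
    """Names of collections directly contained_in the named collection."""
    rows = conn.execute(
        """
        SELECT child.name
        FROM collection_relationship rel
        JOIN collection child  ON child.collection_id  = rel.subject_id
        JOIN collection parent ON parent.collection_id = rel.object_id
        JOIN cvterm t          ON t.cvterm_id          = rel.type_id
        WHERE parent.name = ? AND t.name = 'contained_in'
        ORDER BY child.name
        """,
        (parent_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling `sub_collections("all_assemblies")` lists the two assembly collections; typing the relationship with a cvterm keeps the table open to relations other than containment.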
>> On Thu, Apr 21, 2011 at 9:51 AM, Scott Cain <[hidden email]> wrote:
>>
>>     Hi Naama,
>>
>>     Interesting idea--how would you see that working? Would there be a
>>     "collections" module that would consist of any "thingies" from
>>     somewhere else in the database (identified by table and table_id) and
>>     a relationship term tying them to the collection_id that groups them
>>     together?
>>
>>     Scott
>>
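One way to picture the generic design Scott asks about, as a sketch only: a single membership table that identifies each "thingy" by table name and row id and ties it to a collection with a relationship term. The `collection_member` table, its columns, and the `add_member` and `members` helpers are all hypothetical, and SQLite is used just to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection (
        collection_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL UNIQUE
    );
    -- Hypothetical generic membership table: the member is identified by
    -- (table_name, record_id), so no foreign key can check that it exists.
    CREATE TABLE collection_member (
        collection_member_id INTEGER PRIMARY KEY,
        collection_id INTEGER NOT NULL REFERENCES collection (collection_id),
        table_name    TEXT NOT NULL,
        record_id     INTEGER NOT NULL,
        relationship  TEXT NOT NULL DEFAULT 'contained_in',
        UNIQUE (collection_id, table_name, record_id, relationship)
    );
""")


def add_member(collection, table_name, record_id):
    """Add a row from any table to the named collection, creating it if needed."""
    conn.execute("INSERT OR IGNORE INTO collection (name) VALUES (?)", (collection,))
    (collection_id,) = conn.execute(
        "SELECT collection_id FROM collection WHERE name = ?", (collection,)
    ).fetchone()
    conn.execute(
        "INSERT INTO collection_member (collection_id, table_name, record_id) "
        "VALUES (?, ?, ?)",
        (collection_id, table_name, record_id),
    )


def members(collection, table_name=None):
    """List (table_name, record_id) pairs in a collection, optionally for one table."""
    sql = (
        "SELECT m.table_name, m.record_id "
        "FROM collection_member m "
        "JOIN collection c ON c.collection_id = m.collection_id "
        "WHERE c.name = ?"
    )
    params = [collection]
    if table_name is not None:
        sql += " AND m.table_name = ?"
        params.append(table_name)
    sql += " ORDER BY m.table_name, m.record_id"
    return conn.execute(sql, params).fetchall()


add_member("reference_genome_v1", "feature", 10)
add_member("reference_genome_v1", "feature", 11)
add_member("reference_genome_v1", "organism", 3)
```

The trade-off against per-table linking tables such as the quoted nd_collection_stock is referential integrity: a (table_name, record_id) pair cannot be enforced by a foreign key, so dangling members would have to be caught by triggers or application code.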
>>
>>     On Thu, Apr 21, 2011 at 11:43 AM, Naama Menda <[hidden email]> wrote:
>>     > There's been a lot of talk on this list about how to group
>>     > different objects: projects, evidence, gene families, organisms,
>>     > properties, and now sequences.
>>     >
>>     > It is probably a good idea to start a new Chado module for
>>     > grouping objects.
>>     >
>>     > -Naama
>>
>>     --
>>     ------------------------------------------------------------------------
>>     Scott Cain, Ph. D. scott at scottcain dot net
>>     GMOD Coordinator (http://gmod.org/) 216-392-3087
>>     Ontario Institute for Cancer Research