standardized database and CV usage

standardized database and CV usage

Sanjuro Jogdeo-2
Hello Chadonistas, 

I'm working on a project that involves extensive field collections of wild strawberry and subsequent genotyping and phenotyping of specific progeny.  We're using Tripal/Chado to store and provide access to the data and I've installed the ND Genotypes Extension for Tripal, which has been very helpful.  I have a few questions related to this type of usage.

1.  We have a set of genotypes that are based on ~1200bp sequences.  If I understand the ND Extension correctly, the sequence of each stock's genotype is supposed to go into genotype.description, which is a varchar(255) field.  If there is a reference genotype, the sequence goes into feature.residues, which is a text field.  Do I need to change the genotype.description data type to text in order to store the ~1200bp sequences?  Or are there other ways to accommodate longer sequences that are associated with genotypes?  Or am I misunderstanding how I'm supposed to be using the tables?

2.  My other question is about CV terms used for storing natural diversity data.  I was thinking it would be nice to match as closely as possible the data structure and CV usage of other similar databases (esp. GDR?).  It seems like using the ND Extension will help make our database usage consistent with others, but I'm still not sure about the CV terms.  Is it feasible to try to be consistent with other databases?  Or should I just come up with my own CV terms when I can't find suitable terms in an ontology, and figure out data mapping with other databases later?

3.  Speaking of ontologies, I've found the PATO and Plant Trait ontologies mentioned here <http://wiki.obofoundry.org/wiki/index.php/PATO:Main_Page>.  Are there other plant ontologies that you've found useful?

Any input would be great!  Thanks!

Sanjuro

_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: standardized database and CV usage

Chris Mungall-5
Hi Sanjuro,

Here are some plant-specific ontologies you may find useful (the list
may not be complete):
http://planteome.org/node/1

There are various species-specific crop ontologies but I'm not aware of
a strawberry one.

On 14 May 2015, at 8:21, Sanjuro Jogdeo wrote:

> 3.  Speaking of ontologies, I've found the PATO and Plant Trait
> ontologies
> mentioned here
> <http://wiki.obofoundry.org/wiki/index.php/PATO:Main_Page>.
> Are there other plant ontologies that you've found useful?


Re: standardized database and CV usage

Sook Jung
In reply to this post by Sanjuro Jogdeo-2
Hi Sanjuro,

Please see below. 

On Thu, May 14, 2015 at 11:21 AM, Sanjuro Jogdeo <[hidden email]> wrote:
> 1.  We have a set of genotypes that are based on ~1200bp sequences.  If I understand the ND Extension correctly, the sequence of each stock's genotype is supposed to go into genotype.description, which is a varchar(255) field.  If there is a reference genotype, the sequence goes into feature.residues, which is a text field.  Do I need to change genotype.description data type to text in order to store the ~1200bp sequences?  Or is there some other ways to accommodate longer sequences that are associated with genotypes?  Or am I misunderstanding how I'm supposed to be using the tables?

I think genotype.description is meant to store alleles or haplotypes (or a combination of alleles/haplotypes, i.e. a genotype), such as product size for SSRs, SNP alleles, and haplotype names, not the actual sequences. Are you planning to examine the 1200 bp sequences to identify haplotypes and/or find SNPs to classify the wild strawberry collections? If so, you could store those genotypes in genotype.description. I'm not sure it is meant for storing actual sequences.

If you do need to store all those sequences, I'm not sure what the best solution would be. Other people who use ND may have something to say?
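
To illustrate the distinction drawn above, here is a minimal Python sketch: identical sequences are collapsed into named haplotypes, so that only the short name (not the ~1200 bp sequence) would need to fit in genotype.description. The `hap_N` naming scheme, the function name, and the sample data are assumptions made for illustration; none of them come from the ND extension.

```python
def assign_haplotypes(sequences):
    """Map each stock to a haplotype name; identical sequences share a name.

    `sequences` maps a stock name to its sequence string.
    Returns (stock -> haplotype name, haplotype name -> sequence).
    """
    names_by_seq = {}
    stock_to_hap = {}
    for stock in sorted(sequences):  # sorted so numbering is reproducible
        seq = sequences[stock].upper()  # treat case differences as identical
        if seq not in names_by_seq:
            names_by_seq[seq] = f"hap_{len(names_by_seq) + 1}"
        stock_to_hap[stock] = names_by_seq[seq]
    hap_to_seq = {name: seq for seq, name in names_by_seq.items()}
    return stock_to_hap, hap_to_seq


# Hypothetical stocks with short stand-in sequences.
example = {"stock_A": "ACGTAC", "stock_B": "ACGTTC", "stock_C": "acgtac"}
stock_to_hap, hap_to_seq = assign_haplotypes(example)
```

With this approach the haplotype name is what gets loaded as the genotype, and the full sequence for each haplotype has to be kept somewhere else.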


> 2.  My other question is about CV terms used for storing natural diversity data.  I was thinking it would be nice to match as closely as possible the data structure and CV usage of other similar databases (esp. GDR?).  It seems like using the ND Extension will help make our database usage consistent with others, but I'm still not sure about the CV terms.  Is it feasible to try to be consistent with other databases?  Or should I just come up with my own CV terms when I can't find suitable terms in an ontology and figure out data mapping with other databases later.

For now we (GDR) use TO to describe Rosaceae traits, and we are also in the process of developing a CO (Crop Ontology) specific to Rosaceae. We use SO as well. For all other terms (in various prop tables), we just use our own controlled vocabulary. We are developing Tripal modules to cover genotype/phenotype/marker/map/QTL in collaboration with other databases, and the cvterms for those modules will be more standardized by then (more input/work is needed). Since you are working on Fragaria, it would be nice if we could work together.


Re: standardized database and CV usage

Sook Jung
Hi Sanjuro,
I was thinking about how you could store all the sequences for the various germplasm collections, and this could be a possibility. You can store all the sequences in the feature table, link them to germplasm through feature_stock, link all the features to the reference sequence through the feature_relationship table, and then link all the features to one analysis (analysis table). The analysis can be linked to a project (I think Stephen proposed adding an analysis_project table?) that is also linked to a project for phenotypes of the same germplasm. Multiple projects can belong to one super-project using the project_relationship table.
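
The linking scheme described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the tables are heavily simplified stand-ins for the Chado tables (real Chado uses type_id references to cvterm and many more columns), the feature_stock link table is taken from the message as written and may not exist under that name in a given Chado install, and the `variant_of` relationship name and all sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for Chado tables; not the real column sets.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    residues   TEXT
);
CREATE TABLE stock (
    stock_id   INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL
);
-- Link table named as in the message; hypothetical here.
CREATE TABLE feature_stock (
    feature_id INTEGER REFERENCES feature(feature_id),
    stock_id   INTEGER REFERENCES stock(stock_id)
);
CREATE TABLE feature_relationship (
    subject_id INTEGER REFERENCES feature(feature_id),
    object_id  INTEGER REFERENCES feature(feature_id),
    type_name  TEXT  -- Chado would use type_id -> cvterm instead
);
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE analysisfeature (
    analysis_id INTEGER REFERENCES analysis(analysis_id),
    feature_id  INTEGER REFERENCES feature(feature_id)
);
""")

# One reference feature plus one sample feature per stock (made-up data).
conn.executemany("INSERT INTO feature VALUES (?, ?, ?)", [
    (1, "ref_locus", "ACGTACGT"),
    (2, "stockA_locus", "ACGTACGA"),
    (3, "stockB_locus", "ACGTTCGT"),
])
conn.executemany("INSERT INTO stock VALUES (?, ?)",
                 [(1, "stock_A"), (2, "stock_B")])
conn.executemany("INSERT INTO feature_stock VALUES (?, ?)", [(2, 1), (3, 2)])
conn.executemany("INSERT INTO feature_relationship VALUES (?, ?, ?)",
                 [(2, 1, "variant_of"), (3, 1, "variant_of")])
conn.execute("INSERT INTO analysis VALUES (1, 'locus genotyping')")
conn.executemany("INSERT INTO analysisfeature VALUES (?, ?)", [(1, 2), (1, 3)])

# Walk stock -> sample feature -> reference feature and analysis.
rows = conn.execute("""
    SELECT s.uniquename, f.residues, ref.uniquename, a.name
    FROM stock s
    JOIN feature_stock fs        ON fs.stock_id = s.stock_id
    JOIN feature f               ON f.feature_id = fs.feature_id
    JOIN feature_relationship fr ON fr.subject_id = f.feature_id
    JOIN feature ref             ON ref.feature_id = fr.object_id
    JOIN analysisfeature af      ON af.feature_id = f.feature_id
    JOIN analysis a              ON a.analysis_id = af.analysis_id
    ORDER BY s.uniquename
""").fetchall()
```

Each row pairs a stock with its own sequence, the reference it is compared against, and the analysis that produced it; the project-level links mentioned above are left out of the sketch.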

I would think we will see more data from genotyping by sequencing (without SNP discovery/haplotyping), so we may need to consider a way to store these data.

Thanks
Sook


Re: standardized database and CV usage

Karl O. Pinc
On Thu, 21 May 2015 11:29:08 -0400
Sook Jung <[hidden email]> wrote:

> Hi Sanjuro,
> I was thinking how you can store all the sequences for various
> germplasm collections and this could be a possibility. You can store
> all the sequences in feature table, link to germplasm by
> feature_stock, link all the feature to the reference sequence by
> feature_relationship table and then link all the features to one
> analysis (analysis table). The analysis can be link to a project (I
> think Stephen proposed to add analysis_project table?) that is also
> linked to a project for phenotypes for the same germplasm. Multiple
> projects can belong to one super-project using project_relationship
> table.
>
> I would think we will see more of data from genotyping by sequencing
> (without SNP discovery/haplotyping) so we may need to consider a way
> to store these data..

We are storing genotype data (really SNVs) of individuals
in the feature table.  There are multiple analyses of
the same individuals over time.  We don't use the stock table
but relate the feature to analysis and use dbxrefs to
relate back to the individuals.  See:

http://papio.biology.duke.edu/babase_chado_html/chado-vcf-load.html


Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein


Re: standardized database and CV usage

Sanjuro Jogdeo-2
Hi all, 

Thanks for the responses, and sorry it's taken so long to reply; I was on vacation.  It sounds like Sook's and Karl's solutions are similar in that both use the feature table to store sequences, with a feature_relationship entry linking each sample feature to the reference feature.  Is there a feature_stock table?  I don't see one in my implementation, though I do see a stock_genotype table.  So I could either link directly from stock through genotype, or go through the natural diversity tables.  I'm leaning towards the latter.

The list of CVs was helpful too (I'll thank the Gramene folks next time I pass them in the hallway :-) )

Thanks again for all the suggestions!  

Sanjuro




Re: standardized database and CV usage

Sanjuro Jogdeo-2
Hi all, 

I've run into an additional unanticipated complication with storing our genotyping data.  Part of the purpose of the study is to sort out the taxonomic relationships between some of the closely related strawberries, which means some genotypes are shared between organisms.  The feature table uses unique combos of organism and uniquename, so when I load genotypes that are shared between organisms, multiple rows are created, one for each organism, even though the genotypes are the same.  It feels like this could complicate queries a bit, and I'm a little worried about other problems that I can't anticipate right now.  
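
The duplication described above can be reproduced in a small sketch using Python's sqlite3 module. The table is a cut-down stand-in for Chado's feature table; the uniqueness constraint is written here as (organism_id, uniquename, type_id), which is how Chado defines it, and the organism IDs, type ID, and genotype name are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature (
        feature_id  INTEGER PRIMARY KEY,
        organism_id INTEGER NOT NULL,
        uniquename  TEXT NOT NULL,
        type_id     INTEGER NOT NULL,
        residues    TEXT,
        UNIQUE (organism_id, uniquename, type_id)
    )
""")

insert = ("INSERT INTO feature (organism_id, uniquename, type_id, residues) "
          "VALUES (?, ?, ?, ?)")

# The same genotype observed in two organisms becomes two rows.
conn.execute(insert, (1, "geno_001", 10, "ACGTACGT"))
conn.execute(insert, (2, "geno_001", 10, "ACGTACGT"))
row_count = conn.execute(
    "SELECT COUNT(*) FROM feature WHERE uniquename = 'geno_001'"
).fetchone()[0]

# Loading it again for the same organism violates the constraint.
try:
    conn.execute(insert, (1, "geno_001", 10, "ACGTACGT"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Queries for shared genotypes then have to group the per-organism copies.
shared = conn.execute("""
    SELECT uniquename, COUNT(DISTINCT organism_id)
    FROM feature
    GROUP BY uniquename, residues
    HAVING COUNT(DISTINCT organism_id) > 1
""").fetchall()
```

The extra GROUP BY is the kind of query complication the message anticipates: the "same" genotype is only recoverable by matching names or sequences across rows.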

I'm wondering if a better solution is to just create a custom table to store the sequence associated with the genotypes (essentially using a horrible method to add a column to the genotype table).  Our database is not (and likely will not be) large and I think the sin of an additional join will be less than the increasingly odd use of the feature table.  I thought actually adding a column to the genotype table would not be recommended.
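
A minimal sketch of the custom-table idea, again using sqlite3 as a stand-in for the real database. The table name genotype_sequence, its columns, and the sample data are hypothetical, and the genotype table is reduced to three columns; note also that SQLite does not enforce the VARCHAR(255) length limit the way PostgreSQL does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for Chado's genotype table.
CREATE TABLE genotype (
    genotype_id INTEGER PRIMARY KEY,
    uniquename  TEXT NOT NULL UNIQUE,
    description VARCHAR(255)  -- length not enforced by SQLite
);
-- Hypothetical custom table: one sequence per genotype.
CREATE TABLE genotype_sequence (
    genotype_id INTEGER PRIMARY KEY REFERENCES genotype(genotype_id),
    residues    TEXT NOT NULL
);
""")

sequence = "ACGT" * 300  # 1200 bp placeholder sequence

conn.execute(
    "INSERT INTO genotype (genotype_id, uniquename, description) "
    "VALUES (1, 'geno_001', 'hap_1')"
)
conn.execute("INSERT INTO genotype_sequence VALUES (1, ?)", (sequence,))

# The one additional join needed to get a genotype with its sequence.
row = conn.execute("""
    SELECT g.uniquename, g.description, LENGTH(gs.residues)
    FROM genotype g
    JOIN genotype_sequence gs USING (genotype_id)
""").fetchone()
```

Because the sequence hangs off genotype rather than feature, a genotype shared between organisms is stored once, at the cost of that extra join.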

Any thoughts?

Cheers, 

Sanjuro

