Loading GFF / Fasta Questions | CHADO

Loading GFF / Fasta Questions | CHADO

Shane McCoy
Hello! In the coming weeks I am preparing to load data for a fish genome into PostgreSQL/Chado, to be used with JBrowse/WebApollo.

The files I am receiving from the Maker output come in 5 parts, as they were too large to do in one take. Each of the 5 GFF files comes with Protein and Transcript .fasta files (one pair each for Augustus, SNAP and Maker, so 6 total per GFF file), as well as one .fa file combining all of those .fasta files.
This is essentially what I'm looking at:
GFF_part1_of_5.gff
      AUG_Protein_1_5.fasta
      AUG_Tranx_1_5.fasta 
      SNAP_Protein_1_5.fasta
      SNAP_Tranx_1_5.fasta etc(maker protein/tranx)
 AUG_SNAP_MAKER_PROTEIN_TRANX_1_5.fa (all in one fasta)


Following the 2013 Tutorial, I understand how to load all 5 GFF files with the bulk loader, but I'm not fully grasping how to handle the .fasta files.
Please bear with me, as I am not used to working with these files :)

The 2013 Tutorial mentions 'fasta --> GFF' under 'Preparing GFF data for loading' but gives no details.
How are the .fasta files loaded into the database? I saw a short description of converting GenBank files into GFF3 files. Is this the proper course, to convert the .fasta to GFF?

If so, would I be OK just converting the (all-in-one) .fa file, or should I convert all 6 .fasta files individually?

Also, is there any issue with the fact that each .fasta file comes in 5 parts, to go with the 5 parts of the GFF?

I'd appreciate any feedback. I am looking forward to learning how to use Chado more extensively in the future.
Thanks for your time!
Shane




_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Re: Loading GFF / Fasta Questions | CHADO

Andrew Farmer
Hi Shane-
If I'm understanding your question correctly, you're asking how to get the sequence data
in the Transcript/Protein fasta files into the feature.residues field in Chado for the mRNA and polypeptide
features that the GFF bulk loader creates?

There may be better approaches, but I believe this can be done using gmod_bulk_load_gff3.pl with the --fastafile option.
Others on the list can probably give you better guidance about this, but in my experience it can be a little tricky because
of the way the loader assigns auto-generated names to the polypeptide features it creates. In the loading process our
group is using, we have so far worked around this by first running the loader on the GFF, then updating the auto-assigned
polypeptide names with SQL, then running the loader again with the --fastafile option supplied. One potential difficulty is that, I
believe, the code that looks up features matching the IDs in the fasta headers has no information
about the type of feature, so if an mRNA has the same uniquename as its corresponding polypeptide, the loader
will likely get confused.
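
The ID clash described above can be checked before loading. The following is only a minimal Python sketch, not part of the loader: it assumes the ID is the first whitespace-delimited token of each fasta header, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the ID (first whitespace-delimited token) of each fasta header."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def ambiguous_ids(ids_by_file: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each ID seen under more than one label to the labels containing it."""
    seen = defaultdict(set)
    for label, ids in ids_by_file.items():
        for seq_id in ids:
            seen[seq_id].add(label)
    return {seq_id: labels for seq_id, labels in seen.items() if len(labels) > 1}


# Hypothetical records standing in for one transcript file and one protein file.
transcripts = [">gene1-mRNA-1 some description", "ACGT", ">gene2-mRNA-1", "GGCC"]
proteins = [">gene1-mRNA-1 protein", "MK", ">gene3-mRNA-1", "MA"]

clashes = ambiguous_ids({
    "transcript": fasta_ids(transcripts),
    "protein": fasta_ids(proteins),
})
print(clashes)
```

Any ID reported here appears in both a transcript and a protein file, so a lookup by ID alone could not tell which feature the sequence belongs to.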

I think there have been a few discussions on this list before about possible changes to the way this loader handles
polypeptide naming, so maybe this would be a good use case for driving that topic forward?

Hope that helps, or at least provokes others with more knowledge of the loader to suggest better alternatives...
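
The two-pass workaround described above (load the GFF, fix the polypeptide names with SQL, then run the loader again with the fasta file) could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe: the organism name "myfish" is a placeholder, the thread does not say whether the second pass also needs the GFF file (here it gets only --fastafile), and the commands are only built and printed, not executed.

```python
import shlex
from typing import List, Optional

LOADER = "gmod_bulk_load_gff3.pl"


def loader_command(organism: str,
                   gff_file: Optional[str] = None,
                   fasta_file: Optional[str] = None) -> List[str]:
    """Build an argument list for the bulk loader without running it."""
    if gff_file is None and fasta_file is None:
        raise ValueError("need a GFF file, a fasta file, or both")
    cmd = [LOADER, "--organism", organism]
    if gff_file is not None:
        cmd += ["--gfffile", gff_file]
    if fasta_file is not None:
        cmd += ["--fastafile", fasta_file]
    return cmd


organism = "myfish"  # placeholder; use the organism name registered in your Chado

# Pass 1: load the annotation for one part.
first_pass = loader_command(organism, gff_file="GFF_part1_of_5.gff")
# (Between the passes, the polypeptide names would be updated with SQL.)
# Pass 2: attach sequence from the matching fasta file (assumed invocation).
second_pass = loader_command(organism, fasta_file="AUG_Protein_1_5.fasta")

for cmd in (first_pass, second_pass):
    print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to review the commands for all five parts before anything touches the database.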

Andrew Farmer






On 11/5/14 2:26 PM, Shane McCoy wrote:
[snip]
-- 
...all concepts in which an entire process is semiotically concentrated
elude definition; only that which has no history is definable.

Friedrich Nietzsche

Re: Loading GFF / Fasta Questions | CHADO

vkrishna
Hi Shane,

Given that you have Maker output, have you by any chance tried using the `maker2chado` script?
The developers of Maker have built an easy-to-use wrapper, optimized specifically for the GFF3 (containing embedded FASTA) generated by their pipeline. Just an FYI.

More documentation about this and other very useful scripts (e.g. maker2jbrowse, chado2gff3, etc.) is available within the MAKER release tarball.
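
For readers unfamiliar with GFF3 that contains embedded FASTA: in the GFF3 format, a `##FASTA` directive line marks where the annotation ends and the sequence records begin. The Python sketch below only illustrates that layout; it is not how maker2chado works internally, and the function name and sample lines are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_gff3(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split GFF3 lines into (annotation, fasta) at the ##FASTA directive."""
    annotation: List[str] = []
    fasta: List[str] = []
    in_fasta = False
    for line in lines:
        line = line.rstrip("\n")
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue  # the directive itself belongs to neither part
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


# Hypothetical miniature file: one gene feature followed by embedded sequence.
example = [
    "##gff-version 3\n",
    "scaffold_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "##FASTA\n",
    ">scaffold_1\n",
    "ACGTACGT\n",
]
annotation, fasta = split_gff3(example)
print(len(annotation), "annotation lines;", len(fasta), "fasta lines")
```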

Thank you.
Vivek

On Nov 6, 2014, at 10:52 AM, Andrew Farmer <[hidden email]> wrote:

[snip]