Re: Re: [GMOD-devel] GMODTools package preview


Re: Re: [GMOD-devel] GMODTools package preview

Don Gilbert
Scott,

Thanks much for the quick tryout.  The preliminary configurations
are critical; I've used ENV{GMOD_ROOT} as a base for that, and see
your system won't allow you to write there. In the top of each
primary configuration file (e.g. sample conf/bulkfiles/sgdbulk.xml
or your rice revision), find

<opt
  name="sgdbulk"
  relid="5"
  date="20051129"
  ROOT="${GMOD_ROOT}/"
  TMP="${GMOD_ROOT}/tmp"
  datadir="genomes/Saccharomyces_cerevisiae"

Change ROOT, TMP, and datadir to paths that you can write to.
If you don't have GMOD_ROOT defined in your environment,
it will use the GMODTools/ folder from the software, and should work
with the sample sgdlite data set.

One aspect I've not stressed well in the documents: proper configuration
for a given data release set is essential to get it working, and this
is an unusual program in that it need only be run once successfully for
such a data release set, then the generated bulk files can be used by all.
So expect to spend some time working out what all those configuration
options mean (they lack good documentation) before it works for
a new data set.

Once a data release set is configured to work, it should work repeatably (once
things like a writable data root directory are sorted out).
I'd recommend testing first with the sgdlite data set, and after getting that
to work, move on to a new data set.

I hope to add some pre-make validation checks before long that will help with
basic steps like "is your data output directory there?", "does your chado
genome db have chromosomes/golden_paths that can be found?", "does the
configured sql actually return data?"  That should save folks from running
it for hours on big datasets while wondering whether they will get usable outputs.
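
A minimal sketch of the first of those checks (is the data output directory there and writable?), written as a shell function. The name check_output_dir is hypothetical, and this is not the validation code in bulkfiles.pl; it only illustrates the idea.

```shell
# Hypothetical pre-flight check; not part of bulkfiles.pl.
# Succeeds if the directory exists (or can be created) and is writable.
check_output_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        # Try to create it, as a first run would have to.
        mkdir -p "$dir" 2>/dev/null || {
            echo "Need writable data dir=$dir" >&2
            return 1
        }
    fi
    if [ ! -w "$dir" ]; then
        echo "Need writable data dir=$dir" >&2
        return 1
    fi
    return 0
}
```

Calling it with the combined ROOT and datadir path from the config before starting a long run gives an immediate answer instead of a failure hours later.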

Take a look at $ROOT/$datadir/$releasedir/tmp/featdump/ (from your config values)
for a 'chromosomes.tsv', an essential first step.  If that file doesn't exist
or doesn't look valid for your organism's genome, the rest won't work.
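
That check can be sketched as a small shell function. The function name is hypothetical; the only thing taken from the message is the file name chromosomes.tsv inside the featdump directory.

```shell
# Hypothetical helper; pass the featdump directory, i.e.
# $ROOT/$datadir/$releasedir/tmp/featdump with your config values filled in.
check_chromosomes_tsv() {
    featdump_dir=$1
    file="$featdump_dir/chromosomes.tsv"
    # -s is true only when the file exists and is not empty.
    if [ -s "$file" ]; then
        echo "ok: $file has $(wc -l < "$file" | tr -d ' ') line(s)"
    else
        echo "missing or empty: $file" >&2
        return 1
    fi
}
```

A non-empty file is only a necessary condition; whether the rows look right for the organism still needs a human eye.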

- Don


-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema

Re: Re: [GMOD-devel] GMODTools package preview

Scott Cain-2
Don,

I'll try out sgd data tomorrow.  I got 'out of memory' errors from DBI
after several hours (I accidentally closed the window, so I can't show
them to you).

Scott



On Wed, 2005-11-30 at 16:17 -0500, Don Gilbert wrote:

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         [hidden email]
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory



_______________________________________________
Gmod-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-devel

Re: Re: [GMOD-devel] GMODTools package preview

Don Gilbert
In reply to this post by Don Gilbert
Scott,

There is an update here (same release name, new date)
  curl -O http://eugenes.org/gmod/GMODTools/GMODTools-1.0.zip

which adds a few more validations:
dgbook%  perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
  ..
  ERROR: Couldn't create path /usr/local/gmod//genomes/Saccharomyces_cerevisiae: ..
  ** Need writeable data dir=/usr/local/gmod//genomes/Saccharomyces_cerevisiae
  Change configuration datadir

For those of you, like Scott, who install gmod packages according
to directions in /usr/local/gmod and have GMOD_ROOT pointing there
and can't or don't want to write data there, use this addition:

  env GMOD_ROOT=`pwd` perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make

Or edit the bulkfiles config file to point to another data root.
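
A small sketch of how that choice could be scripted, assuming you want to keep GMOD_ROOT when it is writable and fall back to the current directory otherwise. The helper name pick_gmod_root is hypothetical and not part of GMODTools.

```shell
# Hypothetical helper: print $GMOD_ROOT when it is set, a directory, and
# writable; otherwise fall back to the current directory.
pick_gmod_root() {
    if [ -n "${GMOD_ROOT:-}" ] && [ -d "$GMOD_ROOT" ] && [ -w "$GMOD_ROOT" ]; then
        printf '%s\n' "$GMOD_ROOT"
    else
        pwd
    fi
}

# Intended use, from the GMODTools/ folder (same command as above):
#   env GMOD_ROOT="$(pick_gmod_root)" perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
```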

Even though your rice chado db is no doubt large and will take
at least a few hours to write out all features to bulk files, the first
step of finding/writing a chromosomes table is quick (<minutes).
If it fails, the rest of the job can be killed.

-- Don



Re: [Gmod-schema] Re: GMODTools package preview

Scott Cain-2
Don,

I finally got back around to this.  I tried the sgdlite dump from
Princeton; here is the output:

benicia:~/GMODTools cain$  perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make
Config: name = yeast title = SGD Lite date = 20051129
Config: name = filesets title = Bulkfiles fileset definitions date = 20040821
Config: name = organisms date = 20051129
Config: name = featuresets title = Chado Feature mapping info date = 20040821
Config: name = yeast title = SGD Lite date = 20051129
Config: name = filesets title = Bulkfiles fileset definitions date = 20040821
Config: name = organisms date = 20051129
Config: name = featuresets title = Chado Feature mapping info date = 20040821
Config: name = chadofeatsql title = Chado DB SQL date = 20051129
Config: name = chadofeatsql title = Chado DB SQL date = 20051129
Config: name = chadofeatconv title = Chado DB Feature info
Automaking feature_table files
Missing feature_table files; make with -featdump
Automaking dna files
Missing fff files; need -format fff
Missing dna files; need -dnadump
Missing fasta files; need -format fasta
Config: name = blastfiles title = Blast index writer
Missing formatdb: ${ARGOS_ROOT}/common/servers/blast/Bin/formatdb at lib/Bio/GMOD/Bulkfiles/BlastWriter.pm line 90.
Bulkfiles done. result=fff+gff=ok, fasta=ok, blast=ok

formatdb failing doesn't surprise me, as it isn't installed, but what
are the messages about -featdump, -format and -dnadump?  If I try adding
-dnadump to the command line, I get lots more errors (looks like one per
chromosome):

missing dumpfile /usr/local/gmod/genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23/tmp/featdump/chadofeat-chrXIII.tsv at lib/Bio/GMOD/Bulkfiles/FeatureWriter.pm line 295.

I did get lots of stuff in the tmp directory, including chromosomes.tsv
and feature files for each chromosome.

So where should I go from here?

Thanks,
Scott


On Thu, 2005-12-01 at 00:41 -0500, Don Gilbert wrote:

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         [hidden email]
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory




Re: [Gmod-schema] Re: GMODTools package preview

Don Gilbert
In reply to this post by Don Gilbert

Scott,

There is an error early in the process, maybe at sql-dump stage, from this:
Automaking feature_table files
Missing feature_table files; make with -featdump

Use the '-debug' flag to get more info, including things like SQL errors.  
  perl -Ilib bin/bulkfiles.pl -conf sgdbulk -make -debug

Maybe there is a mismatch in Postgres db access information.  Unless you
edited sgdbulk.xml, you are trying to use my non-standard Postgres port.

I'll change this before file release; I never use
standard ports for genome databases due to conflicts with standard uses/software.
I know this could use ENV{CHADO_DB_PORT}, but with so many configs needing
special treatment for file releases, setting it in the config was more precise;
maybe I should let ENV{CHADO_DB_PORT} override it.
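
The override rule being considered (environment wins over the config value when it is set) can be sketched in shell as below. This is an illustration only: resolve_db_port is a hypothetical name, and bulkfiles.pl does not do this as of this message.

```shell
# Hypothetical helper, not bulkfiles.pl behaviour: use CHADO_DB_PORT when
# it is set and non-empty, otherwise the port passed as the first argument
# (standing in for the port="..." value from the config file).
resolve_db_port() {
    config_port=$1
    printf '%s\n' "${CHADO_DB_PORT:-$config_port}"
}
```

With CHADO_DB_PORT unset, resolve_db_port 7302 prints 7302; with CHADO_DB_PORT=5432 in the environment it prints 5432.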

edit GMODTools/conf/bulkfiles/sgdbulk.xml
<opt
  name="sgdbulk"
  relid="5"
  date="20051129"
  ROOT="${GMOD_ROOT}/"
  datadir="data/genomes/Saccharomyces_cerevisiae"
  >
  ..
  <db
    driver="Pg"
    name="sgdlite"
    host="localhost"
    port="7302"   << edit here; default maybe should be ${CHADO_DB_PORT} (need to test)
    user=""  << default likely should be ${CHADO_DB_USERNAME}
    password="" <<  "  ${CHADO_DB_PASSWORD}
    />

Look at files in
genomes/Saccharomyces_cerevisiae/sgdlite_2005_08_23/tmp/featdump/

Does a chromosomes.tsv exist there and have lines like this?
If not, the first step SQL dump of chromosome features failed.
melon.% more chromosomes.tsv
chrI    1       230208  0       10      chromosome      chrI    chrI    212     species Saccharomyces_cerevisiae
chrII   1       813178  0       10      chromosome      chrII   chrII   507     species Saccharomyces_cerevisiae
...
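
A quick way to count such lines, as a sketch. It assumes the file is tab-separated and that the feature type sits in column 6, which is what the sample lines above suggest but is not confirmed anywhere in this thread; the function name is hypothetical.

```shell
# Hypothetical check. Assumes tab-separated columns with the feature
# type ("chromosome") in column 6, as in the sample lines.
count_chromosome_rows() {
    # n + 0 forces a numeric 0 when no row matched.
    awk -F '\t' '$6 == "chromosome" { n++ } END { print n + 0 }' "$1"
}
```

A count of 0, or one far from the number of chromosomes you expect for the organism, points to a failed first SQL dump.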

For sgdbulk files, these are the only active SQL dump files:
   1407 Nov 30 20:02 chromosomes.tsv      < 1st sql dump
7614040 Nov 30 20:02 features.tsv         < 2nd sql dump

These are produced from the above two:
 126766 Nov 30 20:02 chadofeat-scerchrI.tsv
 505105 Nov 30 20:02 chadofeat-scerchrII.tsv
 ...

-=- Don
.
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [hidden email]--http://marmot.bio.indiana.edu/



Re: [Gmod-schema] Re: GMODTools package preview

Scott Cain-2
It seems that result files from earlier failures were somehow poisoning
the stew, because when I deleted the output directory and reran,
everything worked fine.  I imagine you want to keep those directories to
work as a cache, but you might want a flag (or already have one) to do
the equivalent of a make clean.
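
Until such a flag exists, a manual stand-in might look like the sketch below. It assumes the stale files are the intermediate dumps under the release directory's tmp/ folder (where featdump/ lives); if that is not enough, deleting the whole output directory, as described above, is the fallback. The function name is hypothetical.

```shell
# Hypothetical "make clean" stand-in; bulkfiles.pl has no such flag in
# this thread. Removes only the tmp/ subdirectory of a release directory.
clean_release_tmp() {
    release_dir=${1:-}
    # Refuse an empty argument so the path never collapses to "/tmp".
    if [ -z "$release_dir" ]; then
        echo "usage: clean_release_tmp RELEASE_DIR" >&2
        return 2
    fi
    if [ -d "$release_dir/tmp" ]; then
        rm -rf -- "$release_dir/tmp"
        echo "removed $release_dir/tmp"
    else
        echo "nothing to clean in $release_dir"
    fi
}
```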

On Fri, 2005-12-09 at 13:30 -0500, Don Gilbert wrote:

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         [hidden email]
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory




Re: [Gmod-schema] Re: GMODTools package preview

Don Gilbert
In reply to this post by Don Gilbert
Scott,

Thanks; I'll add something like a 'make clean'. In fact a significant
amount of effort for complex genome databases goes into checking and
editing those initial sql dump files to correct them and
ensure all genome features are properly represented (in my experience).

So "make clean" has been less interesting than "make with intelligent
corrections", which ends up being a person's effort rather than a
software design issue.

One of the hidden issues with Chado genome databases is that, unless
you are working with a very simply populated database (e.g. GFF input),
and unless/until there is a standardized way to put everything into
such a database so that software can follow it, it is a chore to know
if you have extracted all relevant feature/sequence information.
One also wants the ability to make corrections at this stage in
producing data for public consumption - correcting non-standard (non-SO)
terms, rearranging/correcting names, etc.  

- Don
..

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [hidden email]--http://marmot.bio.indiana.edu/

