Re: Lucene-lite : A GBrowse GFF data adaptor

Don Gilbert
Lincoln,

I'll write a bit about lucene and lucegene adaptors for use with GBrowse as per
one of your suggestions.  Here appended is my just-finished adaptor comparison summary.
As you can tell, I'm trying to get others to take a look at Lucene/Lucegene
for genome databases.  It makes a lot of use-sense to me, and I'm hoping
the lucene-lite perl module will show hesitant perl-centric
folks it can play well in a mixed-language database environment.

-- Don


Lucene outperforms MySQL, BerkeleyDB, and PostgreSQL for
genome map database searches.

GBrowse (Generic Genome Browser, http://www.gmod.org/) is a widely
used program for displaying maps of genome data in
biology/bioinformatics. One need it serves is helping biologists
quickly and easily locate features of interest among tens of millions
of genome features for an organism.

Lucene, and the Lucegene project built on it, are well suited to
rapidly and easily searching the complex, diverse and large volume of
genome data.  They are useful for searching genome sequences,
literature and experimental data, interactions among genes, as well
as other categories of genome information.  Lucegene leverages
the speed, high-volume capability and data-source adaptability of Lucene
for searching multi-gigabyte bioinformatics databases.

Though Lucene is focused more on text searches and less on numerics,
the opposite of relational databases, it is also capable at numeric
searches. A demanding case with genomes is quickly showing biologists
the locations of their favorite genes and other features among
millions of features spread across hundreds of millions of possible
locations.
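
As a rough illustration of how a text index can answer numeric range
searches: Lucene releases of this period compare index terms as strings,
so a common trick is to zero-pad coordinates to a fixed width so that
string order matches numeric order. The sketch below is an assumption
about the general approach, not the Lucegene code; WIDTH, pad and
terms_in_range are made-up names, and the positions are invented.

```python
# Sketch: zero-pad coordinates so that string order matches numeric order.
# WIDTH and the sample positions are illustrative assumptions.
from bisect import bisect_left, bisect_right

WIDTH = 10  # enough digits for any chromosome coordinate


def pad(position: int) -> str:
    """Encode a non-negative coordinate as a fixed-width text term."""
    return str(position).zfill(WIDTH)


def terms_in_range(sorted_terms, low: int, high: int):
    """Return the terms between low and high (inclusive), compared as strings."""
    first = bisect_left(sorted_terms, pad(low))
    last = bisect_right(sorted_terms, pad(high))
    return sorted_terms[first:last]


positions = [950, 12000, 480000, 1200000, 99]
unpadded = sorted(str(p) for p in positions)  # string sort of raw numbers
padded = sorted(pad(p) for p in positions)    # string sort of padded terms

# Range search 1000..500000 done purely with string comparisons.
hits = [int(term) for term in terms_in_range(padded, 1000, 500000)]
```

Without padding, the string sort puts "1200000" ahead of "480000", so a
string range search would return wrong results; with padding the two
orders agree.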

  Time (seconds) for GBrowse web display, 30 iterations
  at different map locations on fruitfly (dmel) genome
  ----------------------------------------------------------
                        Server3      Server2     Relative
    GBrowse-Adaptor   Mean    SE    Mean   SE   time (ave.)  
  dmel_lucegene_500k   5.4   0.15   1.86  0.05    100  
  dmel_lucene_500k     6.1   0.13   2.23  0.05    117    
  dmel_mysql_500k      7.9   0.31   2.14  0.06    128  
  dmel_bdb_500k        8.3   0.53   4.10  0.32    187  
  dmel_chadofc_500k   25.9   0.91   9.86  0.77    510  
  ----------------------------------------------------------
 
This uses a 500kb map range; differences increase with map range.
These all use the same data. Most of the response time is spent
drawing maps once features are extracted from the database, but
adaptor speed is one factor that can make displays faster. There are
slight differences in displays due to configurations and how each
adaptor works, but no significant differences in the data returned by
the adaptors. Lucene and MySQL indices are shared across platforms
here. BerkeleyDB and Postgres databases cannot be, and had to be
regenerated for each server. Server2 is x64-Solaris-10 (yr2005),
Server3 is ppc-MacOSX-10.3 (yr2004).
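
For readers who want to run a similar comparison, here is a minimal
sketch of how Mean and SE columns like those in the table can be
computed, assuming SE means the standard error of the mean over the
repeated page loads. The timings are invented for illustration and are
not taken from the benchmark.

```python
# Sketch: mean and standard error of the mean (SEM) for repeated timings.
# The sample timings are hypothetical, not the benchmark data.
import math
import statistics


def mean_and_sem(samples):
    """Return (mean, SEM), where SEM = sample standard deviation / sqrt(n)."""
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, sem


timings = [5.1, 5.6, 5.4, 5.9, 5.0]  # seconds per display, made up
mean, sem = mean_and_sem(timings)
print(f"mean={mean:.2f}s  SE={sem:.2f}s")
```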

The fastest adaptor here, Lucegene, has algorithms tuned for genome
map range searches. The simple lucene adaptor is comparable directly
to the mysql and berkeleydb adaptors in operation, using Lucene as
persistent searchable data storage without Lucene-optimized functions.

These results, while not dramatic in the speed differences except for
the slow Chado Postgres adaptor, add to the other advantages of this
cross-platform, Java-based system, even when combined with Perl-based
tools such as GBrowse. One important but difficult-to-measure factor
is the cost of management, where genome data are frequently updated
from diverse sources.  Installing Lucene for this use is a simple
matter of adding the Java library to the map software.  Lucene databases
are easy to create from source data, and can be copied and shared
across computer systems, whereas compiled software and binary databases
usually need to be regenerated by informaticians.

GBrowse Perl Adaptor key:
  lucegene -  lucegene.pm GFF   (Lucene v1.9; Java 1.4/1.5)
  lucene   -  simple lucene.pm GFF (Lucene v1.9; Java 1.4/1.5)
  bdb      -  berkeleydb.pm GFF (BerkeleyDB v4.2)
  mysql    -  mysqlopt.pm GFF   (MySQL v4.0x)
  chadofc  -  chado.pm DAS, modified for flybase Chado db (Postgres v7 & 8)
These are available through GMOD projects for use with GBrowse.

Preliminary tests suggest that Lucene may outperform
Lion Bioscience's SRS at basic bio-databank search and retrieval, such as
with the Uniprot database.

See also
http://sourceforge.net/mailarchive/forum.php?thread_id=8094404&forum_id=31947
http://www.gmod.org/, http://www.gmod.org/lucegene/,
and http://lucene.apache.org/

The archive at ftp://ftp.eugenes.org/eugenes/gbrowse/
has a set of Lucene indices of genomes for Worm, Yeast, Rice,
and 9 Fruitfly species, along with GBrowse configuration files. You
should be able to copy these, add the Lucene-lite and
Lucegene adaptors to GBrowse, and display the genomes from your favorite
server computer.

Example servers with these data and comparisons to other
GBrowse adaptors (Chado-Pg, MySQL, BerkeleyDB) are here:
 http://server2.eugenes.org/gbrowse/  (Sun-Solaris-x64)
 http://server3.eugenes.org/gbrowse/  (Apple-MacOSX-ppc)

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [hidden email]--http://marmot.bio.indiana.edu/


-------------------------------------------------------
SF.Net email is Sponsored by the Better Software Conference & EXPO
September 19-22, 2005 * San Francisco, CA * Development Lifecycle Practices
Agile & Plan-Driven Development * Managing Projects & Teams * Testing & QA
Security * Process Improvement & Measurement * http://www.sqe.com/bsce5sf
_______________________________________________
Gmod-gbrowse mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse

Re: Lucene-lite : A GBrowse GFF data adaptor

Don Gilbert
> The benchmarks are quite telling! How does the Perl adaptor talk to the
> Java library?

Two ways.  I'm using the fastest for the tests, which is running Lucene or Lucegene
as a server (a simple Java socket server, not HTTP or other protocols).  The Perl
module uses Socket IO for that.  The other way is just to launch the Java application
for each GBrowse map invocation (and use a pipe call between Perl and Java).  This
is a bit slower than the socket server, but not dramatically so, and can be used
effectively by folks wanting minimal management hassles.
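
To make the socket variant concrete, here is a small sketch of a
line-oriented query server and its client. It is only an illustration
in Python of the general pattern (the real pieces are Java and Perl);
the one-line query format, the FEATURES table and all function names are
hypothetical and are not the Lucene-lite protocol.

```python
# Sketch: a plain TCP query server and client (no HTTP), line-oriented.
# The query syntax and the canned results below are invented placeholders.
import socket
import socketserver
import threading

# Hypothetical stand-in for the index search; a real server would query Lucene.
FEATURES = {
    "2L:1..500000": ["2L\tgene\tfeature_A", "2L\tgene\tfeature_B"],
}


class QueryHandler(socketserver.StreamRequestHandler):
    """Read one query line, write the matching feature lines, then close."""

    def handle(self):
        query = self.rfile.readline().decode("utf-8").strip()
        for feature in FEATURES.get(query, []):
            self.wfile.write((feature + "\n").encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Start the server on a background thread; port 0 lets the OS pick a port."""
    server = socketserver.TCPServer((host, port), QueryHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_server(host, port, query):
    """Client side: send one query line and read result lines until EOF."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((query + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return [line.rstrip("\n") for line in reply]


server = start_server()
host, port = server.server_address
results = query_server(host, port, "2L:1..500000")
server.shutdown()
server.server_close()
```

The server stays running between requests, so each map request pays
only for a connection rather than for starting a new process.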

For the Lucene-lite adaptor, the Java source files (3 - Indexer, Searcher and Socket server)
are about a page of code each - minimal and hopefully readable. The perl code for
using this is a few subroutines.  Most of 'lucene.pm' is copied from berkeleydb.pm
except for changes to the search backend calls.

- Don



RE: Lucene-lite : A GBrowse GFF data adaptor

Dmitri Bichko
In reply to this post by Don Gilbert
Quick question: do these GFF modules use Lincoln's binning scheme when
doing searches for 500Kb and larger?  I've done a few tests a while ago
which indicate that 250Kb-500Kb is about the point where a straight
(chr,start,stop) index starts to outperform the binning, and the
difference grows with size.

Also, are the DB tables CLUSTERed on the positional index?

It seems a little odd that we can't beat a general purpose indexing
engine with a highly tuned one with a lot of knowledge of the underlying
data.

Oh and just a nitpick: it would be nice to see a benchmark for Postgres
when it's not coupled with the slow (for this purpose) Chado schema.  To
a lot of people these benchmarks will perpetuate the myth that Postgres
is slower than MySQL/BDB for this sort of thing (I'm a big Postgres
groupie).

Dmitri

RE: Lucene-lite : A GBrowse GFF data adaptor

Dmitri Bichko
In reply to this post by Don Gilbert
Have you tried with the perl port of Lucene
(http://search.cpan.org/~tmtm/Plucene-1.24/lib/Plucene.pm)?  It's
definitely slower than Lucene (being a straight "translation"), but that
might be cancelled out by having to start the VM.  The java server would
still be faster, but some people might like a native library that
doesn't need a server.

I believe the indices are compatible up to version 1.3 of Lucene.

Dmitri

RE: Lucene-lite : A GBrowse GFF data adaptor

Don Gilbert
In reply to this post by Don Gilbert

Dmitri,

The Perl port, and C port, of Lucene are lagging behind the Java version.
The Java VM startup time isn't that much of a cost.  I've tried gcj
(GNU-compiled Java to executable), which does away w/ VM startup, and it doesn't
really speed things up, and isn't cross-platform compatible.

You don't need to use the server-variant of lucene-gbrowse; it just cuts out
some of the startup-time.  Otherwise running the adaptor as a command-line
java invocation is pretty speedy.

- Don
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [hidden email]--http://marmot.bio.indiana.edu/



RE: Lucene-lite : A GBrowse GFF data adaptor

Don Gilbert
In reply to this post by Don Gilbert

|From [hidden email]  Tue Sep  6 11:06:23 2005
|
|Quick question: do these GFF modules use Lincoln's binning scheme when
|doing searches for 500Kb and larger?  I've done a few tests a while ago
|which indicate that 250Kb-500Kb is about the point where a straight
|(chr,start,stop) index starts to outperform the binning, and the
|difference grows with size.

The lucene-lite variant uses the same binning as BerkeleyDB.pm (the same, I think,
as mysql-gff).  The lucegene one uses a lucene-specific binning field,
which I think is where it gets some of its speed over the others.  I have
tested at large (> 1MB) search sizes, but don't think I benchmarked them
(at that size, the display gets crowded w/ features, so it isn't as much
of a real-world test). I could run such tests - maybe will.
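
For anyone unfamiliar with the binning being discussed, here is a rough
sketch of the general idea: each feature is filed under the smallest
power-of-ten bin that fully contains it, and a range query only looks
at bins that touch the query range before applying the exact overlap
test. This is written from the general description of the scheme, not
copied from the GFF adaptors; the bin sizes, function names and sample
features are assumptions, and it assumes no feature crosses a
largest-bin boundary.

```python
# Sketch: hierarchical binning for range queries, plus the plain overlap test.
# Bin sizes and the sample features are illustrative assumptions.
MIN_BIN = 1_000        # smallest bin width in bases (assumed)
MAX_BIN = 100_000_000  # largest bin width in bases (assumed)


def bin_for(start, stop, min_bin=MIN_BIN):
    """Return (tier, index) of the smallest bin that contains start..stop."""
    tier = min_bin
    while start // tier != stop // tier:
        tier *= 10
    return tier, start // tier


def bins_overlapping(start, stop, min_bin=MIN_BIN, max_bin=MAX_BIN):
    """For each tier, give the first and last bin index the query touches."""
    ranges = []
    tier = min_bin
    while tier <= max_bin:
        ranges.append((tier, start // tier, stop // tier))
        tier *= 10
    return ranges


def overlaps(feature, start, stop):
    """The straight (start, stop) test that the bins only pre-filter."""
    return feature[0] <= stop and feature[1] >= start


def binned_search(features, start, stop):
    """Pre-filter by bin, then confirm with the exact overlap test."""
    wanted = {tier: (lo, hi) for tier, lo, hi in bins_overlapping(start, stop)}
    found = []
    for feature in features:
        tier, index = bin_for(*feature)
        lo, hi = wanted.get(tier, (1, 0))  # empty range for unknown tiers
        if lo <= index <= hi and overlaps(feature, start, stop):
            found.append(feature)
    return found


features = [(1200, 1800), (9500, 10500), (250_000, 260_000), (900_000, 905_000)]
binned = binned_search(features, 1, 500_000)
straight = [f for f in features if overlaps(f, 1, 500_000)]
```

For a 500kb query the smallest tier alone spans about 500 bins, which
may be part of why a plain (chr,start,stop) index starts to catch up at
larger ranges.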

|Also, are the DB tables CLUSTERed on the positional index?

I didn't do any special tuning of any of the underlying software (mysql,
postgres, berkeley db or lucene). I did use for mysql the 'recommended'
medium or large config setup, and for postgres I've used tuning parameters
suggested over the last few years with chado db developers. One of the
values of using lucene, I think, is good performance w/o special knowledge
or expertise in setting up such a database.

|It seems a little odd that we can't beat a general purpose indexing
|engine with a highly tuned one with a lot of knowledge of the underlying
|data.
|
|Oh and just a nitpick: it would be nice to see a benchmark for Postgres
|when it's not coupled with the slow (for this purpose) Chado schema.  To
|a lot of people these benchmarks will perpetuate the myth that Postgres
|is slower than MySQL/BDB for this sort of thing (I'm a big Postgres
|groupie).

As with all benchmarks, there is always another way to test to show
different advantages/disadvantages.  I tried to use a 'fair' real world
test of GBrowse as users would typically use it. Yes, I'm sure most of
the Chado-postgres slowness is due to the relatively complex chado schema,
and the cost of doing multiple joins, etc. to get the right feature data.

-- Don

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [hidden email]--http://marmot.bio.indiana.edu/



RE: Lucene-lite : A GBrowse GFF data adaptor

Don Gilbert
In reply to this post by Don Gilbert

Silly me, I should have compared speeds with and without
the server option more carefully (I 'eyeballed' it the first time).
The Lucene-lite gbrowse adaptor runs at just about the same speed
using command-line invocations of java as using it through a separately
started server.

                   mean time (secs)  S.E.M.
dmelb4_lucene_500k     4.655     0.13      = server variant
dmelb4_lucene_sa_500k  4.696     0.09      sa = stand-alone variant
Run on Server3.eugenes.org (MacOSX java 1.4; other JVMs may differ)
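
A quick arithmetic check on the two rows above, as a sketch: assuming
the two sets of runs are independent, the standard error of the
difference is the square root of the sum of the squared S.E.M. values,
and the observed gap can be compared against it. The variable names are
mine.

```python
# Sketch: is the server vs stand-alone gap larger than its own standard error?
# Assumes the two sets of timings are independent samples.
import math

server_mean, server_sem = 4.655, 0.13
standalone_mean, standalone_sem = 4.696, 0.09

difference = standalone_mean - server_mean
sem_of_difference = math.sqrt(server_sem**2 + standalone_sem**2)
ratio = difference / sem_of_difference

print(f"difference={difference:.3f}s  SE of difference={sem_of_difference:.3f}s")
```

The gap (about 0.04 s) is well under one standard error of the
difference (about 0.16 s), which fits the "just about the same speed"
reading.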

-- Don

|From [hidden email]  Tue Sep  6 10:46:34 2005
..
| The java server would
|still be faster, but some people might like a native library that
|doesn't need a server.
|
|
|> -----Original Message-----
|> > The benchmarks are quite telling! How does the Perl adaptor talk to
|> > the Java library?
|>
|> Two ways  - I'm using the fastest for the tests, which is
|> running Lucene or Lucegene as a server (a simple Java socket
|> server, not HTTP or other protocols).  


