PRIMARY KEYs should be of type `bigint` [CASCADE]

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

PRIMARY KEYs should be of type `bigint` [CASCADE]

Mara Kim-2

Hello gmod-ers,

PostgreSQL uses bigint to manage SEQUENCEs (the data structure used to maintain the serial data type. This should be paired with a column of type bigint, which is denoted by the type name bigserial in a CREATE TABLE declaration. Without this, eventually INSERTs will fail with the message ERROR: integer out of range as the serial begins to return ids that are too large for the column type.

This is a change that would have far reaching implications, affecting all columns that are foreign keys.

The following are related to this issue:

  • ALTER TYPE of feature.seqlen to bigint
  • ALTER TYPE of featureloc.fmin, featureloc.fmax to bigint

I know that this may be a controversial change.  However, as the size of our data sets grow, there will be an ever increasing need for this update.  I have created a GitHub issue for this here:  https://github.com/GMOD/Chado/issues/1

--
Mara Kim

Ph.D. Candidate
Computational Biology
Vanderbilt University
Nashville, TN

------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: PRIMARY KEYs should be of type `bigint` [CASCADE]

Karl O. Pinc
On Mon, 20 Apr 2015 14:57:40 -0500
Mara Kim <[hidden email]> wrote:

> PostgreSQL uses bigint to manage SEQUENCEs (the data structure used to
> maintain the serial data type.

> Without this, eventually INSERTs will fail with
> the message ERROR: integer out of range as the serial begins to
> return ids that are too large for the column type.

What makes the problem worse is that rolling back transactions
consumes ids.

> *This is a change that would have far reaching implications,
> affecting all columns that are foreign keys.*

Agreed, and I don't see any way around the problem.
It seems like eventually bigints must be used.

The good news is that this change does not need to be
made to _every_ primary key.  The only ones that come
to mind are FEATURE.Feature_Id and FEATURELOC.Featureloc_Id.

> The following are related to this issue:
>
>    - ALTER TYPE of feature.seqlen to bigint
>    - ALTER TYPE of featureloc.fmin, featureloc.fmax to bigint

Wow.  There really are features that are more than 2 billion
long?





Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: PRIMARY KEYs should be of type `bigint` [CASCADE]

Mara Kim-2

On 4:42PM, Mon, Apr 20, 2015 Karl O. Pinc <[hidden email]> wrote:

The good news is that this change does not need to be
made to _every_ primary key.  The only ones that come
to mind are FEATURE.Feature_Id and FEATURELOC.Featureloc_Id.

That is true. We could get away with starting there. However, I first encountered this problem on featureprop, which makes sense as this table has a 1-to-many relationship with feature. We would need to change every table with either a 1-to-1 or 1-to-many relationship, which would pull in all the feature_* linker tables as well as analysisfeature.

Wow.  There really are features that are more than 2 billion long?

Not sure about the present, but it's only a matter of time.  The largest sequenced genome is currently Pinus taeda at 23.18 Gbp over 12 chromosomes.  In the best case scenario where these are distributed equally over all chromosomes, that gives a chromosome size of 1.93 Gbp, which is uncomfortably close.  Paris japonica is known to have a genome size of ~150 Gbp.


------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: [GMOD-devel] PRIMARY KEYs should be of type `bigint` [CASCADE]

Lacey Sanderson
I would think the primary key of the nd_experiment tables would be most likely to cause an issue here (remember not all of us are feature-centric), since this table will have a record for each phenotypic measurement and each allele call stored using the natural diversity tables…

~Lacey

------------------------------------------------------
Lacey-Anne Sanderson
Bioinformaticist
Pulse Crop Breeding and Genetics
Phone: (306) 966-3208
Room 2C33 Agriculture
Department of Plant Sciences
University of Saskatchewan

On April 21, 2015 at 12:02:06 PM, Mara Kim ([hidden email]) wrote:

On 4:42PM, Mon, Apr 20, 2015 Karl O. Pinc <[hidden email]> wrote:

The good news is that this change does not need to be
made to _every_ primary key.  The only ones that come
to mind are FEATURE.Feature_Id and FEATURELOC.Featureloc_Id.

That is true. We could get away with starting there. However, I first encountered this problem on featureprop, which makes sense as this table has a 1-to-many relationship with feature. We would need to change every table with either a 1-to-1 or 1-to-many relationship, which would pull in all the feature_* linker tables as well as analysisfeature.

Wow.  There really are features that are more than 2 billion long?

Not sure about the present, but it's only a matter of time.  The largest sequenced genome is currently Pinus taeda at 23.18 Gbp over 12 chromosomes.  In the best case scenario where these are distributed equally over all chromosomes, that gives a chromosome size of 1.93 Gbp, which is uncomfortably close.  Paris japonica is known to have a genome size of ~150 Gbp.

------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF_______________________________________________
Gmod-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-devel

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: PRIMARY KEYs should be of type `bigint` [CASCADE]

Karl O. Pinc
In reply to this post by Mara Kim-2
On Tue, 21 Apr 2015 12:50:32 -0500
Mara Kim <[hidden email]> wrote:

> Wow.  There really are features that are more than 2 billion long?
>
> Not sure about the present, but it's only a matter of time.  The
> largest sequenced genome is currently *Pinus taeda* at 23.18 Gbp over
> 12 chromosomes.  In the best case scenario where these are
> distributed equally over all chromosomes, that gives a chromosome
> size of 1.93 Gbp, which is uncomfortably close.  *Paris japonica *is
> known to have a genome size of ~150 Gbp.

The bad news is that having featureloc be bigint could really
affect performance, as, potentially, does moving some of the
keys to bigint.

Step 1 would then be to measure just what the performance impact
is.  No point in making a fuss if it does not matter.
My back-of-the-pants feeling is that it will matter, especially
as you start to get into billions of rows.

If it does matter then perhaps whether to use bigints could
be an install-time choice.  At least in some cases.
I can't see at lot of current chado users ever wanting
bigints for feature offsets, as critical as it might be
for some.  (We are "feature oriented" here and
wouldn't want to see our storage needs balloon, or
increase the amount of time it takes to process
the storage.)

Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: PRIMARY KEYs should be of type `bigint` [CASCADE]

Siddhartha Basu
In reply to this post by Mara Kim-2
On Tue, 21 Apr 2015, Mara Kim wrote:

>    On 4:42PM, Mon, Apr 20, 2015A Karl O. Pinc <[hidden email]> wrote:
>
>      The good news is that this change does not need to be
>      made to _every_ primary key.A  The only ones that come
>      to mind are FEATURE.Feature_Id and FEATURELOC.Featureloc_Id.
>
>    That is true. We could get away with starting there. However, I first
>    encountered this problem on featureprop, which makes sense as this table
>    has a 1-to-many relationship with feature. We would need to change every
>    table with either a 1-to-1 or 1-to-many relationship, which would pull in
>    all the feature_* linker tables as well as analysisfeature.
>
>      Wow.A  There really are features that are more than 2 billion long?
>
>    Not sure about the present, but it's only a matter of time.A  The largest
>    sequenced genome is currentlyA Pinus taedaA at 23.18 Gbp over 12
>    chromosomes.A  In the best case scenario where these are distributed
>    equally over all chromosomes, that gives a chromosome size of 1.93 Gbp,
>    which is uncomfortably close. A Paris japonica is known to have a genome
>    size of ~150 Gbp.
That's very interesting information indeed. However, i think we have
to consider performance with that gigantic number of rows. The indexes
going to be bloated and all the queries should also be affected. I think
it will require some degree of table partitioning/sharding and maybe
even host partitioning depending on the cases. Whatever it is that will
add some administrative overhead/maintenance. Though its a change in
datatype, additional database setup/architecture is a prerequisite for
performance reason. Now since this all in theory(in my head), have
somebody actually dealt with performance related to large number of
rows. Would be interesting to know.

Secondly, it really does not matter if i start from scratch with bigint,
but what about existing installation.
Is it just executing bunch of alter datatype queries
and everything will be bigint, no issues whatsoever. Is
there is any precaution have to be taken.

thanks,
-siddhartha


> ------------------------------------------------------------------------------
> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
> Develop your own process in accordance with the BPMN 2 standard
> Learn Process modeling best practices with Bonita BPM through live exercises
> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF

> _______________________________________________
> Gmod-schema mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/gmod-schema


------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema
Reply | Threaded
Open this post in threaded view
|

Re: PRIMARY KEYs should be of type `bigint` [CASCADE]

Karl O. Pinc
On Thu, 23 Apr 2015 14:07:14 -0500
Siddhartha Basu <[hidden email]> wrote:

> Secondly, it really does not matter if i start from scratch with
> bigint, but what about existing installation.
> Is it just executing bunch of alter datatype queries
> and everything will be bigint, no issues whatsoever. Is
> there is any precaution have to be taken.

Without having done it, I'd say the main precaution
is ensuring that when a key is changed to bigint
that all the related foreign keys are also changed.

I suppose that if you miss one and try to insert a
bigint value into a int column you'll just get a
integer out of range error and the transaction will
roll back.  So, as long as you've got sensible
transactions things would be ok and you'd just
have to alter the foreign key to bigint and
retry the transaction.

So, the main concern would be having sensible
transactions in your applications in case you goof.

Given performance concerns, perhaps the best approach
is to let people convert specific foreign keys to bigint on
an as-needed basis.


Karl <[hidden email]>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Gmod-schema mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/gmod-schema