Talk:Binary GCD algorithm
This is the talk page for discussing improvements to the Binary GCD algorithm article. This is not a forum for general discussion of the article's subject.
Article policies
Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL
Archives: 1 · Auto-archiving period: 2 years
WikiProject Computing: Unassessed
Is this page obsolete?
I ran 10,000,000 random tests on Mac OS X 10.6.4 with the C code below and got the following results:
- Euclid: 6180 ms
- Stein: 9038 ms
and, compiling with option -O3:
- Euclid: 4728 ms
- Stein: 5302 ms
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#define N 10000000
#define MAX_UINTplus1 0x100000000LLU
#define uint64 unsigned long long
#define INIT_RANDOM srandom(3)
uint64 pgcdEuclide(uint64,uint64);
uint64 pgcdStein(uint64,uint64);
int main(void) {
int i;
uint64 a,b,d,t0,t1;
struct timeval tv;
// Version Euclide
INIT_RANDOM;
gettimeofday(&tv,0);
t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
for(i=0;i<N;i++){
a=(uint64)random()*MAX_UINTplus1+random();
b=(uint64)random()*MAX_UINTplus1+random();
d=pgcdEuclide(a,b);
}
gettimeofday(&tv,0);
t1 = tv.tv_sec*1000*1000+tv.tv_usec;
printf("%llu,Euclide : %llu ms\n",d,(t1-t0)/1000);
// Version Stein
INIT_RANDOM;
gettimeofday(&tv,0);
t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
for(i=0;i<N;i++){
a=(uint64)random()*MAX_UINTplus1+random();
b=(uint64)random()*MAX_UINTplus1+random();
d=pgcdStein(a,b);
}
gettimeofday(&tv,0);
t1 = tv.tv_sec*1000*1000+tv.tv_usec;
printf("%llu,Stein : %llu ms\n",d,(t1-t0)/1000);
return 0;
}
uint64 pgcdEuclide(uint64 a,uint64 b){
if(b==0) return a;
return pgcdEuclide(b,a%b);
}
uint64 pgcdStein(uint64 u,uint64 v){
int shift;
/* GCD(0,x) := x */
if (u == 0 || v == 0)
return u | v;
/* Let shift := lg K, where K is the greatest power of 2
dividing both u and v. */
for (shift = 0; ((u | v) & 1) == 0; ++shift) {
u >>= 1;
v >>= 1;
}
while ((u & 1) == 0)
u >>= 1;
/* From here on, u is always odd. */
do {
while ((v & 1) == 0) /* Loop X */
v >>= 1;
/* Now u and v are both odd, so diff(u, v) is even.
Let u = min(u, v), v = diff(u, v)/2. */
if (u < v) {
v -= u;
} else {
uint64 diff = u - v;
u = v;
v = diff;
}
v >>= 1;
} while (v != 0);
return u << shift;
}
This is terrible because mostly the random number generator is being timed. McKay (talk) 07:56, 11 July 2012 (UTC)
RNGs are super slow. Also note that you only went up to numbers below 2**64, meaning that arithmetic operations take constant time if you're on 64-bit hardware/OS/etc. To do an empirical comparison you should use a bigint library (e.g., ints in Python) and test numbers much wider than your CPU's native ALU (or implement your own arithmetic routines so the processor's ALU is not used directly) — Preceding unsigned comment added by 131.215.176.115 (talk) 18:06, 16 March 2013 (UTC)
As McKay mentioned, you are timing your RNG, so the microbenchmark is invalid. I made a more precise benchmark using the criterion statistical tool, when evaluating Stein's algorithm for use in an integer factoring application, and found it to be ~60% faster than Euclid's on random 64-bit numbers. Interestingly, this matches the theory (Stein's algorithm uses 60% fewer bit operations than Euclid's[1]). Nicoonoclaste (talk) 10:06, 3 November 2020 (UTC)
References
- ^ Akhavi, Ali; Vallée, Brigitte (2000), "Average Bit-Complexity of Euclidean Algorithms", Proceedings of ICALP 2000, Lecture Notes in Computer Science 1853: 373–387, CiteSeerX 10.1.1.42.7616, archived from the original on 2006-10-02
ctz version
I think the unreferenced section "Iterative version in C++ using ctz (count trailing zeros)" should also be removed. It makes vague claims of efficiency, but as far as I can tell these rely on hardware parallelism for ctz. If you look at Count trailing zeros, the typical (x86) ctz instructions have limited bit width, so the claims to improved performance suffer from the same problem as the ARM assembly example, i.e. they are not asymptotic. Someone not using his real name (talk) 14:26, 2 March 2014 (UTC)
- Totally agreed, went ahead and WP:BOLDly deleted this section. Count trailing zeros has more to do with compilers using the instruction as part of the platform optimization, rather than with a C++ implementation which may or may not end up using that instruction. — Dsimic (talk | contribs) 17:44, 2 March 2014 (UTC)
- Keep. The article confuses the notion of an algorithm with the bit-bumming aspects of programming. Using machine arithmetic types (int) to illustrate the algorithm is one thing, but it should not be taken so literally. The algorithm and not the programming is the important thing. The cycles in a hardware divide instruction are not interesting. The availability of an x86 instruction to count zeros is also uninteresting. GCD on arbitrary-precision numbers (bignums) is a different story; think million-digit numbers. Binary GCD can be done with shifts and adds -- operations that are linear in the number of bits. Division is not. It may be that bignum packages use binary GCD for that reason. CTZ applied to the last digit of a bignum would offer a reasonable speed optimization (and one could use a lookup table to implement CTZ). I think there's value. Glrx (talk) 23:27, 2 March 2014 (UTC)
People looking for a fast way to compute GCD are disappointed by this article when they find out that the C code is slower than the Euclidean algorithm. It would be nice to point out that this C code may actually be faster than the Euclidean algorithm, when replacing while((v&1)==0) v>>=1; by v>>=__builtin_ctz(v);
This makes sense even for big numbers, since the time of the new version is linear in the number of bits, while the time of the first version is multiplied by the number of trailing zeros.
Of course the actual reason why it is faster is the removal of a short loop which runs twice on average and involves a lot of mispredicted branches. But there is no need to mention it. It would just be cool to have a really fast function. — Preceding unsigned comment added by 78.196.87.18 (talk) 03:39, 12 April 2020 (UTC)
78.196.87.18, unfortunately there is no way for a C implementation to be both portable and efficient (__builtin_ctz isn't part of the standard). That's part of why I replaced the iterative implementation with a Rust one, as it lends itself to faster and more readable code while still being portable. Nicoonoclaste (talk) 10:15, 3 November 2020 (UTC)
mod 4 optimisation
I removed the following section, originally added by 64.44.80.252:
Variants
As mentioned in the Rust code above, the difference of two odd values u − v is always even, so it may be divided by two unconditionally, or a following while loop to remove factors of two may be changed to a do while loop.
A common optimization is to add an additional reduction depending on the values of u and v modulo 4. If u ≡ v (mod 4), then their difference is divisible by 4 and the loop may immediately proceed to |u − v|/4. If they are unequal, then the sum must be a multiple of 4, and one may use (u + v)/4 instead.
The implementation must be careful to avoid integer overflow during the addition. Since it is known that (u mod 4) + (v mod 4) = 4, one way to compute the result is ⌊u/4⌋ + ⌊v/4⌋ + 1. A second is to compute (u − v)/2, and if it is odd, proceed to ((u − v)/2 + v)/2.
I did it because:
- The optimizations described are either a repetition of what's just above (moving the exit condition check to the end of the loop) or subsumed by strictly more powerful ones (working mod 4, whereas the Rust version just removes all power-of-2 factors in a single step: v >>= v.trailing_zeros()).
- It is doubtful that using mod 4 leads to any speedup:
  * it prevents some optimizations mentioned in the article, like using branch-free code;
  * the conditionals involved will be hard to predict for a concrete machine's branch predictor;
  * the same number of arithmetic operations is performed.
- The claim is not supported by any citation, and I couldn't find anything supporting it with a quick web search, so this is looking like original research.