
Talk:Binary GCD algorithm

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Nicoonoclaste (talk | contribs) at 11:09, 4 November 2020 (Separate references from the body of the talk page).

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
This article has not yet received a rating on the project's importance scale.

ARM implementation

The ARM implementation lacks the initial u = 0 test. The code also does not appear to make good use of the conditional instruction set, so I have some optimizations to suggest: — Preceding unsigned comment added by 80.175.250.218 (talk) 17:09, 31 October 2005 (UTC)[reply]

remove_twos_loop optimization

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
    orr     r12, r0, r1       @ r12 = (r0 | r1)
    tst     r12, #1           @ Test least significant bit of (r0 | r1)
    bne     gcd_loop          @ If nonzero, r0 or r1 is odd, jump into middle of next loop
remove_twos_loop:
    mov     r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    mov     r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    add     r3, r3, #1        @ Increment r3
    tst     r0, #1            @ Check least significant bit of r0
    bne     gcd_loop          @ If nonzero, r0 is odd, jump into middle of next loop
    tst     r1, #1            @ Check least significant bit of r1
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

optimization 1

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
    orr     r12, r0, r1       @ r12 = (r0 | r1)
remove_twos_loop:
    tst     r12, #1           @ Test least significant bit of (r0 | r1)
    moveq   r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    moveq   r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    moveq   r12, r12, lsr #1  @ Shift r12 right, dividing it by 2
    addeq   r3, r3, #1        @ Count the common factor of 2 in r3
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

optimization 2

gcd:
    @ Arguments arrive in registers r0 and r1
    mov     r3, #0            @ Initialize r3, the power of 2 in the result
remove_twos_loop:
    tst     r0, #1            @ Test least significant bit of r0
    tsteq   r1, #1            @ If r0 was even, also test least significant bit of r1
    moveq   r0, r0, lsr #1    @ Shift r0 right, dividing it by 2
    moveq   r1, r1, lsr #1    @ Shift r1 right, dividing it by 2
    addeq   r3, r3, #1        @ Count the common factor of 2 in r3
    beq     remove_twos_loop  @ If zero, r0 and r1 still even, keep looping

— Preceding unsigned comment added by 80.175.250.218 (talk) 17:09, 31 October 2005 (UTC)[reply]

Is that page obsolete?

I ran 10 000 000 random tests on Mac OS X 10.6.4 with the C code below and got the following results:

  • Euclid: 6180 ms
  • Stein: 9038 ms

and compiling with option -O3

  • Euclid: 4728 ms
  • Stein: 5302 ms
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define N 10000000
#define MAX_UINTplus1 0x100000000LLU
#define uint64 unsigned long long
#define INIT_RANDOM srandom(3)

uint64 pgcdEuclide(uint64,uint64);
uint64 pgcdStein(uint64,uint64);


int main(void) {
	int i;
	uint64 a,b,d,t0,t1;
	struct  timeval tv;
	
	// Version Euclide
	INIT_RANDOM;
	gettimeofday(&tv,0);
	t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
	for(i=0;i<N;i++){
		a=(uint64)random()*MAX_UINTplus1+random();
		b=(uint64)random()*MAX_UINTplus1+random();
		d=pgcdEuclide(a,b);
	}	
	gettimeofday(&tv,0);
	t1 = tv.tv_sec*1000*1000+tv.tv_usec;	
	printf("%llu,Euclide : %llu ms\n",d,(t1-t0)/1000);
	
	// Version Stein
	INIT_RANDOM;
	gettimeofday(&tv,0);
	t0 = tv.tv_sec* 1000 *1000+tv.tv_usec;
	for(i=0;i<N;i++){
		a=(uint64)random()*MAX_UINTplus1+random();
		b=(uint64)random()*MAX_UINTplus1+random();		
		d=pgcdStein(a,b);
	}	
	gettimeofday(&tv,0);
	t1 = tv.tv_sec*1000*1000+tv.tv_usec;	
	printf("%llu,Stein : %llu ms\n",d,(t1-t0)/1000);	
	
	return 0;
}

uint64 pgcdEuclide(uint64 a,uint64 b){
	if(b==0) return a;
	return pgcdEuclide(b,a%b);
}

uint64 pgcdStein(uint64 u,uint64 v){
		int shift;
		
		/* GCD(0,x) := x */
		if (u == 0 || v == 0)
			return u | v;
		
		/* Let shift := lg K, where K is the greatest power of 2
		 dividing both u and v. */
		for (shift = 0; ((u | v) & 1) == 0; ++shift) {
			u >>= 1;
			v >>= 1;
		}
		
		while ((u & 1) == 0)
			u >>= 1;
		
		/* From here on, u is always odd. */
		do {
			while ((v & 1) == 0)  /* Loop X */
				v >>= 1;
			
			/* Now u and v are both odd, so diff(u, v) is even.
			 Let u = min(u, v), v = diff(u, v)/2. */
			if (u < v) {
				v -= u;
			} else {
				uint64 diff = u - v;
				u = v;
				v = diff;
			}
			v >>= 1;
		} while (v != 0);
		
		return u << shift;
}

This is terrible because it is mostly the random number generator that is being timed. McKay (talk) 07:56, 11 July 2012 (UTC)[reply]

RNGs are super slow. Also note that you only went up to numbers of size 2 ** 64, meaning that arithmetic operations take constant time on 64-bit hardware. To do an empirical comparison you should use a bigint library (e.g., ints in Python) and test numbers much wider than your CPU's native ALU (or implement your own arithmetic routines so the processor's instructions are not used directly). — Preceding unsigned comment added by 131.215.176.115 (talk) 18:06, 16 March 2013 (UTC)[reply]

As McKay mentioned, you are timing your RNG, so the microbenchmark is invalid. I made a more precise benchmark using the criterion statistical tool, when evaluating Stein's algorithm for use in an integer factoring application, and found it to be ~60% faster than Euclid's on random 64-bit numbers. Interestingly, this matches the theory: Stein's algorithm uses 60% fewer bit operations than Euclid's.[1] Nicoonoclaste (talk) 10:06, 3 November 2020 (UTC)[reply]

ctz version

I think the unreferenced section "Iterative version in C++ using ctz (count trailing zeros)" should also be removed. It makes vague claims of efficiency, but as far as I can tell these rely on hardware parallelism for ctz. If you look at Count trailing zeros, the typical (x86) ctz instructions have limited bit width, so the claims to improved performance suffer from the same problem as the ARM assembly example, i.e. they are not asymptotic. Someone not using his real name (talk) 14:26, 2 March 2014 (UTC)[reply]

Totally agreed, went ahead and WP:BOLDly deleted this section. Count trailing zeros has more to do with compilers using the instruction as part of the platform optimization, rather than with a C++ implementation which may or may not end up using that instruction. — Dsimic (talk | contribs) 17:44, 2 March 2014 (UTC)[reply]
  • Keep. The article confuses the notion of an algorithm with the bit-bumming aspects of programming. Using machine arithmetic types (int) to illustrate the algorithm is one thing, but it should not be taken so literally. The algorithm and not the programming is the important thing. The cycles in a hardware divide instruction are not interesting. The availability of an x86 instruction to count zeros is also uninteresting. GCD on arbitrary-precision numbers (bignums) is a different story; think million-digit numbers. Binary GCD can be done with shifts and adds -- operations that are linear in the number of bits. Division is not. It may be that bignum packages use binary GCD for that reason. CTZ applied to the last digit of a bignum would offer a reasonable speed optimization (and one could use a lookup table to implement CTZ). I think there's value. Glrx (talk) 23:27, 2 March 2014 (UTC)[reply]
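For illustration, a lookup-table CTZ of the sort suggested above might look like this in C (a sketch; `ctz8` and `ctz64` are hypothetical names, and the 64-bit word stands in for one bignum limb):

```c
#include <stdint.h>

static int ctz8[256];   /* trailing-zero counts for one byte */

/* Fill the table once; a zero byte contributes all 8 of its bits. */
static void ctz_init(void)
{
    ctz8[0] = 8;
    for (int i = 1; i < 256; i++) {
        int n = 0;
        while (((i >> n) & 1) == 0)
            n++;
        ctz8[i] = n;
    }
}

/* Count trailing zeros of a 64-bit limb byte by byte, with no special
   CPU instruction required. */
static int ctz64(uint64_t x)
{
    if (x == 0)
        return 64;
    int n = 0;
    while ((x & 0xff) == 0) {   /* skip whole zero bytes */
        x >>= 8;
        n += 8;
    }
    return n + ctz8[x & 0xff];
}
```

In a bignum binary GCD this would be applied to the least significant nonzero digit, as suggested above.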

People looking for a fast way to compute GCD are disappointed by this article when they find out that the C code is slower than the Euclidean algorithm. It would be nice to point out that this C code may actually be faster than the Euclidean algorithm when while((v&1)==0) v>>=1; is replaced by v>>=__builtin_ctz(v);. This makes sense even for big numbers, since the time of the new version is linear in the number of bits, while the time of the first version also grows with the number of trailing zeros. Of course, the actual reason it is faster is the removal of a short loop that runs twice on average and involves a lot of mispredicted branches. But there is no need to mention that. It would just be cool to have a really fast function. — Preceding unsigned comment added by 78.196.87.18 (talk) 03:39, 12 April 2020 (UTC)[reply]
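For concreteness, the suggested variant might look like this (a sketch, not the article's code; __builtin_ctzll is a GCC/Clang extension whose result is undefined for a zero argument, hence the early returns):

```c
#include <stdint.h>

/* Binary GCD with the short shift loops replaced by count-trailing-zeros,
   as suggested above. Uses the non-standard __builtin_ctzll. */
uint64_t gcd_ctz(uint64_t u, uint64_t v)
{
    if (u == 0) return v;
    if (v == 0) return u;

    int shift = __builtin_ctzll(u | v);  /* common factors of 2 */
    u >>= __builtin_ctzll(u);            /* make u odd */
    do {
        v >>= __builtin_ctzll(v);        /* make v odd */
        if (u > v) {                     /* keep u <= v */
            uint64_t t = u; u = v; v = t;
        }
        v -= u;                          /* even, since u and v were odd */
    } while (v != 0);

    return u << shift;
}
```

For example, gcd_ctz(48, 18) returns 6.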

78.196.87.18, unfortunately there is no way for a C implementation to be both portable and efficient (__builtin_ctz isn't part of the standard). That's part of why I replaced the iterative implementation with a Rust one, as it lends itself to faster and more readable code while still being portable. Nicoonoclaste (talk) 10:15, 3 November 2020 (UTC)[reply]

References

  1. ^ Akhavi, Ali; Vallée, Brigitte (2000), "Average Bit-Complexity of Euclidean Algorithms", Proceedings of ICALP 2000, Lecture Notes in Computer Science 1853: 373–387, CiteSeerX 10.1.1.42.7616, archived from the original on 2006-10-02.