From: Munafo@prepress.pps.com
Newsgroups: sci.fractals,alt.fractals,comp.lang.pascal.delphi.misc
Subject: Re: Integer mandelbrot SLOWER than real code? (Win32 MulDiv)
Date: Tue, 04 Feb 1997 18:11:40 +0000
Organization: PrePRESS SOLUTIONS
Message-ID: <32F77BDC.3AD@prepress.pps.com>
References: <32f397fd.13420729@news1> <32f64ce6.2016681@news.mira.net.au>

Roger Riordan wrote:
> If you want anything remotely resembling performance you must write the
> kernel in assembler.
[...]
> Last time I tried (on a 120Mhz pentium) it took about 1 usec per
> pass.

It seems that on a PowerPC you can do a lot better than this without
even using assembler. There are enough registers to avoid memory
references entirely. I would think this would be true on a Pentium too
since the Pentium is supposed to be as fast.

On a 66 MHz PowerPC, I use the following loop and get 3.8 million
iterations per second (which is 1.9 million times through the loop
because it does two iterations each time through). That's 26.6
megaFLOPS since one Mandelbrot iteration is 7 floating point
operations (3.8 million x 7 = 26.6 million). On a 200 MHz PowerPC it
would be about 3 times faster, in line with the clock ratio, since
there are no memory accesses and the instructions fit in cache.

// I had to translate this a bit to remove the specifics of my
// program, so there might be some minor errors.
//
long mandel_its(float cr, float ci, long imax)
{
    float zr, zi, zr2, zi2, zr3, zi3;
    float k2, zmax2;
    long i;

    // Init constants
    k2 = 2.0f;
    zmax2 = 4.0f;

    // cr and ci are the point to iterate
    zr = cr;
    zi = ci;

    // imax is the number of iterations. Its assignment did not survive
    // the translation, so it is taken as a parameter here.

    // zr2 and zi2 hold the intermediate iterate. Start them at zero so
    // the back-up test after the loop is defined if the loop never runs.
    zr2 = 0.0f;
    zi2 = 0.0f;

    // set up loop (it expects zr3 and zi3 to already be set)
    zr3 = zr * zr;
    zi3 = zi * zi;
    i = 0;

    while ((i < imax) && (zr3 + zi3 < zmax2)) {
        i += 2;

        // Each of these statements becomes a single fmul, fmadd or
        // fmsub instruction. No result gets used right away.
        // This allows partial interleaving in the FP pipeline.
        zr2 = zi*zi - cr;       // fmsub
        zi2 = zr*zi;            // fmul
        zr2 = zr*zr - zr2;      // fmsub
        zi2 = k2*zi2 + ci;      // fmadd

        zr = zi2*zi2 - cr;      // fmsub
        zi = zr2*zi2;           // fmul
        zr = zr2*zr2 - zr;      // fmsub
        zi = k2*zi + ci;        // fmadd

        zi3 = zi * zi;          // fmul
        zr3 = zr * zr;          // fmul
    }

    // now decrease the iterations and back up if we find that it
    // overflowed on the z2 iteration. (Uses >= so the bound matches
    // the loop test above.)
    if (zr2*zr2 + zi2*zi2 >= zmax2) {
        i--;
        zr = zr2;
        zi = zi2;
    }

    return i;
}

You can make it run faster by unrolling the loop even more, and you
can even iterate two points at once to keep the FP pipeline full;
combining both ideas I've gotten over 3 times the speed: 12.8 million
Mandelbrot iterations per second and 89 megaFLOPS. That routine is a
little too long to list here (-:

- Robert Munafo
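The "iterate two points at once" idea mentioned above can be sketched
roughly as follows. This is not the longer routine referred to in the
post, only a minimal illustration under stated assumptions: the name
mandel_its_pair, its parameters, and the live flags are all
hypothetical, and it does one iteration per pass instead of two. The
point it shows is that the arithmetic for the two points never shares
a value, so a pipelined FPU has independent work to overlap.

```c
#include <assert.h>

// Hypothetical helper (not from the post): counts iterations for two
// points A = (cr_a, ci_a) and B = (cr_b, ci_b) in lockstep.
void mandel_its_pair(float cr_a, float ci_a, float cr_b, float ci_b,
                     long imax, long *its_a, long *its_b)
{
    float ar = cr_a, ai = ci_a;     // z for A, started at c as in the post
    float br = cr_b, bi = ci_b;     // z for B
    long na = 0, nb = 0;            // iteration counts
    int a_live = 1, b_live = 1;     // cleared for good once a point escapes
    long i;

    assert(imax >= 0);

    for (i = 0; i < imax && (a_live || b_live); i++) {
        // The A and B expressions never read each other's values, so
        // the compiler and FPU are free to overlap them.
        float ar2 = ar * ar,  br2 = br * br;
        float ai2 = ai * ai,  bi2 = bi * bi;
        float nar = ar2 - ai2 + cr_a;
        float nbr = br2 - bi2 + cr_b;
        float nai = 2.0f * ar * ai + ci_a;
        float nbi = 2.0f * br * bi + ci_b;

        a_live = a_live && (ar2 + ai2 < 4.0f);
        b_live = b_live && (br2 + bi2 < 4.0f);

        // An escaped point keeps its last z and stops counting.
        ar = a_live ? nar : ar;   ai = a_live ? nai : ai;   na += a_live;
        br = b_live ? nbr : br;   bi = b_live ? nbi : bi;   nb += b_live;
    }

    *its_a = na;
    *its_b = nb;
}
```

For example, calling it with the points (0, 0) and (2, 2) and imax =
100 gives 100 for the first point, which never escapes, and 0 for the
second, which already starts outside the radius-2 circle. Whether this
shape actually beats the single-point loop depends on the compiler and
the register budget, so it would need to be timed on the target
machine.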