y-cruncher - A Multi-Threaded Pi-Program

From a high-school project that went a little too far...

By Alexander J. Yee

Announcing 5 trillion digits of Pi!
New World Record for both Desktop and Supercomputer!!!

 

(Last updated: September 4, 2010)

 


 

The first scalable multi-threaded Pi-benchmark for multi-core systems...

 

Against the Big Guns...

Faster than SuperPi on single-core...

Faster than PiFast 4.3 on dual-core...

Faster than QuickPi 4.5 on quad-core...

 

1 billion digits of Pi in 5 minutes on 6-core Core i7 @ 4.26 GHz

 

See the official XtremeSystems thread for more benchmarks.

 

Latest Version:

Windows: Version 0.5.4 Build 9148 (fix 1) (Released: August 5, 2010)

Linux: Version 0.5.4 Build 9157 (fix 1) (Released: August 28, 2010)

 

 

Changes to v0.5.4.9157 for Linux

 

Starting from v0.5.2, y-cruncher allows Pi computations of up to 10 trillion digits.

However, y-cruncher has only been tested up to 5 trillion digits. (Credit: Shigeru Kondo)

 

Records Set by y-cruncher:

World Record Size Computations

Date Announced: Date Completed: Source: Who: Constant: Decimal Digits: Time: Computer:
August 2, 2010 August 2, 2010 Source Shigeru Kondo &
Alexander Yee
Pi 5,000,000,000,000 Compute:  90 days
Verify:  64 hours
2 x Intel Xeon X5680 @ 3.33 GHz
96 GB DDR3 @ 1066 MHz
16 x 2 TB
July 8, 2010 July 8, 2010 Source Alexander Yee Golden Ratio 1,000,000,000,000 Compute:  114 hours (4.7 days)
Verify:  ~7 days*
*Not a continuous run.
"Nagisa"
2 x Intel Xeon X5482 @ 3.2 GHz
64 GB DDR2 FB-DIMM
1.5 TB (Boot + Output)
4 x 1 TB (2 x 2 RAID0) + 6 x 2 TB
July 5, 2010 July 5, 2010 Source Shigeru Kondo e 1,000,000,000,000 Compute: 224 hours (9.3 days)
Verify: 219 hours (9.1 days)
Intel Core i7 980X @ 3.33 GHz
12 GB DDR3
2 TB (Boot + Output)
8 x 1 TB (Computation)
March 22, 2010 March 22, 2010 Source Shigeru Kondo Square Root of 2 1,000,000,000,000 Compute:  193 hours (8.0 days)
Verify:  98 hours (4.1 days)
Core i7 975 @ 4 GHz - 12GB
8 x 1 TB HDs
2 x Xeon W5590 - 144GB
16 x 2 TB HDs
February 21, 2010 February 20, 2010 Source Alexander Yee e 500,000,000,000 Compute and Verify:
307 hours (12.8 days)
"Ushio"
April 16, 2009 April 16, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 31,026,000,000 Compute:  178 hours (7.4 days)
Verify:  221 hours (9.2 days)
"Nagisa"
March 13, 2009 March 13, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 29,844,489,545 Compute:  205 hours (8.5 days)
Verify:  269 hours (11.2 days)
"Nagisa"
February 28, 2009 Source Alexander Yee &
Raymond Chan
Log(10) 31,026,000,000 Compute and Verify:
40 hours (1.7 days)
"Nagisa"
February 15, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 31,026,000,000 Compute:  45 hours (1.9 days)
Verify:  44 hours (1.8 days)
"Nagisa"
February 4, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 31,026,000,000 Compute:  24 hours, 10 minutes
Verify:  15 hours, 58 minutes
"Nagisa"
January 31, 2009 January 31, 2009 Source Alexander Yee &
Raymond Chan
Catalan's Constant 15,510,000,000 Compute:  88 hours (3.7 days)
Verify:  100 hours (4.2 days)
"Nagisa"
January 21, 2009 January 21, 2009 Source Alexander Yee &
Raymond Chan
Zeta(3) - Apery's Constant 15,510,000,000 Compute:  20 hours, 18 minutes
Verify:  21 hours, 1 minute
"Nagisa"
January 18, 2009 January 18, 2009 Source Alexander Yee &
Raymond Chan
Euler-Mascheroni Constant 14,922,244,771 Compute:  96 hours (4 days)
Verify:  134 hours (5.5 days)
"Nagisa"
January 7, 2009 Source Alexander Yee &
Raymond Chan
Log(2) 15,500,000,000 Compute:  12 hours, 34 minutes
Verify:  8 hours, 20 minutes
"Nagisa"

Note that starting from v0.5.2, the computation limits of the program are no longer locked below the current world records. So barring any bugs, anyone with sufficient resources will be able to break these records.


The Storyline:

2005 - 2006:

The roots of y-cruncher date all the way back to my senior year in high school in my AP Computer Science class.

It started from a class project which was to write a multi-precision arithmetic library in Java that would support addition, subtraction, multiplication, and division.
After the assignment was due, I continued working on the library and named it "BigNumber". Some of the new features that were added were square roots, trig-functions, constants, etc...

 

June - October 2006:

After graduation, I began to take speed seriously. Multiplication was completely rewritten in C and linked back to BigNumber using JNI. This was around the time that I began to realize that parts of "BigNumber" were fairly fast - comparable to Mathematica. In particular, the function for computing the Euler-Mascheroni Constant was faster than that of Mathematica 5. By October, I realized that the world record of 108 million digits for the Euler-Mascheroni Constant was within reach.

 

November 2006:

With the goal of breaking the world record of 108 million digits for Euler's Constant in mind, November was spent entirely on implementing and optimizing the algorithms needed for extremely high precision arithmetic. I also upgraded my laptop from 512 MB to 1.5 GB of RAM, as that would be the computer that I would use for such a computation.

 

December 2006:

During finals week, with winter break approaching, BigNumber was used to compute 116 million digits of Euler's Constant on my laptop for what appeared to be a new world record. The computation ran for 38.5 hours and the verification ran for 48 hours. It required 1.8 GB of memory.

 

Early 2007:

Lots of media attention... As well as a lot of hate mail saying that it was not a world record. (S. Kondo and S. Pagliarulo already had 2 billion digits, but they hadn't announced it.)

During this time I also made a number of minor improvements to BigNumber, though all work had pretty much halted by April because of the release of a number of new video games.

 

Sometime between November - December 2007:

In the middle of one of my boring lecture classes - Lightbulb!!! The Hybrid NTT algorithm for multiplication was born. This effectively renewed my interest in this area.

 

2008:

BigNumber was rewritten from scratch in C++ and renamed y-cruncher.

("y" is gamma, the symbol for the Euler-Mascheroni Constant - but I still pronounce it as "y")

 


 

January 2009 (back from winter break):

With Nagisa back up and running, Raymond and I managed to break the world records for Log(2) and the Euler-Mascheroni Constant. (Main Article)

And with that, we released the first public version of y-cruncher.

 

By the end of the month, we had also taken the world records for Apery's Constant and Catalan's Constant.

No celebration though, since neither of us could legally drink yet...

 

 

Current:

Currently, y-cruncher is just a side hobby. I no longer work on it nearly as much as I did in 2008.

Gaming and school-work now take priority.

 

The build numbers were started when the rewrite began back in January 2008. During the 9 months of active development before the first public release, there were 6000 builds. But during the 6 active months of development from January to October 2009, there were fewer than 1700 builds. (Again, no work was done over the summer because of an internship.)

 

Features:

Aside from computing π and other constants, y-cruncher is great for stress testing 64-bit systems with lots of RAM.

Download:

Known Issues
(as of current release)

Version History:

Main Page: y-cruncher - Version History

 

Algorithms:

If you're interested in what formulas and algorithms y-cruncher uses:

Main Page: y-cruncher - Language and Algorithms

 




Windows: y-cruncher v0.5.4.9148 (fix 1).zip (7.15 MB)
Linux: y-cruncher v0.5.4.9157 (fix 1) (x64 SSE3 - Linux).out (1.91 MB)
 

System Requirements:


Please do not link directly to the file downloads as there may be newer versions.
Just link to http://www.numberworld.org/y-cruncher/#Download instead. Thanks!

 

 

 

 

 

 

 

 

Performance:

y-cruncher is the first efficient and publicly available Pi-calculator that can sustain a near 100% CPU load on multi-core computers.
There are other multi-threaded Pi-programs that can achieve high CPU usage, but few of them can sustain it through an entire Pi computation.

 

Below is a typical CPU utilization graph of y-cruncher when computing 1 billion digits of Pi across 8 cores.

 

y-cruncher uses less memory than most other Pi-programs. It is also able to multi-thread WITHOUT significantly increasing memory usage.

 

Benchmarks:

Comparison Chart: (Last updated: June 9, 2010)

 

All times in seconds.

All benchmarks were done using the fastest binary with the fastest achieved settings for the system they were run on.

v0.5.3 and v0.5.4 are exactly the same speed. So results are directly comparable.

Number of Digits | Core 2 Quad (8 MB cache) 2.4 GHz | Phenom II X4 3.2 GHz [1] | Core i7 2.67 GHz [2] | Core i7 4.0 GHz [3] | 4 x Opteron (Barcelona) 2.31 GHz [4] | 2 x Xeon (Harpertown) 3.2 GHz | 2 x Xeon (Westmere-EP) 3.33 GHz [5]
Version | v0.5.3 | v0.5.4 | v0.5.4 | v0.5.4 | v0.5.3 | v0.5.4 | v0.5.3
1,000,000 | 0.566 | - | 0.390 | 0.259 | - | 0.353 | -
10,000,000 | 5.286 | - | 3.667 | 2.466 | - | 3.371 | -
100,000,000 | 68.95 | - | 43.60 | 29.53 | 35.09 | 30.81 | 16.29
1,000,000,000 | 990.0 | - | 619.4 | 424.3 | 468.1 | 395.9 | 202.5
10,000,000,000 | - | - | - | - | - | 5,339 | 2,721

[1] This was actually a 2.8 GHz Phenom II X3. It was unlocked to 4 cores and then overclocked to 3.2 GHz. Credit to Raymond Chan.

[2] Intel Turbo Boost Technology increases actual operating frequency to 2.8 GHz.

[3] Overclocked from 2.67 GHz. Actual operating frequency after Turbo Boost is 4.2 GHz.

[4] Credit to skycrane from XtremeSystems.

[5] Intel Turbo Boost Technology increases actual operating frequency to 3.46 GHz. Credit to Shigeru Kondo.
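
The chart also gives a feel for how run time grows with size. As a quick worked example, the Python sketch below takes the times from the last column (2 x Xeon Westmere-EP) and computes how much longer each tenfold increase in digits takes. The function and variable names are made up for this illustration; only the numbers come from the chart.

```python
# Times in seconds from the last column of the chart (2 x Xeon Westmere-EP).
times = {
    100_000_000: 16.29,
    1_000_000_000: 202.5,
    10_000_000_000: 2721.0,
}


def growth_factors(times_by_digits):
    """Ratio of each time to the time for the next smaller size."""
    ordered = [times_by_digits[d] for d in sorted(times_by_digits)]
    return [later / earlier for earlier, later in zip(ordered, ordered[1:])]


for factor in growth_factors(times):
    print(f"10x the digits took {factor:.1f}x as long")
```

Each tenfold increase costs somewhat more than ten times the run time on this machine, so growth is a little faster than linear.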

 


 

 

Multi-core Scaling: How much faster is multi-threading?

Processor(s): CPU Frequency*: Memory: Memory Frequency: Multi-Threading Benefit: View Benchmark Data:
Intel Core 2 Quad Q6600 @ 2.4 GHz 2.4 GHz 6 GB DDR2 800 MHz 3.570 x View Benchmarks
Intel Core i7 920 @ 2.67 GHz 3.34 GHz (3.5 GHz Turbo Boost) 12 GB DDR3 1336 MHz 4.104 x View Benchmarks
2 x Intel Xeon X5482 Harpertown @ 3.2 GHz 3.2 GHz 64 GB DDR2 800 MHz 6.458 x View Benchmarks

 

*Note that CPU frequencies higher than the stock frequency imply overclocking.
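
One way to read the "Multi-Threading Benefit" column is as a speedup, which can be divided by the number of physical cores to get a rough parallel efficiency. The Python sketch below assumes that the benefit is the single-threaded time divided by the multi-threaded time (the table does not spell this out), and it uses physical core counts of 4, 4, and 8 for the three systems. The names are hypothetical.

```python
# (system, physical cores, multi-threading benefit from the table above)
systems = [
    ("Core 2 Quad Q6600", 4, 3.570),
    ("Core i7 920", 4, 4.104),
    ("2 x Xeon X5482", 8, 6.458),
]


def efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup / cores


for name, cores, speedup in systems:
    print(f"{name}: {efficiency(speedup, cores):.1%} of linear scaling")
```

A value above 100% is possible when a core can run two threads at once, which is presumably what happens with Hyper-Threading on the Core i7.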

 


 

 

Random Screenshots: (from my test machines)

Screenshot | Processor | Memory | Hard Drives
Pi - 500 million digits (6 minutes, 40 seconds) | 2.8 GHz Phenom II X3 720 Deneb (Unlock to 4 Cores + Overclock to 3.2 GHz) | 4 GB DDR3 1333 MHz (dual channel) | -
Pi - 1 billion digits (7 minutes) | 2.67 GHz Core i7 920 Bloomfield (Overclock to 4.2 GHz) | 12 GB DDR3 1200 MHz (triple channel) | -
Pi - 100 billion digits (28 hours) | Dual 3.2 GHz Quad-Core Xeon X5482 Harpertown | 64 GB DDR2 FB-DIMM 800 MHz (quad channel) | 8 x 2 TB Hitachi Deskstar

 

Fastest Times:

(Last updated: May 17, 2010)

 

All times in seconds.

 

Green indicates that the benchmark has been validated.

Red indicates that the benchmark was either not validated, or no validation was provided.

 

In the future, I may decide to allow only validated benchmarks on this list.

As of the current release, only Ram-Only Pi computations done using the Benchmark feature will be validated. However, starting from version 0.5.2, all computations have validation. This includes both swap modes as well as all the other constants.

Desktop (Limit One Processor)
Digits Time Version Computer Credit
25,000,000 4.654 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.66 GHz - on Air 6 GB DDR3 Shigeru Kondo
50,000,000 9.740 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.53 GHz - on Air 6 GB DDR3 Shigeru Kondo
100,000,000 21.945 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
250,000,000 61.246 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
500,000,000 135.432 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
1,000,000,000 312.441 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
2,500,000,000 956.499 v0.4.4 x64 SSE4.1 Intel Xeon X5650 @ 3.82 GHz (4.2 GHz Turbo Boost) ? GB DDR3 jcool @ XtremeSystems
5,000,000,000 3,922.15 v0.5.2 x64 SSE4.1 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air
Hard Drives: 4 x 2 TB
12 GB DDR3 Alexander Yee
10,000,000,000 2.953 hours v0.5.2 x64 SSE4.1 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air
Hard Drives: 4 x 2 TB
12 GB DDR3 Alexander Yee
25,000,000,000 9.591 hours v0.5.2 x64 SSE4.1 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air
Hard Drives: 4 x 2 TB
12 GB DDR3 Alexander Yee
50,000,000,000 22.343 hours v0.5.2 x64 SSE4.1 Intel Core i7 920 @ 3.34 GHz (3.5 GHz Turbo Boost) - on Air
Hard Drives: 4 x 2 TB
12 GB DDR3 Alexander Yee
100,000,000,000 40.025 hours v0.5.2 x64 SSE4.1 Intel Core i7 975 @ 4.00 GHz
Hard Drives: 8 x 1 TB
12 GB DDR3 Shigeru Kondo
250,000,000,000 - - - - - -
Desktop (Limit One Processor)
Digits Time Version Computer Credit
1M 1,048,576 0.221 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
2M 2,097,152 0.411 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
4M 4,194,304 0.776 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
8M 8,388,608 1.583 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.66 GHz - on Air 6 GB DDR3 Shigeru Kondo
16M 16,777,216 3.129 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.66 GHz - on Air 6 GB DDR3 Shigeru Kondo
32M 33,554,432 6.273 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.53 GHz - on Air 6 GB DDR3 Shigeru Kondo
64M 67,108,864 13.362 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
128M 134,217,728 30.452 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
256M 268,435,456 66.652 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
512M 536,870,912 145.808 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
1G 1,073,741,824 334.578 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
2G 2,147,483,648 749.534 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.13 GHz - on Air 12 GB DDR3 Shigeru Kondo
4G 4,294,967,296 - -   - - -

 

Any Computer (No Processor Limit)
Digits Time Version Computer Credit
25,000,000 3.849 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
50,000,000 7.585 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
100,000,000 14.512 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
250,000,000 38.582 v0.5.3 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
500,000,000 79.311 v0.4.4 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
1,000,000,000 174.470 v0.4.4 x64 SSE4.1 2 x Intel Xeon X5680 @ 4.3 GHz 12 GB DDR3 sRHunt3r @ XtremeSystems
2,500,000,000 552.673 v0.5.2 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
5,000,000,000 1,143.750 v0.5.2 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
10,000,000,000 2,276.566 v0.5.2 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
25,000,000,000 1.720 hours v0.5.3 x64 SSE4.1 4 x Intel Xeon X7560 @ 2.27 GHz (HT Off) 128 GB DDR3 Daniel Ghidali
50,000,000,000 14.807 hours v0.5.3 x64 SSE4.1 2 x Intel Xeon X5482 @ 3.2 GHz
Hard Drives: 1.5 TB + 4 x 1 TB
64 GB DDR2 Alexander Yee
100,000,000,000 17.283 hours v0.5.3 x64 SSE4.1 2 x Intel Xeon X5650 @ 2.66 GHz
Hard Drives: 16 x 2 TB
144 GB DDR3 Shigeru Kondo
250,000,000,000 83.586 hours v0.5.2 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz
Hard Drives: 8 x 2 TB
144 GB DDR3 Shigeru Kondo
500,000,000,000 172.396 hours v0.5.2 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz
Hard Drives: 16 x 2 TB
144 GB DDR3 Shigeru Kondo
1,000,000,000,000 12.260 days v0.5.3 x64 SSE4.1 2 x Intel Xeon X5650 @ 2.66 GHz
Hard Drives: 16 x 2 TB
96 GB DDR3 Shigeru Kondo

Any Computer (No Processor Limit)
Digits Time Version Computer Credit
1M 1,048,576 0.221 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
2M 2,097,152 0.411 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
4M 4,194,304 0.776 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 5.21 GHz - on Dry Ice 6 GB DDR3 Shigeru Kondo
8M 8,388,608 1.583 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.66 GHz - on Air 6 GB DDR3 Shigeru Kondo
16M 16,777,216 3.129 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.66 GHz - on Air 6 GB DDR3 Shigeru Kondo
32M 33,554,432 6.273 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.53 GHz - on Air 6 GB DDR3 Shigeru Kondo
64M 67,108,864 13.362 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
128M 134,217,728 30.452 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.40 GHz - on Air 6 GB DDR3 Shigeru Kondo
256M 268,435,456 66.652 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
512M 536,870,912 145.808 v0.5.3 x64 SSE4.1 Intel Core i7 980X @ 4.26 GHz - on Air 6 GB DDR3 Shigeru Kondo
1G 1,073,741,824 328.764 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 6 GB DDR3 Dave Hunt
Movieman @ XtremeSystems
2G 2,147,483,648 731.146 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
4G 4,294,967,296 1,595.959 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
8G 8,589,934,592 3,689.989 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
16G 17,179,869,184 8,184.953 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
32G 34,359,738,368 24,047.321 v0.4.3 x64 SSE4.1 2 x Intel Xeon W5590 @ 3.33 GHz 72 GB DDR3 Shigeru Kondo
64G 68,719,476,736 - -   - - -

*These fastest times may include unreleased betas.
Got a faster time? Let me know: a-yee@northwestern.edu


FAQ:

Q:  Is there a Linux version?
A:  Yes, it's about time. y-cruncher has been successfully ported to Linux. However, it is slower than the Windows versions due to the lack of Linux-specific tuning. Furthermore, some of the features had to be disabled in the Linux version... But it's better than nothing.

Whenever I get the time, I'll slowly bring it up to the level of the Windows versions.
 
 
Q:  Is there a distributed version that performs better on NUMA and HPC clusters?
A:  As much as I'd like to optimize the program for NUMA and HPC clusters, I do not have the resources needed to do it.

To make a distributed/NUMA-friendly version, I need to have (or have an extended period of time of exclusive/physical access to) a highly Non-Uniform Memory machine (such as a high-end quad or 8-socket Opteron server), or an HPC cluster of 4 or more identical high-end desktops with high-speed connectivity.
Both of these are well beyond my student-sized budget.

If you are wondering about my Nagisa Workstation, it is also well beyond my budget... But long story short, I basically got it for free. Don't ask...
 
 
Q:  Can you make a CUDA version?
A:  Not yet...

Here are the major reasons:

  1. GPUs currently have very poor double-precision floating-point (DP-FP) performance. y-cruncher relies heavily on DP-FP for its speed.

     

  2. GPUs are highly vectorized. y-cruncher isn't ready for massive scalable vectorization.

     

  3. CUDA currently does not support recursion. (There's a lot of multi-way recursion in y-cruncher. I'm not inclined to try rewriting them using loops.)

     

  4. y-cruncher's purpose is efficiency on large computations. GPUs simply don't have enough ram to do large computations locally.

    The bandwidth between GPU and main memory will probably be a huge bottleneck. y-cruncher is already somewhat bottlenecked by bandwidth on a CPU. On a GPU, it will be much more bottlenecked because the GPU has much more computational power and the (GPU <--> main memory) bandwidth is usually less than (CPU <--> main memory) bandwidth.

    This holds even for benchmarking. If y-cruncher were able to fully utilize a GPU, benchmarks would be extremely fast - so fast that the largest computation that could be done in ram (either GPU ram, or CPU ram - it doesn't matter) would likely be too short to be a worthwhile benchmark.

     

  5. There is currently no set-in-stone standard for GP-GPU programming.

Note that Nvidia's upcoming Fermi-based video cards will solve a number of these issues. But for now, I'll play it by ear.
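
For readers unfamiliar with the kind of recursion mentioned in point 3, here is a small Python sketch of binary splitting, a standard divide-and-conquer technique for summing series, applied to e = 1/0! + 1/1! + 1/2! + ... It is an illustration only: the function names are hypothetical and this is not y-cruncher's code. Each call splits its range in two and recurses on both halves, which is the pattern that is awkward to express with plain loops.

```python
from fractions import Fraction


def split(a, b):
    """Return integers (P, Q) with P/Q = sum of a!/k! for k = a+1 .. b."""
    if b - a == 1:
        return 1, b                    # single term: a!/(a+1)! = 1/b
    m = (a + b) // 2
    p_left, q_left = split(a, m)       # terms a+1 .. m
    p_right, q_right = split(m, b)     # terms m+1 .. b
    # Combine the halves: the right half is scaled by a!/m! = 1/q_left.
    return p_left * q_right + p_right, q_left * q_right


def e_approx(terms):
    """Sum of 1/k! for k = 0 .. terms (terms >= 1), as an exact fraction."""
    p, q = split(0, terms)
    return 1 + Fraction(p, q)


print(float(e_approx(20)))
```

The two recursive calls are independent of each other, so a multi-threaded program can hand them to different threads.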
 
Q:  How does y-cruncher compare to other programs?
A:  Below is a table of the six fastest (publicly available) programs and how y-cruncher compares to them.

Program Author(s) Description + Environments where it beats y-cruncher
TachusPi Fabrice Bellard
  • Holder of the current world record for the most digits of Pi computed on both supercomputer and desktop.
  • It is faster than y-cruncher for both single-threaded and multi-threaded computations. However, it does not appear to scale as well with multi-threading.
  • Despite being slower than TachusPi, y-cruncher is able to beat TachusPi when scalability becomes important.
    • On dual-core, TachusPi 0.9.2 is faster than y-cruncher v0.5.3.
    • On quad-core systems, TachusPi and y-cruncher are virtually tied.
      (TachusPi is faster for small computations, y-cruncher is faster for larger ones.)
    • On 8-core systems, y-cruncher pulls ahead.
Parallel GMP-Chudnovsky David Carver + Hanhong Xue + GMP team
  • This is a parallelized version of GMP-Chudnovsky using OpenMP. It appeared back in October 2008 and was improved a month later. It runs much faster on AMD processors than Intel processors. But because of its use of GMP, the true speed of this program cannot be achieved in Windows due to the lack of assembly support.
  • On AMD K10 (on Linux), the x64 version appears to beat y-cruncher (on Windows) for:
    • All computations below a million digits.
    • All single-threaded computations.
    • Dual-thread computations below a few million digits.
  • For larger computations with 4 or more cores, y-cruncher is still faster.
  • Although Parallel GMP-Chudnovsky is multi-threaded, it does not scale as well as y-cruncher. So even though it beats y-cruncher in clock-for-clock linear speed, it is slower when there are more than 2 cores.
QuickPi 4.5 Steve Pagliarulo
  • QuickPi is multi-threaded and supports x64 and SSE3.
  • Clock-for-clock, QuickPi 4.5 is faster than y-cruncher. Therefore it beats y-cruncher for single-threaded computations of less than a billion digits or so.
  • Like Parallel GMP-Chudnovsky, QuickPi 4.5 has trouble scaling up with cores. So for multi-threaded computations with 2 or more cores, y-cruncher is usually faster.
MaxxPi-Multi M. Bicak
  • MaxxPi-Multi is a relatively new program that is aimed at benchmarking. Although its purpose is not speed, it is nevertheless one of the fastest in the world. It supports SSE, multi-threading, and is the only "fast" program for computing Pi that has a GUI.
  • Clock-for-clock, MaxxPi-Multi is actually the only program in this table that is slower than y-cruncher. However, it scales decently well for the first few cores - enough to beat out GMP-Chudnovsky and PiFast 4.3 on quad-core.
  • Because MaxxPi-Multi is slower clock-for-clock, y-cruncher seems to beat it for all large computations regardless of the number of cores/threads.
GMP-Chudnovsky Hanhong Xue + GMP team
  • This is the original (single-threaded) version of GMP-Chudnovsky. It runs much faster on AMD processors than Intel processors.
  • Clock-for-clock, GMP-Chudnovsky is faster than y-cruncher. Therefore it beats y-cruncher for all single-threaded computations.
  • For multi-threaded computations with 2 or more cores, y-cruncher is still faster.
PiFast 4.3 Xavier Gourdon
  • PiFast - An old classic. It undisputedly held the title of "Fastest Program to Compute Pi" for quite a while until QuickPi passed it. Using only x86 and x87 FPU instructions, PiFast packs a very impressive speed. It is also one of the most memory efficient programs for computing Pi.
  • Clock-for-clock, PiFast 4.3 is virtually tied with y-cruncher 0.4.3 (x86). However, it is more efficient for larger computations. The cross-over point between PiFast 4.3 and single-threaded y-cruncher 0.4.3 (x86) is about 10 million digits on Core i7. Below that, y-cruncher is slightly faster. And above that, PiFast 4.3 is slightly faster.
  • Because of the virtual tie, any advantage for y-cruncher will tip the balance. Therefore any of (SSE3, x64, multi-threading) will make y-cruncher faster.

Just to clear up a few things: y-cruncher is intended to be fast, but not optimal. It is optimized for memory efficiency on large computations.
Utmost speed is not important as y-cruncher can probably be made 10 - 30% faster by relaxing memory constraints and using decimal arithmetic.
 
 
Q:  Why does y-cruncher run 4 threads on my 3-core system (8 threads on 6-core, etc...)?
A:  This is due to practical restrictions in the algorithms that y-cruncher uses: they are most efficiently parallelized when the thread count is a power of two. To deal with systems that don't have a power-of-two number of logical cores, y-cruncher simply rounds up to the next power of two.

The overhead of running extra threads is usually very small. Any load balancing issues that result from awkward thread-to-core ratios are usually resolved by further increasing the thread count. (as explained in the next Q/A)
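
The rounding rule described above can be sketched in a few lines of Python. This only illustrates "round up to the next power of two"; the function name `threads_for` is hypothetical and is not taken from y-cruncher itself.

```python
def threads_for(logical_cores):
    """Round a logical core count up to the next power of two."""
    if logical_cores < 1:
        raise ValueError("need at least one logical core")
    # (n - 1).bit_length() is the exponent of the smallest power of two >= n.
    return 1 << (logical_cores - 1).bit_length()


for cores in (3, 4, 6, 8):
    print(cores, "cores ->", threads_for(cores), "threads")
```

A count that is already a power of two is left unchanged, so a quad-core system still gets 4 threads under this rule.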
 
Q:  Why does y-cruncher create more threads than I tell it to use? Because of this I can't get dual-core benchmarks on a quad core machine since it will use all 4 cores even in dual-core mode.
A:  This is by design and is NOT a bug. Because of the nature of some of the algorithms, I find it necessary to spam threads in order to maximize multi-core efficiency. The work-around is to go to "Processor Affinity" in Task Manager and uncheck the cores that you do not want y-cruncher to use. y-cruncher does not do this automatically because it "doesn't know which logical cores are the best to use".

I call this method "Thread Spamming". Yes, it sounds ridiculous. But it's a very simple and effective way to deal with load imbalance.
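
To see why extra threads can help, consider a deliberately simplified model, which is an assumption made for illustration and not y-cruncher's actual scheduling: the work is split into equal pieces, one per thread, and each core runs one piece at a time, so the cores work through the pieces in rounds. The function name `relative_time` is hypothetical.

```python
import math


def relative_time(threads, cores):
    """Run time relative to a perfectly balanced run (1.0 is ideal).

    Simplified model: `threads` equal pieces of work, each core runs one
    piece at a time, so the run takes ceil(threads / cores) rounds and
    each round lasts 1/threads of the total work.
    """
    rounds = math.ceil(threads / cores)
    # (rounds / threads) divided by the ideal time (1 / cores).
    return rounds * cores / threads


for threads in (4, 8, 16, 32):
    print(threads, "threads on 3 cores:", round(relative_time(threads, 3), 3))
```

In this model, 4 threads on 3 cores take 1.5 times the ideal run time, while 32 threads take about 1.03 times the ideal. A real operating system time-slices threads, so actual behavior will differ, but the trend is the same: more, smaller pieces leave less idle time at the end.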
 
Q:  Is y-cruncher open-sourced?
A:  No.
 
Q:  Is there a publicly available static library for the multi-threaded arithmetic that y-cruncher uses?
A:  No. At least not now...

y-cruncher's arithmetic module is indeed isolated from the rest of the program in its own library. I call it "YMP" (y-cruncher Multi-Precision Arithmetic Library), but it is also closed-source.

Currently, y-cruncher is the only thing that uses YMP in its entirety. But there is a growing interest in using y-cruncher's FFT in some signal processing and optical-related work since it is significantly faster than FFTW in a number of performance critical applications.
Q:  Who are you? Are you really still in college? What degrees do you have? etc...
A:  Yes, I'm still in college. As of spring 2009, I am 21 years old and a junior undergraduate student at Northwestern University, just north of Chicago, Illinois.
Therefore, I don't even have a college diploma yet - let alone a master's or Ph.D... So I apologize if the tone of my writing throughout this website is that of a restless college student.
I am a computer enthusiast and a semi-die-hard gamer. Outside of computers, my hobbies include bowling, piano, and Japanese anime.
And lastly, no, I don't speak Japanese (as much as I'd like to). Aside from English, I speak Cantonese and a tiny bit of Mandarin.

Links:

Here are some interesting sites dedicated to the computation of Pi and other constants:

Special Thanks

Questions or Comments

Contact me via e-mail. I'm pretty good with responding unless it gets caught in my school's junk mail filter.