I made a version that uses basic optimizations that make it quite fast, without using FFTs. I ended up with 50 seconds for computing the factorial represented as a long binary integer, and around 800 seconds to turn it into a string (in decimal) and save it to a file. I have a feeling I could make the second part faster, but I don't know how right now.
The main optimization is to max out the capability of the multiply instruction. That means multiplying several numbers together (until multiplying by another one would put it above 32 bits). before multiplying by the entire array. So, for 50000!, it would take 50000*49999 = 2499950000 < 2^32, and then it would let the program know that the next number it needs to worry about is 49998. This is combined with another optimization that factors two out of every integer, so, what it would it would do for, say 16000! is 125 * 15999 = 1999875 < 2^32, and let the program also know that the final number needs to be shifted over 8 additional bits once the multiplication is done.
Now, the output of GetMultiply is multiplied by the number in 32 bit parts, which can I implemented like this:
uint64_t Carry = 0;
for (int var = 0; var < IterStop; var++){
Carry= Number[var] * MultTo + ((Carry) >> 32);
Number[var] = Carry;//only transfers the lower 32 bits
}
This actually requires MultTo to be less than 31 bits so that the addition of Carry does not overflow.
Turning this number into a decimal string is slightly easier, just divide each 32 bit section by 1000000000 < 2^32 and the remainder of the previous division is bit-shifted 32 bits and added to the number below. I am bad at explaining this, so here is my implementation:
while(Number.size() > 0)
if(Number.back() == 0)
Number.pop()
uint64_t Remainder = 0;
for_each(Number.rbegin(), Number.rend(), [&](uint32_t & Num){
Remainder <<= 32;
Remainder += Num;
Num = Remainder / 1000000000 ;
Remainder %= 1000000000 ;
});
Then, I convert the remainder to a backwards string representation
Finally, I flip the string around.
This is more efficient than doing the multiplication and the division in one step like Will does because this divides the same number, "Remainder" by the same dividend, and so the compiler can do a single divide instead of two (the x86 div instruction does modular and regular division every time it divides any number).
So there is some basic stuff that the mathematically uninclined can do to make their factorials faster.
ps. The source code is 180 lines long
This got me down to about 300 seconds for the binary representation of the 1000000 factorial. The final drop to 50 seconds was the realization that the x86 computer can do a full 64 by 64 bit multiply without truncating, and so I did some inline assembly and made the GetMultiply group to numbers less than 64 bits and got it to go 6 times faster, because it groups better and does half as many multiply instructions. But that is kind of hard if you don't know assembly.