Removes ZeroFrog's "optimized" memcpy and memcmp functions. #370

Sonicadvance1 · 2014-05-15T09:22:28Z

These were only compiled in on Windows and x86_32.
They provided "optimized" copies and compares based on blocksizes for the AMD Athlon and Duron CPU families.
The code was taken from something that AMD provides under an as-is license.
Just get rid of this crap.

delroth · 2014-05-15T09:23:36Z

LGTM, libc memcpy is probably much faster nowadays.

degasus · 2014-05-15T09:27:46Z

As x86_32 is deprecated, the performance doesn't matter much, so LGTM

delroth · 2014-05-17T15:42:21Z

Please rebase.

lioncash · 2014-05-17T15:43:20Z

Looks good to me as well (following the rebase)

shuffle2 · 2014-05-22T02:57:55Z

Maybe we should replace these with optimized x86_64 (or other platform specific ones), or just a placeholder wrapper so they can be marked for improvement in the future.
It's hard to believe that any standard library routine would be faster, especially if you know e.g. the size or alignment attributes of the buffers at compile time, but the compiler can't prove they are constant for whatever reason (for example, I would guess that many GX/DMA/etc buffers used by the emulated software happen to be nicely aligned on the host, as well). Would be interesting to see some tests around this...

Parlane · 2014-05-22T03:00:47Z

First, a word of advice. Assume that the people who wrote your standard library are not stupid. If there was a faster way to implement a general memcpy, they'd have done it.

Second, yes, there are better alternatives.

- In C++, use the std::copy function. It does the same thing, but it is (1) safer and (2) potentially faster in some cases. It is a template, meaning that it can be specialized for specific types, making it potentially faster than the general C memcpy.
- Or, you can use your superior knowledge of your specific situation. The implementers of memcpy had to write it so it performed well in every case. If you have specific information about the situation where you need it, you might be able to write a faster version. For example, how much memory do you need to copy? How is it aligned? That might allow you to write a more efficient memcpy for this specific case. But it won't be as good in most other cases (if it'll work at all).

Matthew Parlane
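
To make the std::copy suggestion above concrete, here is a minimal sketch. The `Vertex` struct and both function names are hypothetical and not taken from Dolphin. For trivially copyable types, standard library implementations commonly lower `std::copy` to a `memmove`-style copy, so the practical gain is usually type safety rather than speed; that is an expectation, not a measured result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical vertex layout, used only for illustration.
struct Vertex
{
    float x, y, z;
    std::uint32_t color;
};

// memcpy is only valid for trivially copyable types; std::copy has no such limit.
static_assert(std::is_trivially_copyable<Vertex>::value,
              "Vertex must be trivially copyable to be copied with memcpy");

// Typed copy: count is in elements, so there is no sizeof arithmetic to get wrong.
void CopyVerticesTyped(const Vertex* src, Vertex* dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    std::copy(src, src + count, dst);
}

// Byte-wise copy: the caller is responsible for the count * sizeof multiplication.
void CopyVerticesRaw(const Vertex* src, Vertex* dst, std::size_t count)
{
    assert(count == 0 || (src != nullptr && dst != nullptr));
    std::memcpy(dst, src, count * sizeof(Vertex));
}
```

Both functions assume the source and destination ranges do not overlap.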

Parlane · 2014-05-22T03:03:25Z

Also see this answer:
http://stackoverflow.com/questions/1134103/clearing-a-small-integer-array-memset-vs-for-loop/1134147#1134147

shuffle2 · 2014-05-22T03:05:47Z

That is basically what I said...I expect we have situations which fall into "Or, you can use your superior knowledge of your specific situation.", especially around graphics and other buffers which the game programmers must have made properly aligned for the device to operate correctly. Our memory base is aligned, so "gamecube-aligned" buffers happen to be implicitly aligned from dolphin's view. However, the compiler cannot automagically see this.
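
One way to pass such alignment knowledge to the compiler is sketched below. It is an illustration, not code from Dolphin: `AssumeAligned32` and `CopyAligned32` are hypothetical names, the 32-byte figure is an assumption chosen for the example, and `__builtin_assume_aligned` is a GCC/Clang builtin, so other compilers take the plain fallback path. Whether the hint produces a faster copy than a bare `memcpy` would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: promises the compiler that p is 32-byte aligned.
// The assert checks the promise in debug builds; breaking it is undefined behavior.
template <typename T>
inline T* AssumeAligned32(T* p)
{
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, 32));
#else
    return p;  // No hint available: still correct, just not annotated.
#endif
}

// Copies size bytes between two buffers that the caller knows are 32-byte
// aligned, with size a multiple of 32 (assumed preconditions).
void CopyAligned32(const std::uint8_t* src, std::uint8_t* dst, std::size_t size)
{
    assert(size % 32 == 0);
    const std::uint8_t* s = AssumeAligned32(src);
    std::uint8_t* d = AssumeAligned32(dst);
    std::memcpy(d, s, size);  // The compiler may now pick aligned moves here.
}
```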

Sonicadvance1 · 2014-05-22T15:03:11Z

If someone feels like taking the time to write a "faster" x86_64 specific memcpy and memcmp then do it. That is outside of the scope of this PR.

galop1n · 2014-05-22T15:21:23Z

I made a Memcpy16 that requires dst and size to be aligned to 16 bytes, intended for vertex buffer uploads. It is not in git yet, and I first have to profile it to see if it is worth the effort.
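
The actual Memcpy16 was never posted in this thread, so the following is only a guess at what a copy with those constraints could look like, using SSE2 intrinsics. The macro `MEMCPY16_USE_SSE2` and the function body are assumptions made for illustration; on targets without SSE2 the sketch falls back to `std::memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define MEMCPY16_USE_SSE2 1  // Hypothetical macro, defined only for this sketch.
#endif

// Copies size bytes. Preconditions taken from the comment above:
// dst is 16-byte aligned and size is a multiple of 16. src may be unaligned.
void Memcpy16(void* dst, const void* src, std::size_t size)
{
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(size % 16 == 0);
#ifdef MEMCPY16_USE_SSE2
    auto* d = static_cast<std::uint8_t*>(dst);
    const auto* s = static_cast<const std::uint8_t*>(src);
    for (std::size_t i = 0; i < size; i += 16)
    {
        // Unaligned load from the source, aligned store to the destination.
        const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        _mm_store_si128(reinterpret_cast<__m128i*>(d + i), chunk);
    }
#else
    std::memcpy(dst, src, size);
#endif
}
```

As the comment says, a routine like this is only worth keeping if profiling shows it beating the library `memcpy` on the buffers it is used for.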

shuffle2 · 2014-05-22T20:01:58Z

Alright, we will leave the architecture-specific optimizations to future PRs.
Comment on that: It would be nice to first collect a list of candidate areas in dolphin where such "gc-aligned" data is copied.

galop1n · 2014-05-22T20:04:17Z

Mine was meant to help VB uploading. NVIDIA and AMD both advise aligning everything to 16 bytes to help the driver and keep overhead as low as possible.

shuffle2 · 2014-05-22T20:07:17Z

@galop1n OK, those could be nice as well :) Data with alignment requirements on the host may have slightly different alignment/size requirements, so it might be good to have a different specialization for such things (which may take into account host GPU model, CPU model, etc).

shuffle2 added a commit that referenced this pull request May 22, 2014

Merge pull request #370 from Sonicadvance1/remove_specialized_memcmp

b58753b

Removes ZeroFrog's "optimized" memcpy and memcmp functions.

shuffle2 merged commit b58753b into dolphin-emu:master May 22, 2014
