
cocos2d-x 3.x runtime performance comparison

wangwenjiegame

I tried to compare the runtime performance of v2.3 and v2.5, but the result is frustrating. I can't believe v2.3 performs better than v2.5.

Mario

The second one is drawing more vertices, this is not a 1:1 comparison. Also note how the second one only has a single batch.

wangwenjiegame

The second one differs because I am using the latest version of the runtime.
I hope the official testers can compare the performance of the various versions and then check whether there is a problem with my test results. I also hope the new version can deliver better performance.

Nate

Why does one test have more vertices? Are you drawing exactly the same things for both tests?

Also it depends what you are testing on. 500 draw calls is unlikely to be fast on mobile while desktop generally doesn't care about many draw calls.

jpoag

Frame-rate steps that are not an integer division of the refresh rate (45 fps) are evidence of CPU-side problems: it's dropping every 4th frame. If it were an integer step ({15, 30, 60}), you could say that it's consistently dropping frames, possibly because of a slow render call.
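The arithmetic behind that observation can be sketched as follows, assuming a 60 Hz display with vsync (the thread does not state the refresh rate). Both function names are made up for illustration.

```cpp
#include <cassert>

// Effective frame rate when one out of every `period` frames misses its
// vsync slot and is dropped (an occasional CPU-side hitch).
double fpsWhenDroppingOneIn(int period, double refreshHz) {
    assert(period > 1);
    return refreshHz * (period - 1) / period;
}

// Frame rate when every frame consistently takes `intervals` vsync
// intervals to finish (for example a render call that is always slow).
double fpsWhenEveryFrameTakes(int intervals, double refreshHz) {
    assert(intervals >= 1);
    return refreshHz / intervals;
}
```

With a 60 Hz refresh, dropping one frame in four gives 45 fps, while a frame that always needs two or four intervals gives 30 or 15 fps.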

As BadLogic put it, "Use the profiler, Luke."

Nate's point that the same scene is submitting more verts is a valid concern. My first guess was that the number of verts being submitted in the DRAW call was the size of the buffer instead of the used count (it's not). My second guess is a really long command chain cache with bad verts.

If I were a betting man, I would probably fault this code here:

void SkeletonBatch::addCommand (cocos2d::Renderer* renderer, float globalZOrder, GLuint textureID, GLProgramState* glProgramState,
                                BlendFunc blendFunc, const TrianglesCommand::Triangles& triangles, const Mat4& transform, uint32_t transformFlags
                                ) {
    // We don't know what order or size we are getting attachments, or if the currently allocated verts are the right size, so...
    if (_command->triangles->verts) {
        free(_command->triangles->verts);
        _command->triangles->verts = NULL;
    }

    // malloc every frame
    _command->triangles->verts = (V3F_C4B_T2F *)malloc(sizeof(V3F_C4B_T2F) * triangles.vertCount);

    // copy every frame
    memcpy(_command->triangles->verts, triangles.verts, sizeof(V3F_C4B_T2F) * triangles.vertCount);

    _command->triangles->vertCount = triangles.vertCount;
    _command->triangles->indexCount = triangles.indexCount;
    _command->triangles->indices = triangles.indices;

    _command->trianglesCommand->init(globalZOrder, textureID, glProgramState, blendFunc, *_command->triangles, transform);
    renderer->addCommand(_command->trianglesCommand);

    // let's create a long command chain 'cache', although we aren't certain that every frame we will use every command
    if (!_command->next) _command->next = new Command();
    _command = _command->next;
}

So Cocos2dx already has an issue raised to "optimize".

Here are my suggestions, your mileage may vary.

SkeletonBatch::update is a good place to delineate frames and reset per-frame state.
Use a command pool and grab commands from there instead of new()ing them on the end of the chain. This means that at the end of the frame you would need to break the chain.
Use a large chunk of memory to allocate verts. Share this memory with AttachmentVertices::_triangles->verts and write a manager that works similarly to new but returns sections of the larger chunk. Reset() the manager's cursor to zero on SkeletonBatch::update.

The biggest issue with this approach is reallocating. I'd use std::vector<V3F_C4B_T2F> for the memory pool, but you can only resize between frames, because a resize can move the storage and invalidate pointers that were already handed out. That means either dropping verts once the memory is full and resizing the next frame, or having the user specify the init() batch size.
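A minimal sketch of such a vertex manager, under stated assumptions: `Vertex` is a stand-in for cocos2d-x's V3F_C4B_T2F, and `VertexArena` with its methods is a hypothetical name, not part of spine-runtimes or cocos2d-x. It takes the "drop verts when full, resize next frame" option described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for cocos2d::V3F_C4B_T2F (position, color, UV); layout is illustrative.
struct Vertex {
    float x, y, z;
    unsigned char r, g, b, a;
    float u, v;
};

// Hypothetical per-frame arena: one large block, handed out in slices.
class VertexArena {
public:
    explicit VertexArena(std::size_t capacity) : _storage(capacity) {
        assert(capacity > 0);
    }

    // Returns a slice of `count` vertices, or nullptr when the block is full.
    // The storage is never resized here, so slices handed out earlier in the
    // frame stay valid.
    Vertex* allocate(std::size_t count) {
        _requested += count;  // remember the demand, even if we cannot serve it
        if (_cursor + count > _storage.size()) return nullptr;
        Vertex* slice = _storage.data() + _cursor;
        _cursor += count;
        return slice;
    }

    // Call once per frame (for example from SkeletonBatch::update). Grows the
    // block if the previous frame asked for more than it could hold.
    void reset() {
        if (_requested > _storage.size()) _storage.resize(_requested);
        _cursor = 0;
        _requested = 0;
    }

    std::size_t capacity() const { return _storage.size(); }
    std::size_t used() const { return _cursor; }

private:
    std::vector<Vertex> _storage;
    std::size_t _cursor = 0;
    std::size_t _requested = 0;
};
```

A caller that gets nullptr skips drawing that attachment for one frame; after the next reset() the arena is large enough for the same load.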

framusrock

We did some more profiling and posted the results in the GitHub Issue: https://github.com/EsotericSoftware/spine-runtimes/issues/767

@jpoag, what you said makes total sense and our profiling comes to the same results. But there are also some other aspects that could be optimized - even though this might be the lowest-hanging fruit right now 😉

Mario

I'd actually go with an implementation like this, which gives us the biggest freedom (and would also allow more advanced stuff like two color tinting and potentially clipping) http://blog.trsquarelab.com/2015/02/custom-mesh-rendering-with-texture-in.html

However, my biggest issue is finding information on the threaded nature of cocos2d-x's rendering pipeline. My plan would be to assemble the mesh(es) in Node::draw() and "submit" them for rendering in the rendering pipeline. This "submission" is essentially just setting a field on the Node which is then consumed by the method that's being called on the rendering thread. I hope cocos2d-x comes with built-in concurrency primitives for locking...

Mario

OK, I dove into the Cocos2D-X rendering architecture. They claim to do rendering on a separate thread, but really everything is done on the same thread it seems. At least the draw commands are generated on the same thread where they are consumed (scene traversal, gathering of commands, rendering of commands).

That makes things considerably easier. We'll keep the current class hierarchy (SkeletonBatch singleton, SkeletonRenderer, SkeletonAnimation). However, SkeletonBatch will have a pool of TrianglesCommands. When the scene is visited, SkeletonRenderer will batch as many triangles as possible into the same TrianglesCommand provided by the SkeletonBatch. The batch submits those to the rendering queue of Cocos2d-X. In the next frame, the SkeletonBatch can reclaim the TrianglesCommands and reuse them for the next rendering phase. This should bring down allocations considerably and improve performance overall. Cocos2D-X will also have to do less batching on its own, since we already have all that info when traversing the skeleton attachments, which makes it an easier task.
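The reclaim-and-reuse idea can be sketched roughly as below. This is an illustration under assumptions, not the runtime's code: `FakeTrianglesCommand` stands in for cocos2d::TrianglesCommand, and `CommandPool`, `next` and `reclaim` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for cocos2d::TrianglesCommand; the real class is not used here.
struct FakeTrianglesCommand {
    int vertexCount = 0;
};

// Hypothetical pool: commands are allocated once and reused every frame.
class CommandPool {
public:
    // Hands out the next free command, allocating only when every pooled
    // command is already in use this frame.
    FakeTrianglesCommand* next() {
        assert(_nextFree <= _commands.size());
        if (_nextFree == _commands.size()) {
            _commands.push_back(
                std::unique_ptr<FakeTrianglesCommand>(new FakeTrianglesCommand()));
        }
        return _commands[_nextFree++].get();
    }

    // Call at the start of a frame: every command becomes available again.
    void reclaim() { _nextFree = 0; }

    std::size_t allocated() const { return _commands.size(); }

private:
    // unique_ptr keeps each command at a stable address when the vector
    // grows, which matters because the render queue holds raw pointers.
    std::vector<std::unique_ptr<FakeTrianglesCommand>> _commands;
    std::size_t _nextFree = 0;
};
```

Once the pool has grown to the peak number of commands a scene needs, later frames run without any allocation on this path.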

All of this hinges on the fact that Cocos2D-X is actually not doing rendering on a separate thread. I confirmed this for both macOS and iOS, and it seems the same paths are taken on Windows and Android. I guess they eventually realized that threaded rendering is a bit much, especially if you allow your users to have custom commands 🙂

framusrock

OK, I dove into the Cocos2D-X rendering architecture. They claim to do rendering on a separate thread, but really everything is done on the same thread it seems. At least the draw commands are generated on the same thread where they are consumed (scene traversal, gathering of commands, rendering of commands).

Yes, Cocos2d-x is completely single-threaded as of now (except for HTTP calls).

However, SkeletonBatch will have a pool of TrianglesCommands. When the scene is visited, SkeletonRenderer will batch as many triangles as possible into the same TrianglesCommand provided by the SkeletonBatch.

I'm not quite sure I fully understand this plan. We should try to put as many triangles as possible into one single TrianglesCommand. However, a TrianglesCommand can only cover one combination of texture, globalZOrder, glProgramState and blendMode. So if any of these has changed since the last command that was sent to SkeletonBatch, we need to use another TrianglesCommand. Ideally, SkeletonBatch should have a pool of TrianglesCommands and keep track of the memory allocated for each of them. If additional memory is needed, or even a new TrianglesCommand, it should allocate it.
Malloc and free calls are very expensive, so we should cut them down as much as we can, which is why these improvements should be very beneficial.
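The "same state, same command" rule could look something like the sketch below. All names (`BatchKey`, `Batch`, `addAttachment`) are hypothetical, and the fields only mirror the four properties listed above; the blend mode is represented as a source/destination factor pair.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical render state that must match for two attachments to share
// one command: texture, globalZOrder, glProgramState and blend mode.
struct BatchKey {
    unsigned textureID;
    float globalZOrder;
    const void* glProgramState;
    unsigned blendSrc;
    unsigned blendDst;

    bool operator==(const BatchKey& o) const {
        return textureID == o.textureID && globalZOrder == o.globalZOrder &&
               glProgramState == o.glProgramState &&
               blendSrc == o.blendSrc && blendDst == o.blendDst;
    }
};

// One batch corresponds to one command submitted to the renderer.
struct Batch {
    BatchKey key;
    std::size_t vertexCount;
};

// Appends to the current batch while the state is unchanged; starts a new
// batch as soon as any part of the key differs from the previous attachment.
void addAttachment(std::vector<Batch>& batches, const BatchKey& key,
                   std::size_t vertexCount) {
    assert(vertexCount > 0);
    if (!batches.empty() && batches.back().key == key) {
        batches.back().vertexCount += vertexCount;
        return;
    }
    batches.push_back(Batch{key, vertexCount});
}
```

Only the most recent batch is compared, so draw order is preserved: returning to an earlier texture after a different one starts a third batch rather than merging into the first.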

The batch submits those to the rendering queue of Cocos2d-X. In the next frame, the SkeletonBatch can reclaim the TrianglesCommands and reuse them for the next rendering phase.

The Cocos2d-x renderer does not free/release the TrianglesCommand handed to it, so we can safely re-use them on the next tick as you suggested.

This should bring down allocations considerably and improve performance overall. Cocos2D-X will also have to do less batching on its own, we already have all that info when traversing the skeleton attachments, which makes it an easier task.

Yes. In conclusion, this means that we only ever allocate memory when we really need it (that is, when more memory is needed than in the last frame). For a typical scene that only uses one texture, globalZOrder, glProgramState and blendMode, SkeletonBatch should only ever send one single TrianglesCommand to the cocos2d-x renderer, which also reduces the load on the renderer drastically.
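As a rough illustration of "allocate only when more memory is needed than before", here is a grow-only buffer. `Vertex` is a stand-in for V3F_C4B_T2F and `GrowOnlyVertices` is a hypothetical name; the allocation counter exists only to make the behavior observable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>

// Stand-in for cocos2d::V3F_C4B_T2F; layout is illustrative.
struct Vertex {
    float x, y, z;
    unsigned char r, g, b, a;
    float u, v;
};

// Hypothetical buffer that keeps its memory between frames and reallocates
// only when a frame needs more vertices than any earlier frame did.
class GrowOnlyVertices {
public:
    void assign(const Vertex* src, std::size_t count) {
        assert(src != nullptr || count == 0);
        if (count > _capacity) {
            _verts.reset(new Vertex[count]);  // the only allocation path
            _capacity = count;
            ++_allocations;
        }
        std::copy(src, src + count, _verts.get());
        _count = count;
    }

    std::size_t count() const { return _count; }
    std::size_t allocations() const { return _allocations; }

private:
    std::unique_ptr<Vertex[]> _verts;
    std::size_t _capacity = 0;
    std::size_t _count = 0;
    std::size_t _allocations = 0;
};
```

Compared with the free/malloc pair in the addCommand code quoted earlier in the thread, a steady scene triggers one allocation in total instead of one per command per frame. The copy itself still happens every frame.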

Mario

There's one minor detail: we also get an MVP matrix per skeleton. That gets set on the TrianglesCommand, and it only stays the same for all attachments on a skeleton. The best we can do is one TrianglesCommand per skeleton. Cocos2d-X will still batch.

I'm currently implementing this strategy. However, to keep copying to a minimum (uvs, indices), I have to rely on the latest changes in the 3.6-beta branch on GitHub.

framusrock

Ah okay, the MVP matrix...makes sense. So the TrianglesCommand pool of the SkeletonBatch will have some more but smaller members 😉

About 3.6-beta: should we wait until 3.6 is released before we use it?

Mario

I might backport the changes to the master branch. It will incur a copy though, as the attachments' computeWorldVertices does not take a stride yet. Because of that, I have to first compute world vertices and store them in a float array, then copy them over to the vertices array. Not so nice.
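To show what a stride parameter buys, here is a simplified sketch. It is not the Spine API: `computeWorldVerticesStrided` is a hypothetical function, the "world transform" is reduced to a plain translation, and the interleaved vertex is modeled as six floats (x, y, z, packed color, u, v).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical strided variant: writes each transformed (x, y) pair directly
// into an interleaved buffer, advancing `stride` floats per vertex. With
// stride == 2 it behaves like a tightly packed version that would need a
// second copy into the vertex array.
void computeWorldVerticesStrided(const float* local, std::size_t count,
                                 float offsetX, float offsetY,
                                 float* out, std::size_t stride) {
    assert(stride >= 2);
    for (std::size_t i = 0; i < count; ++i) {
        // Simplification: a real skeleton applies the bone's matrix here.
        out[i * stride] = local[2 * i] + offsetX;
        out[i * stride + 1] = local[2 * i + 1] + offsetY;
    }
}
```

Passing the destination vertex array with a stride of 6 fills in positions without touching the color and UV slots, so the intermediate float array and the extra copy go away.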

fantasian

Sorry, am I missing something? cocos2d-x v3.6 was supposedly released 2 years ago. v3.14.1 is supposedly the latest (stable) release 2 months ago..?

Nate

3.6 is Spine's version number, not cocos2d-x.