Friday, 1 January 2021

T-Bug, memory management, and Cyberpunk

[Image: a Cyberpunk 2077 character in T-pose]
I have no inside information about the development of Cyberpunk 2077, but I am a software engineer with 35 years of experience, and I have written mods for both CD Projekt and BioWare games.

Cyberpunk is essentially two products: RED Engine, and Cyberpunk itself, which runs on top of RED Engine. The engine is very much analogous to a JVM: it abstracts the platform for the game code that runs on top of it.

The Cyberpunk layer itself, and the graphical and audio assets, are probably identical across PC and the old and new generations of the Xbox and PlayStation platforms. It is the engine that differs between the platforms. The Cyberpunk layer seems to me to be in a good state of completion: there are bugs, but they're relatively minor.

The version of RED Engine used for Cyberpunk is surprisingly little changed from the version used for The Witcher 3, CD Projekt's previous major game. The most obvious change is improved background loading of assets. On a PC towards the upper end of the recommended spec, this too is reasonably solid: I have had one crash, one significant audio glitch, and two or three minor visual glitches in twenty hours of gameplay.

But it is clearly in the engine that the problems lie, and that, I think, is where the problems have always been.

The game was launched simultaneously on PC, and on both 'last gen' and 'next gen' Xbox and PlayStation consoles, although on both Xbox and PlayStation the code at release used the 'last gen' APIs, and next gen consoles run this in their respective backwards compatibility modes. A further release using next gen APIs is promised, but is not yet available.

The game runs reasonably well on modern gaming PCs and on next generation consoles, but extremely poorly on last generation consoles, to the extent of drawing a great deal of negative comment. So why? What's going wrong?

I emphasise again: I don't know; I have no inside information. This essay is reasonably well-informed speculation, and nothing more; but it is my considered opinion.

What sets the older Xbox and PlayStation platforms apart is that they have much more limited i/o speed, and much more limited main memory, than the newer generation (or than current PCs). They also have slower processors and more limited graphics subsystems.

Night City — the setting for Cyberpunk — is an extraordinarily ambitious and complex visual environment. To render a single static scene, hundreds of models and textures must be loaded from backing store. 

But the scenes are not static. On the contrary, the user can look around freely at all times, and can move quickly through the environment. At the same time, dozens of non-player characters, vehicles, aircraft and other mobile game objects are also moving (some rapidly) through the scene.

From a development and testing point of view, it's easy to test that a given asset can be loaded into memory and rendered in a given time. It's even relatively easy to test whether a given set of assets can be loaded in a given time.
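
A test of that first kind might look something like this sketch (C++; the asset path, the time budget, and the trivial loadAsset stub are my inventions for illustration, not anything from RED Engine):

    #include <cassert>
    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <string>
    #include <vector>

    // Minimal stand-in for the engine's asset loader: just slurps a file.
    std::vector<char> loadAsset(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        return std::vector<char>(std::istreambuf_iterator<char>(in),
                                 std::istreambuf_iterator<char>());
    }

    // Assert that one asset loads within a wall-clock budget.
    void testAssetLoadsWithinBudget(const std::string& path,
                                    std::chrono::milliseconds budget) {
        auto start = std::chrono::steady_clock::now();
        auto data = loadAsset(path);
        auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start);
        assert(!data.empty());
        assert(elapsed <= budget);
        std::printf("%s loaded in %lld ms (budget %lld ms)\n", path.c_str(),
                    static_cast<long long>(elapsed.count()),
                    static_cast<long long>(budget.count()));
    }

    int main() {
        // Hypothetical asset and budget, purely for illustration.
        testAssetLoadsWithinBudget("textures/street_wall.tex",
                                   std::chrono::milliseconds(50));
    }

The catch is that a test like this measures the development machine's i/o, not the target console's; to mean anything it has to be run, or at least calibrated, on real last generation hardware.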

But what I have particularly seen in the videos of the game running on old-generation hardware is:
  1. Late loading of higher resolution textures; and
  2. Assets (particularly non-player characters) being rendered in default poses.
I also hear that there are a lot of crashes, which I'll come back to.

The two issues I've described above both seem to come down to the program being i/o bound: it can't get data from disk to screen fast enough, because of limitations in bandwidth. That's hard physics: yes, you can work to make the graphics selection and loading code as efficient as possible, but if you need all those bits on the screen to render a scene and the system doesn't have the raw bandwidth, it isn't going to happen.
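
To put very rough numbers on it (these are my assumptions, not measured figures): the launch PS4's 5400rpm hard drive sustains something in the region of 50 to 100 megabytes per second. If a dense street scene needs, say, two gigabytes of models and textures that are not already in memory, that is twenty to forty seconds of raw transfer time, and no cleverness in software changes that arithmetic. The NVMe solid state storage in a PS5 or a decent gaming PC, moving several gigabytes per second, shifts the same data in well under a second.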

The problem is made worse by limited main memory. Where there is main memory to spare, it can be used to cache near-screen assets, so that if, for example, the player turns their head, the required assets are already in main memory. But if main memory is exhausted with all the assets currently on screen, then when the player turns their head, unwanted assets must be culled and fresh assets loaded, immediately.
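
As a sketch of the idea (and only the idea: this is a toy, with invented names and a naive least-recently-used policy, where a real engine would prioritise by distance, visibility and importance), a cache of that kind might look like:

    #include <cstddef>
    #include <list>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // A toy LRU cache for loaded assets, keyed by asset path.
    // Everything here is illustrative; it is not RED Engine code.
    class AssetCache {
    public:
        explicit AssetCache(std::size_t budgetBytes) : budget_(budgetBytes) {}

        // Returns the asset, loading it (and evicting old ones) if necessary.
        const std::vector<char>& get(const std::string& path) {
            auto it = index_.find(path);
            if (it != index_.end()) {
                // Cache hit: move the entry to the front of the recency list.
                lru_.splice(lru_.begin(), lru_, it->second);
                return it->second->data;
            }
            Entry e{path, load(path)};
            used_ += e.data.size();
            lru_.push_front(std::move(e));
            index_[path] = lru_.begin();
            // Evict least-recently-used entries until we are under budget.
            while (used_ > budget_ && lru_.size() > 1) {
                used_ -= lru_.back().data.size();
                index_.erase(lru_.back().path);
                lru_.pop_back();
            }
            return lru_.front().data;
        }

    private:
        struct Entry { std::string path; std::vector<char> data; };

        // Stub loader; a real one would hit the disk asynchronously.
        static std::vector<char> load(const std::string&) {
            return std::vector<char>(1024 * 1024); // pretend: a 1 MB asset
        }

        std::size_t budget_;
        std::size_t used_ = 0;
        std::list<Entry> lru_;
        std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    };

    int main() {
        AssetCache cache(8 * 1024 * 1024); // toy 8 MB budget
        cache.get("npc/pedestrian_03.mesh");
        cache.get("npc/pedestrian_07.mesh");
    }

The point of the sketch is the eviction loop at the end of get(): on hardware with memory to spare it almost never runs; on hardware without, it runs constantly, and every eviction is a potential reload a few frames later.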

This raises the issue of crashing. These game assets are big. Culling and reloading will rapidly fragment the heap, but pauses for garbage collection are highly undesirable in a fast-moving, real-time environment, and near-real-time GC of rapidly fragmenting heaps is hard.
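
One standard defence, which I would expect any engine team to use in some form, is to avoid general-purpose heap allocation for assets altogether and carve memory into fixed-size blocks, so that freeing a block never leaves an awkwardly shaped hole. A toy version of the idea (my code, not CD Projekt's):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // A toy fixed-size block pool: one common way engines sidestep heap
    // fragmentation. All blocks are the same size, so releasing one never
    // fragments the arena. Illustrative only.
    class BlockPool {
    public:
        BlockPool(std::size_t blockSize, std::size_t blockCount)
            : storage_(blockSize * blockCount), blockSize_(blockSize) {
            // Thread every block onto the free list.
            for (std::size_t i = 0; i < blockCount; ++i)
                free_.push_back(storage_.data() + i * blockSize);
        }

        void* allocate() {
            if (free_.empty()) return nullptr; // caller must degrade gracefully
            void* p = free_.back();
            free_.pop_back();
            return p;
        }

        void release(void* p) { free_.push_back(static_cast<std::uint8_t*>(p)); }

        std::size_t blockSize() const { return blockSize_; }

    private:
        std::vector<std::uint8_t> storage_;
        std::size_t blockSize_;
        std::vector<std::uint8_t*> free_;
    };

    int main() {
        BlockPool pool(64 * 1024, 256); // 256 blocks of 64 KB: a 16 MB arena
        void* a = pool.allocate();
        pool.release(a);                // freeing never fragments the arena
    }

The cost is internal fragmentation instead: assets must be chunked to fit the block size, blocks are partly wasted, and the pool sizes have to be tuned per platform. Get that tuning wrong on the most constrained platform and things break there first.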

Worse, I suspect, is what happens if the assets required to render a scene in themselves exhaust main memory. I'm pretty sure this happens, because it's noticeable that scenes rendered on old generation consoles contain fewer non-player characters than similar scenes rendered on PC. There's clearly code that decides whether to cull non-plot-critical non-player characters when memory load is high.
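
I obviously don't know what that code looks like, but the shape of the decision is easy to imagine; something like this toy version, in which every name and field is my invention:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Illustrative only: one plausible shape for "cull background NPCs
    // when memory is tight". Not CD Projekt's code.
    struct Npc {
        bool plotCritical;
        float distanceFromPlayer;
        std::size_t memoryCost;
    };

    // Remove the cheapest-to-lose NPCs (background characters, furthest
    // from the player first) until estimated memory use fits the budget.
    void cullNpcs(std::vector<Npc>& npcs, std::size_t used, std::size_t budget) {
        std::sort(npcs.begin(), npcs.end(), [](const Npc& a, const Npc& b) {
            if (a.plotCritical != b.plotCritical) return !a.plotCritical;
            return a.distanceFromPlayer > b.distanceFromPlayer;
        });
        std::size_t i = 0;
        while (used > budget && i < npcs.size() && !npcs[i].plotCritical)
            used -= npcs[i++].memoryCost;
        npcs.erase(npcs.begin(), npcs.begin() + i);
    }

    int main() {
        std::vector<Npc> npcs = {
            {true, 2.0f, 10u}, {false, 50.0f, 30u}, {false, 5.0f, 20u}};
        cullNpcs(npcs, /*used=*/60, /*budget=*/40);
    }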

But thrashing is likely to occur (or at least, sophisticated code will be needed to prevent it) when the assets required to render a scene cannot be accommodated without removing other assets also required to render the exact same scene.
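
One way to prevent that thrashing, sketched below purely as an illustration of the idea (I have no reason to think this is what RED Engine does), is to refuse to load what will not fit: compute whether the scene's working set exceeds the memory budget, and if so degrade texture detail globally, since each mip level down quarters a texture's memory cost, rather than evicting and reloading assets within the same frame:

    #include <cstddef>
    #include <vector>

    // Illustrative anti-thrashing idea: if a scene's textures will not
    // fit, drop detail (higher mip bias = lower resolution) instead of
    // evicting and reloading within the same frame.
    struct SceneTexture {
        std::size_t bytesAtMip(int mip) const {
            std::size_t b = fullSizeBytes;
            for (int i = 0; i < mip; ++i) b /= 4; // each mip level is 1/4 size
            return b;
        }
        std::size_t fullSizeBytes;
    };

    // Find the lowest global mip bias at which the scene's textures fit.
    int chooseMipBias(const std::vector<SceneTexture>& textures,
                      std::size_t budget, int maxBias = 4) {
        for (int bias = 0; bias <= maxBias; ++bias) {
            std::size_t total = 0;
            for (const auto& t : textures) total += t.bytesAtMip(bias);
            if (total <= budget) return bias; // fits: render at this detail
        }
        return maxBias; // even lowest detail is over budget; expect trouble
    }

    int main() {
        std::vector<SceneTexture> scene = {{64u << 20}, {16u << 20}, {4u << 20}};
        int bias = chooseMipBias(scene, /*budget=*/32u << 20);
        (void)bias; // the highest resolution at which this scene fits
    }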

This sort of code — especially when it is being developed under pressure — is very susceptible to the sort of bugs which cause crashes.

So, from a quality point of view, where does that leave us? All these aspects of engine performance are suitable for unit tests, integration tests and characterisation tests. Characterisation tests – does this code run exactly the same as that code? – may be particularly relevant when testing ports to multiple platforms.
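
A characterisation test for a port can be as blunt as: run a fixed scenario deterministically, hash the resulting bytes, and compare the hash against one recorded on the reference platform. A sketch (the scenario function is a stand-in; a real test would drive the engine with a fixed seed):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // FNV-1a: a tiny non-cryptographic hash, fine for fingerprinting output.
    std::uint64_t fnv1a(const std::vector<std::uint8_t>& bytes) {
        std::uint64_t h = 14695981039346656037ull;
        for (std::uint8_t b : bytes) { h ^= b; h *= 1099511628211ull; }
        return h;
    }

    // Stand-in for "simulate N frames from a fixed seed and capture state";
    // a real characterisation test would call into the engine here.
    std::vector<std::uint8_t> runDeterministicScenario(unsigned seed) {
        std::vector<std::uint8_t> state(256);
        for (std::size_t i = 0; i < state.size(); ++i)
            state[i] = static_cast<std::uint8_t>((seed + i * 31) & 0xff);
        return state;
    }

    int main() {
        // On the reference platform, record this value; on each port,
        // assert that the same scenario produces the same hash.
        std::printf("golden hash: %016llx\n",
                    static_cast<unsigned long long>(
                        fnv1a(runDeterministicScenario(42))));
    }

If a port and the reference disagree, the test has characterised a difference: it won't tell you which platform is wrong, but it tells you exactly which change introduced the divergence.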
 
If there is not a comprehensive test suite and a continuous integration platform, then someone has been seriously derelict in their duty, and I do not believe that is the case: CD Projekt strike me, in both artistic and technical proficiency, as pretty thorough.

Furthermore, we've seen very impressive renderings of scenery and action for two years now, so the upper bound on the size and number of assets required for scenes has been known for at least that long. The performance and stability problems on old generation consoles must therefore have been known.

That implies to me that management ought to have said, at least a year ago, "we will launch only on PC and next-gen platforms, and a degraded version for old generation consoles may follow later but we don't know when."

Obviously, investors and owners of older consoles would have been disappointed, but it would have avoided a significant hit to reputation.
 
This essay started as a comment on a YouTube video, which, if you're interested, you should watch. 

 

Creative Commons Licence
The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License