Unified nVidia TNT/GeForce driver for Haiku



Personal 3D 'weblog' news:


15 April 2006 (expanded 17 April 2006): 3D driver Alpha 4.1 / 2D driver 0.80 benchmarks.

As promised on the new blog, here's a new table with current 3D driver speeds. These speeds were measured with 3D driver Alpha 4.1 combined with 2D driver 0.80, the latest BeOS release. They are compared to Windows 98 Second Edition speeds using a Detonator driver. The Detonator driver was set up to match the BeOS setup as closely as possible, so the speed comparison should be more or less 'fair'.

In the table below you'll find that more cards were tested than the BeOS driver currently supports: I was curious to know a bit more about the relative speeds of the 'newer' architecture cards.
You may also notice that TNT1 cards are slower than with older driver versions: this is because they now work correctly. Earlier, many texturing errors showed up because the driver did not synchronize the acceleration engine with the 'live' texture swapping that takes place on cards with 16Mb RAM. Granted, 16Mb should (probably) be enough to run Quake 2 without swapping in the mode tested, but the current BeOS driver setup does not yet use memory optimally. On TNT1 cards you'd best run Quake 2 in 16-bit color depth at 1024x768 for now, as that mode doesn't need texture swapping with 16Mb RAM.
Another interesting card is the GeForce4 MX4000. This is the second tested card with a 64-bit bus to its RAM (the other one is the TNT2-M64). I tested a 'noname' card, which showed in a few different ways: the card's GPU is clocked at 275MHz, which seems to be the official speed (my GeForce4 MX440 runs at the same GPU clock, at least). Because of the RAM speed related findings, I did a few extra speed tests with this card; you'll find the results in the table below. It's interesting to see that the card was fully stable when I clocked its RAM at 320MHz: I tested for some 15 minutes or so. Please note that I do not advise you to overclock cards: that might be dangerous to your card and your whole system. Anyhow, I did not want to keep these interesting results from you.
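To put the TNT1 remark about 16Mb into rough numbers, here is a small sketch of the video memory budget at 1024x768. It assumes (my assumption, not the driver's actual memory layout) a front buffer, a back buffer and a Z-buffer, all at the color depth of the mode, and it ignores the cursor, command buffers and alignment padding, so the real headroom will be smaller.

```python
# Rough video memory budget for a double-buffered 3D mode.
# Assumption: front + back color buffers plus one Z-buffer of the same
# depth; cursor, command buffers and alignment padding are ignored.

MIB = 1024 * 1024


def texture_headroom(vram_bytes: int, width: int, height: int, bits: int) -> int:
    """Return the bytes left for textures after the three screen-sized buffers."""
    bytes_per_pixel = bits // 8
    buffer_size = width * height * bytes_per_pixel
    used = 3 * buffer_size  # front, back and Z-buffer
    return vram_bytes - used


if __name__ == "__main__":
    for bits in (16, 32):
        left = texture_headroom(16 * MIB, 1024, 768, bits)
        print(f"1024x768 @{bits}bit: {left / MIB:.1f} MiB left for textures")
```

Under these assumptions it reports 11.5 MiB of texture space in 16-bit mode versus 7.0 MiB in 32-bit mode, which is consistent with the 32-bit mode being the one that runs into texture swapping first.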

Further below, there's a second table showing some speeds for Quake2 timedemo 1 on other (much slower) systems.

Table: alpha 4.1 Q2 speed on P4@2.8GHz, FSB@533MHz; at 1024x768x32@75Hz.

Card under test | BeOS | Windows | Percentage
TNT1 PCI, 16Mb (NV04) | 9.3 fps (tex swapping) | -- (not supported) | --
TNT1 PCI, 16Mb (NV04) | 23.2 fps @16bit color | -- (not supported) | --
TNT2-M64, 32Mb (NV05M64) | 10.2 fps | 21.9 fps | 47 %
TNT2-pro, 32Mb (NV05) | 17.0 fps | 41.5 fps | 41 %
GeForce2 MX400, 32Mb (NV11) | 27.1 fps | 86 fps | 32 %
GeForce2 Ti, 64Mb (NV15) | 45.6 fps | 165 fps | 28 %
GeForce4 MX440 AGP8x-type, 64Mb (NV18) | 37.0 fps | 119 fps | 31 %
GeForce4 MX4000, 128Mb (NV18) | 20.8 fps | 76 fps | 27 %
GeForce4 MX4000, 128Mb (NV18), coldstarted | 22.3 fps | -- (not tested) | --
GeForce4 MX4000, 128Mb (NV18), RAM overclocked 20% | 28.0 fps | -- (not tested) | --
GeForce4 Ti 4200, 128Mb (NV25) | -- (not supported) | 250 fps | --
GeForceFX 5200, 128Mb (NV34) | -- (not supported) | 130 fps | --
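The percentage column is the BeOS frame rate expressed as a share of the Windows frame rate. This sketch recomputes it from the table's figures; rounding to whole percents is my assumption about how the column was produced, but it reproduces every listed value.

```python
# Recompute the "Percentage" column: BeOS fps as a share of Windows fps.

def relative_speed(beos_fps: float, windows_fps: float) -> int:
    """Return BeOS speed as a whole-number percentage of the Windows speed."""
    return round(100 * beos_fps / windows_fps)


RESULTS = {
    "TNT2-M64 (NV05M64)": (10.2, 21.9),
    "TNT2-pro (NV05)": (17.0, 41.5),
    "GeForce2 MX400 (NV11)": (27.1, 86.0),
    "GeForce2 Ti (NV15)": (45.6, 165.0),
    "GeForce4 MX440 (NV18)": (37.0, 119.0),
    "GeForce4 MX4000 (NV18)": (20.8, 76.0),
}

if __name__ == "__main__":
    for card, (beos, windows) in RESULTS.items():
        print(f"{card}: {relative_speed(beos, windows)} %")
```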

The second table (below) shows some interesting things as well. But first, please note that the speeds on the dual P3 are only a tiny bit above what you'd see on a single P3 of the same speed: OpenGL is executed single-threaded, which means the second CPU doesn't do any work on it. It just handles other OS tasks, which relieves the first processor of that (relatively small) burden.
Interesting things are: Also, I was personally very pleased with the relatively good speeds I saw on that old P2 system I sometimes have access to. All systems' results combined, I'd say Alpha 4.1 is a nice release. :-)
Well, that's it for now. Have fun!

Table: alpha 4.1 Q2 speed on other systems.

System under test | 640x480, 16bit @60Hz | 1024x768, 32bit @60Hz
P2-350MHz, FSB@100MHz, TNT2-ultra, 32Mb (NV05), Dano | 29.0 fps | 21.3 fps
dual P3-500MHz, FSB@100MHz, TNT1, 16Mb (NV04), R5.0.1 pro | 33.0 fps | 6.2 fps (tex swapping); 21.4 fps @16bit color
dual P3-500MHz, FSB@100MHz, GeForce2 MX400, 32Mb (NV11), R5.0.1 pro | 35.0 fps | 24.8 fps


12 February 2006: Temporary revival of development on nVidia Mesa 3.2.1 based 3D accelerant.

Currently I find myself back at 3D development for nVidia. This came about because I wanted to clean up the 2D driver a bit: since adding a new 2D accelerated function (scaled_filtered_blit) there, I was forced to look in that cleanup direction. All in all, 3D rendering speed in 32-bit colorspace gained up to some 11%, and I am putting some time into a real 'swapbuffers' function to gain up to another 5% speed in higher-res fullscreen modes. While working on that I stumbled on the reason for the Quake2 drawing errors, for which I now even have a workaround.
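A swap done by blitting copies the whole back buffer on every frame, so its cost grows with resolution and color depth, while the number of 3D commands per frame does not. The sketch below only does that arithmetic; it assumes a plain full-screen copy without clipping, and the frame rate in the test is illustrative, not a measurement.

```python
# Cost of a blit-based swapbuffers: the whole back buffer is copied per frame.

def blit_bytes_per_swap(width: int, height: int, bits: int) -> int:
    """Bytes copied when the full back buffer is blitted to the front buffer."""
    return width * height * (bits // 8)


def blit_bandwidth(width: int, height: int, bits: int, fps: float) -> float:
    """Bytes per second spent on swap blits at the given frame rate."""
    return blit_bytes_per_swap(width, height, bits) * fps


if __name__ == "__main__":
    for w, h, bits in ((640, 480, 16), (1024, 768, 32)):
        mib = blit_bytes_per_swap(w, h, bits) / (1024 * 1024)
        print(f"{w}x{h}x{bits}: {mib:.2f} MiB copied per swap")
```

At 1024x768x32 that is 3 MiB per frame against roughly 0.59 MiB at 640x480x16, which fits with a dedicated swap function paying off mostly in the higher-res modes.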

All this new stuff means I'll set up an Alpha 4 version of the driver with as many optimisations as I can get going. The 'current' version, alpha 3.5, was just a recompile of alpha 3, using Mesa 3.2.1 again instead of Mesa 3.4.2 because of speed issues.

I guess it's prudent once again to update this page every once in a while. I also have a new link for you, which now points to a 'real' blog page; I'll try to keep that updated as well. Have a look here for the blog. Talk to you later!


20 September 2005: The nVidia 3D driver development is on hold for now. Let me explain why.

As I already told you, the Mesa interface to an (accelerated) graphics driver was changed drastically between Mesa versions 3.4.2 and 3.5. Since then, the newly introduced method is still in place (AFAIK), changing only relatively 'marginally'. So I have to get the driver running in 6.2.1: there's no point in trying older versions anymore. While I already have the driver 'compiling' in 6.2.1, and clearbuffers / swapbuffers work accelerated, there is no actual rendering acceleration yet.

The past two months I have spent much time trying to understand the new way of plugging in a HW driver. Slowly I am starting to see parts of it, but it's all so different that I fear I cannot take two steps at once: understanding the interface AND writing a new driver 'from scratch'. Unfortunately (AFAIK) there's no good documentation out there about this interface that would make my life much easier.

So, I have a new 'roadmap' for you: as I was already working on a VIA graphics driver, and a 3D add-on exists for it too (one that is 'up to date', AFAIK), it makes sense to me to take this 'detour'. I need to understand the current DRI drivers fully before I can complete the nVidia driver update. And DRI uses current Mesa (of course ;-).

Mesa 6.2.1 drivers

Well, I should probably say something about what changed in the driver interface. Everything mentioned here is of course based on my (limited) knowledge of it, and may be incorrect or incomplete.

In Mesa 3.4.2 and earlier, there was one way of interfacing to it from a driver's viewpoint. Mesa used some data structures to store information, which a driver needed to 'translate' into a format its hardware understood. Also (as far as I have seen) every primitive needed a separate 'call' to the driver: you could render points, lines, triangles and quads. If no hardware driver was available, Mesa could render 'in software' itself. The software functions used the same interface that a hardware function would have.

Since Mesa 3.5, things seem to be much more nifty. Mesa still has internal software rendering (fallback) functions using (more or less) the same sort of interface to Mesa's core, but the interface to HW drivers is completely new. An important change is that the data structures used to hold information now exactly match the format of the hardware they are going to run on: the driver no longer needs to translate for rendering; it merely copies the data literally to the hardware engine (into its DMA command buffer, via 'DRM' if I am right).
Mesa even provides (larger) parts of core HW driver functions now, which a driver can just include. Because those core functions are HW dependent, a 'trick' is used: if in certain cases a function cannot be accelerated, the HW data structures (HWvertex) used by the driver need to be 'translated' to SW data structures (SWvertex), which the Mesa internal software rendering functions understand. So: translate and call SW function(s).
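To make the HWvertex/SWvertex idea a bit more concrete, here is a sketch with made-up layouts. `HW_VERTEX` is a hypothetical packed record in the style of a Direct3D transformed-and-lit vertex (x, y, z, rhw, packed color, specular, u, v); it is not Mesa's or nVidia's actual definition. `SWVertex` is an equally hypothetical software-side vertex with float color. The point is only the shape of the scheme: the fast path copies the packed bytes as-is, the fallback path unpacks them.

```python
import struct
from dataclasses import dataclass

# Hypothetical hardware vertex layout in the style of a Direct3D
# transformed-and-lit vertex: x, y, z, rhw, color, specular, u, v.
HW_VERTEX = struct.Struct("<4f2I2f")  # 32 bytes, little-endian


@dataclass
class SWVertex:
    """Hypothetical software-rasterizer vertex with float RGBA color."""
    x: float
    y: float
    z: float
    rhw: float
    rgba: tuple
    uv: tuple


def pack_hw_vertex(x, y, z, rhw, rgba, uv):
    """Build the bytes a driver would copy as-is into its command buffer."""
    r, g, b, a = (max(0, min(255, round(c * 255))) for c in rgba)
    color = (a << 24) | (r << 16) | (g << 8) | b  # packed as 0xAARRGGBB
    return HW_VERTEX.pack(x, y, z, rhw, color, 0, uv[0], uv[1])


def hw_to_sw(raw):
    """Fallback path: translate a hardware vertex for the software rasterizer."""
    x, y, z, rhw, color, _specular, u, v = HW_VERTEX.unpack(raw)
    a, r, g, b = ((color >> shift) & 0xFF for shift in (24, 16, 8, 0))
    return SWVertex(x, y, z, rhw, (r / 255, g / 255, b / 255, a / 255), (u, v))


if __name__ == "__main__":
    raw = pack_hw_vertex(10.0, 20.0, 0.5, 1.0, (1.0, 0.5, 0.0, 1.0), (0.25, 0.75))
    print(len(raw), "bytes:", hw_to_sw(raw))
```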

On top of all this, multiple primitives can now be relayed to the driver at once (instead of only single primitives in the early days): linestrips, for instance. This should massively reduce the software overhead of calling the hardware rendering functions, and enable a driver to instruct, for instance, the nVidia engine to draw a complete linestrip at once (in theory this can be placed in a single command). Apart from making higher framerates possible on slower CPU systems, I can even see the HW rendering itself speed up because of this.
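A back-of-the-envelope sketch of why this helps: drawing a linestrip of N vertices one segment at a time costs N-1 driver calls and sends 2(N-1) vertices, while a single strip call sends the N vertices once. `CountingDriver` is a hypothetical stub that only counts; it is not Mesa's driver interface.

```python
class CountingDriver:
    """Hypothetical driver stub that counts calls and submitted vertices."""

    def __init__(self):
        self.calls = 0
        self.vertices_sent = 0

    def draw_line(self, v0, v1):
        # Old style: one driver call per individual primitive.
        self.calls += 1
        self.vertices_sent += 2

    def draw_line_strip(self, verts):
        # Batched style: the whole strip in a single call.
        self.calls += 1
        self.vertices_sent += len(verts)


def render_strip_per_segment(driver, verts):
    """Emulate the old interface: split the strip into separate lines."""
    for v0, v1 in zip(verts, verts[1:]):
        driver.draw_line(v0, v1)


if __name__ == "__main__":
    strip = [(i, i * i) for i in range(100)]
    old, new = CountingDriver(), CountingDriver()
    render_strip_per_segment(old, strip)
    new.draw_line_strip(strip)
    print(f"per segment: {old.calls} calls, {old.vertices_sent} vertices")
    print(f"batched: {new.calls} call, {new.vertices_sent} vertices")
```

For a 100-vertex strip that is 99 calls and 198 vertices versus 1 call and 100 vertices; every shared vertex is sent once instead of twice.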

Well, all in all I would find it quite interesting to see what results we would get using Mesa 6.2.1. But not just yet, as I mentioned. Unless someone else creates a Mesa 6.2.1 nVidia driver, of course. ;-)

VIA UniChrome driver(s)

So, VIA first now: or at least an attempt at it. As I had promised Yellowtab to work on a VIA 2D driver anyway, it makes perfect sense to concentrate my efforts more on this. The 2D driver already exists (you can download the most recent versions from the Haiku build-factory), although it doesn't have 2D acceleration yet. It does set modes though, and it's gaining hardware overlay capability now (starting with V0.11). When overlay is complete I'll try to set up DMA 2D acceleration. As usual we'll just have to wait and see what I can pull off. Be warned though: I am taking my time. After all, it should remain something nice to work on...


This completes this news entry. I'll talk to you again I guess, but I can't yet say where that will be. I don't think I'll set up a VIA graphics driver entry on this homepage (unless 3D is going to work there): it just costs time to keep it up to date, while I even get spammed via bug report forms these days. I guess you can at least follow my commits to SVN via CIA. :-)


12 August 2005: While I am running a Mesa 6.2.1 compile on my system (several times ;-), I am writing this short update. Since yesterday the 3D driver sits in the current stable Mesa version. Well, a big part of it at least: while the actual rendering functions file (riva_prim.cpp) doesn't compile yet, and I also don't know exactly how to update it, the Alpha 4 driver-to-be now does swapbuffers and clearbuffers accelerated by hardware (clearbuffers since two minutes ago :).

What I have to do now is find out how to plug the 'T&L' render functions into Mesa, and how to plug in the texture stuff. Both subsystems have changed considerably, so I can't say yet how much time this will cost, or even whether I will be successful. From the looks of it, my best shot for the render functions might well be NOT to plug them into the HW T&L interface, but to use the software interface instead (it looks much more like it did in 'the old days'). The software part is called _swrast BTW. I need to do more research to see what my options are here.

While the texture files inside the driver compile OK, I can't activate them because the interface here has changed as well. Apparently less so though, given the successful compile. But anyway: I can't say what their current usability is or will be. From a pretest I did, it looks like I can test both functions (more or less) independently though, as keeping texturing disabled while inserting the render functions already accelerates and shows you some figures in Quake 2 (opponent character forms but not rooms, for instance).

Also interesting was plugging the Alpha 3 driver into Mesa 3.2.1 to see what it would do to speed. Quake2 then renders at 125fps in 640x480x16 mode, and the Teapot at 750fps, topping at 800 (exactly like with Mesa 3.4.2 on my system: P4@2.8 with NV18). 1024x768x32 remains at 23fps though. All in all, I kind of hope that Mesa 6.2.1 will combine 'the best of both worlds' and give me the best speeds of Mesa 3.4.2 and 3.2.1 or better, even on slow-CPU systems (my P3-500 gave a few fps more with Alpha 3 in Mesa 3.2.1: I call it Alpha 2.5 for now :-). We'll see what Mesa 6.2.1 is capable of. I hope.

Well, that's about it for now. Talk to you later...


8 August 2005: Today I've got Alpha 3 of the 3D nVidia driver for you, still accompanied by 2D driver 0.53. This new version is based on Mesa 3.4.2, which is still OpenGL 1.2. I am making it available anyway because numerous conformance fixes were done in Mesa since 3.2.1. The 3D driver also had some improvements in the meantime which make it slightly faster. All in all, this release will make apps run faster, especially on fast CPUs, while on slow CPUs Quake 2 in particular runs slower. I'll leave it up to you to choose your version for now: Alpha2-final remains online for the time being; you might prefer it for Quake2 on slow systems.

Some measured speeds:
Anyway: get it from the downloads page and have fun! In the meantime I'm working on directly upgrading to Mesa 6.2.1, but that's much harder to do due to the large revamp that was done inside Mesa between versions 3.4.2 and 3.5. So it will probably be a while before I report back on that...


8 July 2005 (completed 9 July): Benchmarks for alpha 2-final, status update, and personal thoughts.

In the past two weeks I've been focusing on testing the hardware boundaries of what we can do with the information available. While doing that, I collected some new benchmarks for you as well. Let's start with the latter: here are a few tables showing the speed that alpha 2 final reaches on the systems available to me for testing.

Table: accelerated alpha 2f DMA OpenGL 3D speeds on a Pentium 2 @ 350MHz, FSB at 100MHz, Dano (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT2-M64, 32Mb (NV05M64), 64-bit bus | 100 fps | 100 fps | 27.3 fps | 22.3 fps | 23.9 fps | 15.8 fps | 16.7 fps | 9.7 fps

Table: accelerated alpha 2f DMA OpenGL 3D speeds on a 'dual' Pentium 3 @ 500MHz, FSB at 100MHz, R5.0.1pro (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT1, 16Mb (NV04) | 125-130 fps | 125-130 fps | 33.8 fps | 30.5 fps | 29.6 fps | 22.6 fps | 21.4 fps | 8.7 fps

Table: accelerated alpha 2f DMA OpenGL 3D speeds on a Pentium 4 @ 2.8GHz, FSB at 533MHz, Dano (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT1 PCI, 16Mb (NV04) | 450-500 fps | 350 fps | 53.1 fps | 37.2 fps | 36.2 fps | 23.7 fps | 22.8 fps | 11.0 fps (draw errs)
GeForce2 Ti, 64Mb (NV15) | 600-630 fps | 600-630 fps | 90.0 fps | 54.5 fps | 65.4 fps | 35.8 fps | 41.3 fps | 20.2 fps
GeForce4 MX440, 64Mb (NV18) | 600-630 fps | 600-630 fps | 118.8 fps | 66.3 fps | 84.8 fps | 40.5 fps | 51.6 fps | 23.2 fps

You can see that these speeds top all previously measured speeds, indicating this is the fastest version of the driver yet. As usual, GLteapot's speeds on all setups but one show that software overhead limits the speed: every setup's hardware could go faster if software overhead were further minimized. Of course, we are getting closer and closer to the engine's limits. For the Quake speeds, I think we reach those limits on the Pentium 4 system, except maybe in 640x480x16 mode using the GeForce4 MX440. My feeling, however, is that we are almost there even in that mode.

OK, now have a look below at the Linux benchmarks I did on the P4 system using the closed-source official nVidia driver (latest available on SuSE 9.1, running KDE along with Quake2). You'll easily recognize that the hardware is capable of much more than we get...

Table: accelerated 3D speeds using Linux on a Pentium 4 @ 2.8GHz, FSB at 533MHz.

Card under test | Q2 640x480 @32bit | Q2 800x600 @32bit
GeForce4 MX440, 64Mb (NV18) | 280 fps | 250 fps
GeForce4 Ti4200, 128Mb (NV28) | 456 fps | 420 fps
GeForceFX 5200, 128Mb (NV34) | 400 fps | 370 fps

... so, we are definitely missing important stuff. Probably things like compressed Z-buffer, fast Z-clear, parallel use of (pixel) pipelines, and a better optimized engine setup (use of hardware (user) 'context' switching and the context cache). I can imagine that on pre-GeForce cards our driver is getting close to the Linux driver in speed, because those 'extra' features might well not exist on these old cards. Anyway, let's face it: we'll have to make do with what we've got. I myself am certainly not about to start reverse engineering closed-source Windows or Linux drivers: I simply don't want to put that much time into it.

Let's think in another dimension for a moment: supporting more cards at the current level, i.e. adding support for newer cards (NV2x, NV3x, and NV4x types). I've been trying to find out more about this as well in the past weeks. Thanks to the Poke utility written by Oscar Lesta aka Bipolar (for both Windows 9x and Haiku/BeOS), I was able to peek at the Windows setups for NV18, NV28 and NV34, and try things out in BeOS. While gaining a little more 'general' insight, I was not able to actually improve speed or card support, however. Besides, these newer cards no doubt use a different setup with the official drivers, as they support interesting new programmable features not available in the cards currently supported.

All in all, we won't be able to add more cards to the support list. Unless (preferably, as far as I am concerned) information comes up on how to set the NV2x, NV3x and NV4x into pre-GF (or NV1x?) compatibility mode, if such a thing even exists. You see, if this were possible, the current simple driver setup would suddenly support all nVidia cards there are. And while not being superfast, you would still see the speed go up with newer cards (compared to older cards). The only thing needed for this would be an update of the 2D driver's 3D init code, which would be feasible from a development-time perspective. And, as we now know, the speeds gained are already interesting enough to enable a whole 'new' breed of software on our platform...


Roadmap considerations / update

Let's peek back at what I originally wrote:

quote:
During development it's very important to take the smallest steps possible for the largest chance on success. This dictates that I should take the following roadmap to get to the goal: unquote.

Steps 1. and 2. are done. I've also looked at improving speed and card support. We will remain at the current maximum speed level (as it now is on faster CPUs). We might still improve speeds on slower CPUs. The cards supported are NV04 (TNT1) up to and including GeForce4 MX (NV18). From now on, the driver will block attempts to use newer cards.
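The cut-off can be pictured as a simple range check on the architecture. This sketch is hypothetical: a real driver would more likely test PCI device IDs, which I don't reproduce here. It just reads the number in the 'NVxx' names used on this page as hexadecimal and accepts NV04 up to and including NV18.

```python
import re

# Hypothetical support check based on the "NVxx" architecture names used
# on this page; the two digits after "NV" are read as a hexadecimal number.
FIRST_SUPPORTED = 0x04  # NV04, TNT1
LAST_SUPPORTED = 0x18   # NV18, GeForce4 MX


def is_supported(arch: str) -> bool:
    """Return True for NV04 up to and including NV18, False otherwise."""
    match = re.match(r"NV([0-9A-Fa-f]{2})", arch)
    if match is None:
        return False
    number = int(match.group(1), 16)
    return FIRST_SUPPORTED <= number <= LAST_SUPPORTED


if __name__ == "__main__":
    for arch in ("NV04", "NV05M64", "NV11", "NV18", "NV25", "NV34"):
        print(arch, "supported" if is_supported(arch) else "blocked")
```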

It's time to start work on step 3: switching to current Mesa. After that we'll switch to a real renderer add-on (step 4), as described on the Haiku homepage. Philippe Houdoin will come up with an example software driver that I'll use to help me create the add-on.

Most of the remaining work will be step 3. As the current Mesa is at OpenGL 1.5 level, while the driver is now at OpenGL 1.2, switching will require the driver to be expanded and rewritten where the Mesa interface is concerned. This interface changes with every new version of OpenGL, as new functions get added all the time. It might be a good idea to do the upgrade in steps, which means I would first have to make it work with Mesa 3.4.2, which is the newest OpenGL 1.3 compatible library. The change in OpenGL for this update (concerning the driver interface) is added multiple render/pixelbuffer support (if I remember correctly). The Z-buffer interface also has to be revised. This was to be expected, as the currently used Mesa version (3.2.x) shows some inconsistencies here, as I previously 'discussed' (I needed to literally copy code from Mesa into the driver to make it work).

Somewhere during the update to current Mesa, I'll probably add support for AA rendered triangles. Multitextured triangles also seem to be possible: I already did a pretest using the DX6 command, and I was able to use it to clear the Z-buffer successfully. This indicates to me that this hardware function is up and running in the engine, so we should be able to make use of it. Interestingly, this function has hardware support for stencil buffers, which might mean we can get that working as well. But I won't make promises; we'll just have to wait and see. I also benchmarked this DX6 function for speed already: and no, by itself it's not faster than the currently used rendering function (DX5, single textured).


Well, the last thing I can talk about today is the time I think I need to switch to current Mesa. Let's put it this way: I plan to do it before the year is over. First up is holiday season, so I'll probably have to do the actual work in the coming fall. And I'll be a bit slower than usual, as my body seems to be telling me I should (RSI-like trouble, which every computer user has to face at some point, it seems). But, as Philippe nicely pointed out: the journey *is* all the fun, right?


General driver performance considerations and personal thoughts

So. Are you disappointed? Well, don't be. We are lucky we got this far. It's a general problem every driver writer has these days: 'no-one' is giving out specs. In 'the old days' this was not so (our nVidia driver is based on work done by nVidia themselves; Matrox used to hand out specs up to and including the G400), but these days even Matrox no longer responds to your mails.
I think it might make sense for us to support hardware manufacturers who do give out specs, either via maintained open-source drivers or register-level specs. On our platform, even hardware considered slow would then perform much better than the fast hardware nVidia or ATI makes. Look at the Linux speeds with the closed-source nVidia driver: even if a slow card performs at one-fourth of a top-notch brand when both use full drivers, we would be able to use that slow card at vastly higher speeds with a 'supported' driver. And, of course, with more of its hardware features in use.
Anyway, we still need to support hardware from less 'friendly' manufacturers, as in practice this hardware is most commonly used. Hence my entire work for nVidia cards. I have to tell you, however, that I am looking more and more at 'cheap' and 'slow' hardware, as it would give me much more pleasure to work on that instead, if for once I got my hands on 'full specs'.

OK, that might sound a bit negative, but it's certainly not meant that way. I understand that manufacturers need to protect their intellectual property in this world. And I have really had a lot of fun doing my development already. For nVidia hardware I worked on almost every aspect available in the hardware (remember BeTVOut? :-). It's just that it sounds like much more fun doing the same for hardware I'd have much better specs on. Which does not mean I will even start on that though, mind you: I've learned that doing a full-blown graphics driver costs enormous amounts of time and energy.


Anyway, back to performance. The way I see it, it's very nice having a simple driver that just supports (more or less) what we already have. As the supported hardware function is very general, it doesn't really matter how sophisticated your app setup is: it will probably get accelerated anyway. Mesa very nicely takes care of software emulation of every other aspect you use, and the final rendering is then done by the accelerated driver. While the total setup is not superfast, it nicely accelerates anyway, and the amount of work needed for the actual driver is minimal. And, with the way Mesa works, you can add more functions over time, making it doable for just one person if need be. You see, even if we had full specs, we wouldn't have the manpower to put it all to use. Basic specs covering enough to do the primary setup would suffice in practice...

I, as an 'alternative OS' user, gladly pay more money for a system that performs a bit worse, speed- and feature-wise, if that means I get a system that's very stable and easily maintained. That's what I want out of my computers; otherwise, I'd rather just dump them in the bin. More BeOS (style), anyone? Cheers!


23 June 2005: OK, here's 3D add-on version Alpha 2-final. Get it from the downloads page and have fun! While you do that, I'll do some more benchmarking and post the results here. A small preview: Hope you like it. Meanwhile: if you test this driver, please provide feedback as usual!

Thanks in advance. Talk to you later...


18 June 2005: I'm finally readying the alpha2 driver for release: it will be accompanied by nVidia 2D driver 0.53. Below is a screenshot of GLteapot as it's running on my P4 2.8GHz now :-)
Well, actually, the speed is a bit higher: it's difficult to grab a correct screenshot, as taking the shot slows things down. The mean speed is about 470fps, topping at 500fps for short moments with the default settings (in both 16 and 32 bit color).


GLteapot spinning on a P4 @ 2.8GHz, with GeForce4 MX440 @ AGP4x.


Since the benchmarks shown before, the driver has had a few more updates, further speeding up rendering on all systems. Anyway, the speed is 'set' for alpha 2 now, so I'll show you a few new benchmarks here. Also shown (for reference) are the older DMA speeds that were reached. On top of that, I also benchmarked PCI mode explicitly, to show the gain AGP mode now gives us (on a number of systems).

Table: accelerated DMA OpenGL 3D speeds on a 'dual' Pentium 3 @ 500MHz, FSB at 100MHz, R5.0.1pro (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT1, 16Mb (NV04), old speed (AGP mode) | 110 fps | 110 fps | 27.4 fps | 24.0 fps | 23.4 fps | 17.9 fps | 17.2 fps | 8.2 fps
TNT1, 16Mb (NV04), new speed (PCI and AGP mode) | 120 fps | 120 fps | 31.8 fps | 29.3 fps | 28.4 fps | 22.3 fps | 21.1 fps | 8.7 fps

Table: accelerated DMA OpenGL 3D speeds on a Pentium 4 @ 2.8GHz, FSB at 533MHz, Dano (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
GeForce4 MX440, 64Mb (NV18), old speed (AGP mode) | 400 fps | 400 fps | 91.5 fps | 59.3 fps | 72.6 fps | 38.0 fps | 47.2 fps | 22.4 fps
GeForce4 MX440, 64Mb (NV18), new speed (PCI mode) | 320 fps | 320 fps | 84.2 fps | 56.7 fps | 68.9 fps | 37.1 fps | 46.1 fps | 22.3 fps
GeForce4 MX440, 64Mb (NV18), new speed (AGP mode) | 470 fps | 470 fps | 104.5 fps | 64.4 fps | 81.3 fps | 40.2 fps | 50.8 fps | 23.2 fps


So: Quake 2 now runs timedemo 1 at 105fps in 640x480 @ 16bit on the GeForce4 MX440: I broke the 100fps barrier! Of course, the driver is still very slow compared to the official Windows drivers... But I love it anyway. ;-)

Speaking of which: it's interesting to benchmark Quake2 on Windows using the same card in the same system. It tells you how much speed could be gained 'in an ideal world'. For the GeForce4 MX440 it turns out that even in 1024x768 @ 32bit, 70 fps is reached, and probably even more (Windows syncs to retrace, so the screen's refresh rate determines the maximum fps you'll get). You can clearly see by now that with the hardware functions we currently use, we will never reach that speed. The card's triangle rendering power with the function in use is simply too low. So how do they get that speed? They must be using another triangle function, and/or several 'triangle rendering engines' in parallel. Quake 2, after all, only draws with triangles and is already completely accelerated on BeOS now. And we know we are sending the commands to the engine fast enough, or our rendering speed would never reach those 100fps in low-res mode (the number of 3D commands does not depend on the screen's resolution!).

This 'info' gives me thoughts about where to look next for more speed. Is the 2D blitting function becoming a big bottleneck in high-res modes? I'll try flipping buffers (only possible for fullscreen apps though). Can I use 'parallel processing' in the engine for the currently used rendering functions? I'll try to 'rotate' entering commands at different offsets in the engine's hardware function. Another approach is sending more than one triangle at once to the engine: but does Mesa support that? All in all, I am not done with this yet. But these experiments will have to wait until a later date: first up is the alpha2 release. That release, BTW, uses Mesa 3.2.1: as someone pointed out to me, that version contains numerous bugfixes for OpenGL conformance compared to Mesa 3.2, and the driver could simply be 'dropped in place'. Oh, and no resizing of buffers yet: Mesa 3.2.x seems to contain an internal error preventing that from working correctly. Mesa 6.2.1 is able to do this though...
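For fullscreen apps, flipping replaces the full-screen copy with a change of the scanout start address. Here is a minimal model, assuming two buffers at fixed offsets in card memory. `set_display_start` is a hypothetical stand-in for programming the CRTC start address; a real driver would also have to wait for the engine to finish rendering and (ideally) for vertical retrace before flipping, which this model leaves out.

```python
class FlipChain:
    """Two-buffer page flip model: the app renders into one buffer while the
    other is scanned out; a swap just exchanges their roles."""

    def __init__(self, buffer_size: int, base: int = 0):
        self.offsets = [base, base + buffer_size]
        self.front = 0  # index of the buffer currently being displayed
        self.display_start_log = []

    @property
    def back_offset(self) -> int:
        """Card memory offset the renderer should draw into."""
        return self.offsets[1 - self.front]

    def set_display_start(self, offset: int):
        # Hypothetical: a real driver would write the CRTC start address here.
        self.display_start_log.append(offset)

    def swap(self):
        """Show the finished back buffer; no pixels are copied."""
        self.front = 1 - self.front
        self.set_display_start(self.offsets[self.front])


if __name__ == "__main__":
    chain = FlipChain(buffer_size=1024 * 768 * 4)
    for frame in range(3):
        print(f"frame {frame}: render at 0x{chain.back_offset:07x}")
        chain.swap()
```

The cost of a swap is then a single register write instead of a copy whose size grows with the resolution, which is why flipping looks attractive for the high-res modes.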

OK, back to work: Do some final tidbits and release driver. :-)


(Completed) 18 June 2005: I have some benchmarks I'd like to share with you. These benchmarks were taken with nVidia 2D driver 0.49 as it is in SVN right now, combined with the DMA version of the alpha1 3D add-on, which right now has exactly the same functionality and behaviour as alpha1 final as I released it some time ago. Well, I have to be honest: the DMA 3D add-on now nicely supports the NV17/NV18, so you can finally really use them. Cards with an NV18 engine are, for instance:
I expect the mobile (laptop) versions to work as well, and hopefully also the Quadro types. As usual feedback will be needed to determine the final status of the driver on the cards though.

Along with the 'normal' benchmarks, I also have some comparison benchmarks for you that show differences in speed caused by certain new 'items' that are now in the driver set. Indeed, there are several 'big' changes in engine setup in the 2D driver now, all needed to optimize speed for 3D. Of course, this modified setup also speeds up 2D a bit more in general. For instance, it's funny to see that BeRoMeter 1.2.6 no longer correctly measures speed for 'Graphics Rectangles Unflushed Filled' on most systems: it seems to suffer from a variable rollover in the calculations somehow.

Anyway, here are the results as they are now! By the way, please bear with me a bit more before I do a new release: I am still looking into a few possible additions to make the driver behave a bit better than alpha1 did. It won't be very long though; I'd say I might as well just do separate releases for these sorts of things...


Current benchmarks, using all new setup 'items' in the driver set.

If you look at the tables below, you can observe some interesting things. For instance, on the dual P3-500 system, you can see that I'm not able to fully load the cards in lower resolution modes (Quake2). It doesn't matter how fast the card is: you won't get more than 30fps out of it. Compare the NV15 with its speeds on the P4 2.8GHz, and you'll see that this card can do much more if the system CPU can get commands to it fast enough. When you select a high-res mode, however, you can see that the card becomes the bottleneck and the speed difference between the two CPUs is mostly gone. Look at 1024x768 @ 32 bit on both systems for the NV15 for proof.

For GLteapot the same sort of thing applies. If the card is fed fast enough, you should see speed differences between 16 and 32 bit mode for a given card. In most cases, on most systems, you do not see this however. Another hint: if you enable 'perspective' in the teapot app's menu, you'll see the teapot speed up from 400 to 500fps on the P4 2.8Ghz, for example.

These things tell us that if we are able to minimize the software overhead for sending commands to the cards, we can speed up a lot of hardware combinations further. Luckily, Mesa 6.2 is optimized much better than Mesa 3.2 (6.2's software rendering runs at 150-180% of 3.2's speed). So I am hoping that the planned switch of Mesa version will further improve our rendering speeds.
Also, the driver itself could probably be faster. Take point and line rendering for example: these are done by treating these items as a set of triangles. This means we have to feed more information into the engine than would strictly be needed for these functions. In theory, the engine also has specialized commands for such functions, minimizing the overhead, and probably being executed faster as well. The downside is that these hardware commands are less universal, making it necessary to fall back to the current scheme if the driver determines that a certain point or line size is outside the engine's hardware capabilities. This certainly explains why the point and line functions currently work the way they do: the UtahGLX driver was in the 'beginning phase' of its life-cycle.
If you instruct the teapot app not to use filled polygons, you see the rendering speed drop instead of speed up: this is proof of the relatively large extra overhead the line function suffers from. Note however that its speed lies a bit above software rendering speed in DMA mode, while in PIO mode it was much slower.
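The CPU-versus-card bottleneck described above can be captured in a very simple model. This is a minimal sketch assuming that CPU work and card work overlap perfectly, so the slower of the two sets the frame time; the millisecond figures below are made-up illustrations, not measurements from the tables.

```c
#include <assert.h>

/* Toy bottleneck model (an assumption, not measured driver behaviour):
   CPU command generation and card rendering overlap, so whichever side
   needs more time per frame determines the frame rate. */
double estimated_fps(double cpu_ms_per_frame, double card_ms_per_frame)
{
    assert(cpu_ms_per_frame > 0.0 && card_ms_per_frame > 0.0);
    double frame_ms = cpu_ms_per_frame > card_ms_per_frame
                          ? cpu_ms_per_frame
                          : card_ms_per_frame;
    return 1000.0 / frame_ms;
}
```

With a hypothetical 40 ms of CPU work per frame, estimated_fps(40.0, 10.0) and estimated_fps(40.0, 5.0) both give 25 fps: a faster card changes nothing. Only once the card's share of the work exceeds the CPU's, as in the high-res modes, does the card start to matter.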

OK, below you'll find the test results for DMA mode. Even further down, you'll encounter comparisons for several driver-setup aspects, and a few comparisons between DMA and PIO mode. Enjoy.


Table: accelerated DMA openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT2-M64, 32Mb (NV05M64) 64-bit bus | 85 fps | 85 fps | 23.5 fps | 19.3 fps | 20.8 fps | 14.8 fps | 15.3 fps | 9.9 fps

Table: accelerated DMA openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT1, 16Mb (NV04) | 110 fps | 110 fps | 27.4 fps | 24.0 fps | 23.4 fps | 17.9 fps | 17.2 fps | 8.2 fps
GeForce2 MX400, 32Mb (NV11) | 110 fps | 110 fps | 29.2 fps | 25.7 fps | 28.0 fps | 20.2 fps | 21.5 fps | 11.7 fps
GeForce2 Ti, 64Mb (NV15) | 110 fps | 110 fps | 30.0 fps | 29.0 fps | 29.5 fps | 26.5 fps | 27.4 fps | 18.7 fps

Table: accelerated DMA openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
TNT1 PCI, 16Mb (NV04) | 250 fps | 200 fps | 35.0 fps | 26.5 fps | 25.8 fps | 18.3 fps | 17.7 fps | 9.7 fps (drawing errors)
TNT2-M64, 32Mb (NV05M64) 64-bit bus | 300 fps | 190 fps | 37.3 fps | 23.0 fps | 25.8 fps | 15.5 fps | 16.1 fps | 9.5 fps
TNT2, 32Mb (NV05) | 400 fps | 300 fps | 54.8 fps | 38.7 fps | 39.6 fps | 25.9 fps | 26.2 fps | 15.7 fps
GeForce2 MX400, 32Mb (NV11) | 400 fps | 400 fps | 55.2 fps | 31.6 fps | 41.4 fps | 21.7 fps | 23.5 fps | 11.7 fps
GeForce2 Ti, 64Mb (NV15) | 400 fps | 400 fps | 77.3 fps | 50.7 fps | 59.2 fps | 34.1 fps | 39.1 fps | 19.6 fps
GeForce4 MX4000, 128Mb (NV18) 64-bit bus | 400 fps | 400 fps | 71.0 fps | 37.4 fps | 49.0 fps | 22.6 fps | 28.8 fps | 12.5 fps
GeForce4 MX440, 64Mb (NV18) | 400 fps | 400 fps | 91.5 fps | 59.3 fps | 72.6 fps | 38.0 fps | 47.2 fps | 22.4 fps


Note please (for all three tables above):
Note also:
NV20 and later cards currently don't work: tested NV28 (GeForce4 Ti4200), NV34 (GeForce FX5200) and NV34Go (GeForce FX5200 in a laptop). It remains to be seen if I can get these up and running.



Comparisons for several driver-setup aspects: DMA versus PIO.

I compared both the 'alpha 1' and 'alpha 2' 3D drivers on two slower systems to find out if DMA mode still gains us speed there, even though DMA mode needs a bit more CPU programming overhead. As you can see from the numbers below, a 25-30% gain can still be reached in DMA mode (Quake 2), while relatively simple things (GLteapot) render a bit slower. Overall DMA mode is faster though, and it also holds more promise for speed in the future.


Table: accelerated DMA versus PIO mode openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

Card under test | Q2 800x600 @16bit | Q2 800x600 @32bit
TNT2-M64, 32Mb (NV05M64) 64-bit bus in PIO mode | 16.1 fps | 11.8 fps
TNT2-M64, 32Mb (NV05M64) 64-bit bus in DMA mode | 20.8 fps | 14.8 fps
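The '25-30% gain' quoted above can be checked against the Pentium 2 numbers. A small helper that turns a PIO/DMA pair of frame rates into a percentage gain (the function name is just for illustration):

```c
#include <assert.h>

/* Relative gain of DMA over PIO in percent: positive means DMA is faster,
   negative means DMA is slower. */
double dma_gain_percent(double pio_fps, double dma_fps)
{
    assert(pio_fps > 0.0);
    return (dma_fps / pio_fps - 1.0) * 100.0;
}
```

dma_gain_percent(16.1, 20.8) gives about 29% and dma_gain_percent(11.8, 14.8) about 25% (Quake 2 at 800x600, 16 and 32 bit), matching the quoted range. For the GLteapot on the dual P3, dma_gain_percent(120.0, 110.0) comes out at about -8%, the 'bit slower' case, while the Quake 2 gains on that machine work out to roughly 10-26%.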

Table: accelerated DMA versus PIO mode openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

Card under test | teapot @16bit | teapot @32bit | Q2 640x480 @16bit | Q2 640x480 @32bit | Q2 800x600 @16bit | Q2 800x600 @32bit | Q2 1024x768 @16bit | Q2 1024x768 @32bit
GeForce2 MX400, 32Mb (NV11) in PIO mode | 120 fps | 120 fps | 26.5 fps | 20.5 fps | 23.5 fps | 16.2 fps | 17.1 fps | 10.2 fps
GeForce2 MX400, 32Mb (NV11) in DMA mode | 110 fps | 110 fps | 29.2 fps | 25.7 fps | 28.0 fps | 20.2 fps | 21.5 fps | 11.7 fps


Comparisons for several driver-setup aspects: enabled/disabled AGP transfers, MTRR-WC and 1Mb cmd buffer.

I tested the speed differences for each of the aspects AGP transfers, MTRR-WC'd cmd buffer and 1Mb sized cmd buffer separately, to see if what I set up really works as it should. The tests were done on the P4 2.8GHz with an original TNT2 AGP. Note that even the results for the fully enabled driver are slower than those in the extensive card-type comparison benchmarks above, as these aspect tests were done on the first stable intermediate version of the 3D/2D driver set. You should also keep in mind that the speed differences will probably become bigger over time, as the driver gets more and more optimized. It's probably best to view the results as hints that the theoretical setup is correct.

Table: speed differences for different aspects of the DMA driver on a Pentium 4 @ 2.8GHz, FSB at 533Mhz, Dano (gcc 2.95.3).

Aspect combination | teapot @16bit | Q2 640x480 @16bit
Full driver (AGP transfers, cmd buffer is MTRR-WC @ 1Mb) | 360 fps | 53.2 fps
Full driver minus MTRR-WC | 230 fps | 49.2 fps
Full driver minus AGP transfers | 290 fps | 50.3 fps
Full driver minus MTRR-WC and AGP transfers | 210 fps | 47.3 fps
Full driver in PCI mode | 290 fps | 50.3 fps
Full driver minus MTRR-WC in PCI mode | 210 fps | 47.3 fps
Full driver minus 1Mb cmd buffer (now 32Kb) (not stable!) | 360 fps | 51.9 fps
Full driver minus AGP transfers and 1Mb cmd buffer (now 32Kb) (not stable!) | 290 fps | 49.0 fps

Note please that although the use of AGP transfers indeed speeds up rendering on the machine tested above, it did not do so on the old P2 system @ 350Mhz: there, PCI and AGP transfers were both running at the same effective speed. The same applies to the dual P3@500Mhz machine over here. Of course, you should keep in mind that these two systems have an old AGP interface: version 1.0. The maximum speed those run at is AGP 2x mode, while the P4 @ 2.8Ghz runs at AGP 4x mode. On top of that, these CPUs are simply not fast enough to really fill up the command buffer, so we could not notice AGP 2x mode at work: the 3D driver version tested does not yet benefit enough from AGP transfers for it to show here. Of course, that might change in the future though...
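For reference, the peak rates behind the 'AGP 2x versus 4x' remark follow from the AGP bus itself: a 32-bit bus with a 66.66 MHz base clock that transfers 1, 2 or 4 times per clock (8 times with AGP 3.0). These are theoretical peaks; real throughput, and what this driver actually achieves, will be lower.

```c
#include <assert.h>

/* Theoretical peak AGP transfer rate in MB/s (10^6 bytes/s).
   AGP uses a 32-bit (4-byte) bus with a 66.66 MHz base clock;
   the 2x/4x/8x modes transfer that many times per base clock cycle. */
double agp_peak_mb_per_s(int multiplier)
{
    assert(multiplier == 1 || multiplier == 2 ||
           multiplier == 4 || multiplier == 8);
    const double base_clock_mhz = 66.66;
    const double bus_bytes = 4.0;
    return base_clock_mhz * bus_bytes * multiplier;
}
```

This gives roughly 266 MB/s for 1x, 533 MB/s for 2x and 1066 MB/s for 4x, so the P4's AGP 4x link has twice the peak bandwidth of the AGP 2x link in the P2/P3 machines. As noted above, that only pays off once the CPU can fill the command buffer fast enough.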


When you look at the results, you can observe some interesting things:

29 May 2005: Hi there. Well, today I have some splendid news for you! I was very quiet these last weeks because I was very busy doing something I feared might not be possible: making the change to DMA mode! Well, you guessed it: it's official! I have DMA mode up and running smoothly...

I'll update this page asap with more stuff, for now just some facts:
Talk to you later! Wow!!


9 May 2005: Today I am releasing nVidia 2D driver 0.45. This version finally fixes some trouble with older cards out there: the typical trouble that kept some people using the Be nVidia driver before. We are talking about the trouble I named 'bandwidth trouble'. All in all, this driver version is the best one to use alongside the 3D add-on 'alpha 1-final' which I released not too long ago. Get the driver(s) from the Downloads page and have fun!
Oh, by the way: here are the GeForce 2 Ti benchmarks I did with the new 2D driver. These are new 'high scores' for me.. :-)

Table: accelerated openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit
Mesa 3.2 software, no AGP FW | 190-210 fps / 1.0x | 150-160 fps / 0.78x | 2.8 fps / 1.0x | 2.8 fps / 1.0x
Mesa 3.2 software, AGP4x + FW | 190-210 fps / 1.0x | 190-210 fps / 1.0x | 2.8 fps / 1.0x | 2.8 fps / 1.0x
GeForce 2Ti, 64Mb (NV15) | 200-220 fps / 1.1x | 200-220 fps / 1.1x | 45.3 fps / 16.2x | 35.0 fps / 12.5x


5 May 2005: 3D add-on alpha 1-final released!

Today I was able to upload the source and binary files for BeOS R5, Dano, Zeta and Max edition. Get them from the Downloads page and have fun! Oh, and don't forget to provide feedback if possible: that would be appreciated ;-)


3 May 2005: Well, finally I am able to release a first alpha version of the 3D nVidia add-on driver including the Mesa 3.2 library (named alpha 1-final). This driver requires 2D driver 0.43 in order to work (otherwise your system will hang). On top of that, you need to instruct the 2D driver to use PIO mode for acceleration: do so in nv.settings. Make sure you have the 0.43 2D driver running in PIO mode before you run any 3D application with this driver/library (reboot!).

Installing the 3D add-on is nothing more than moving or copying the two libraries libGL.so and libGLU.so into the ~/config/lib folder. For convenience, I added the 2D driver 0.43 including a preset nv.settings file in the downloads. Also included are precompiled demo applications.

If you hang your system while testing, hit ALT, CTRL and DEL simultaneously (and keep them pressed down for a few seconds) to reboot. If it turns out you cannot run the driver/library, just delete the files libGL.so and libGLU.so from your ~/config/lib folder and you should be OK. You don't even need to reboot, as those libraries are only loaded while you run an app that uses them.

If you test this 3D add-on you are encouraged to provide feedback. Feedback will tell us best what the actual usability of this attempt is in the end, and where additional fixes are needed. Feedback can be sent by Email or by talkback on the BeBits entry that I'll create ASAP.

OK, below you'll find a table containing the status of the driver, followed by a table indicating the status of that driver for several applications. I hope it's all of use to you! Have fun...


Table: Driver version 'alpha 1-final' status.

Accelerated libraries:
  • libGL.so is accelerated and contains the 3D add-on driver.
  • libGLU.so is a utility library that runs on top of libGL.so; it contains no internal acceleration. If libGLU accelerates, it does so by using libGL.
Engine access type: PIO mode. I'll try to get DMA mode up in the future.
Supported color depths: 16 and 32 bit modes are fully working, 15 bit mode is partially working. 8 bit mode isn't implemented yet. At least 15 bit mode will be completed, but not before switching to Mesa 6.2.
Supported cards: NV04 (TNT 1) up to and including the (slow performing) NV18 (GeForce 4MX). I'll try to get more modern cards up in the future.
3D rendering functions: straight and AA (anti-aliased) point and line functions. Straight triangle function only: Mesa 3.2 doesn't support AA triangles. Support for AA triangles will be set up if possible (engine command is known) after we switch to Mesa 6.2. The depth (Z) buffer is fixed at 16 bit depth.
3D texturing: single texturing is supported. Multiple texturing will be set up later if possible (engine command is known).
3D states:
  • Line and polygon stippling is not supported (the driver falls back to software rendering mode, so it renders OK but is slow);
  • Stencil buffering is not supported (the driver falls back to software rendering mode);
  • Drawing into the frontbuffer is not yet supported (the driver falls back to software rendering mode);
  • Single buffering is not yet supported (the driver will probably not work at all).
3D driver's 2D functions:
  • Swapbuffers() is accelerated for both BWindow (non-direct) and BDirectWindow (direct) modes;
  • Triple buffering still needs to be set up (Be's libraries do that!). Triple buffering is an easy fix for 'single buffered' contexts (which will actually be double buffered then), and it helps out slow-rendering apps when you move their windows off- and then onscreen again.
    I even suspect the non-direct BWindow mode might then work fully correctly (position and clipping validity during drawing): it will use BGLView's Invalidate() and Draw() functions for 'back' to front rendering instead of doing that in Swapbuffers(). Note that non-direct mode would still be a hack though!;
  • Swapbuffers() uses 2D blits, even for fullscreen. Literal swapping for fullscreen apps will be set up later (engine commands are known);
  • In direct BDirectWindow mode, the DirectConnected() clipping_info turns out to be working after all! (apart from the BMenuBar error, for which I have a workaround in place). This means you can drag direct windows like a madman without errors appearing onscreen ;-)
  • In non-direct BWindow mode, drawing errors will be made if you move the application window too fast, or if you move other windows over the application window too fast. Hopefully the triple buffering scheme mentioned above will fix these errors...
  • Scaling is not yet supported: if you resize output windows you will see repeating patterns or rubbish in the extra 'room' (scaling up), or you will see only part of the application's output (scaling down). This will be fixed in a future release (engine command should be known).
BView function activation: Be's flags B_WILL_DRAW and B_PULSE_NEEDED are required for BGLView: some applications rely on them without explicitly setting them. The driver therefore adds these flags (hardcoded) when an application creates a BGLView.


Table: Status for tested applications.

App name | Location | Status | Mode | Faults | Driver or library failsafe
3Dlife | Optional sample code | Working as is | BWindow/non-direct mode | Doesn't call glViewport() | The driver calls glViewport() while servicing a backbuffer clear command when it detects that no backbuffer was created before. glViewport() takes care of creating it, among other things.
GLteapot | Optional sample code | Working after relinking | BDirectWindow/direct mode | None | Relinking against both libGL.so and libGLU.so is required, because some items it uses are in libGL.so these days, while they were in libGLU.so at Mesa 3.2 'time'. We are 'going back in time'.
Demo | Mesa 3.2 | Working after fix | BWindow/non-direct mode | Doesn't call glPopMatrix() | Added glPopMatrix() to the application. A library failsafe can be done in theory, but it requires additional code: it would need to check whether the stack still contains a matrix when Swapbuffers() is called (or so). Such checking would be state-dependent.
Sample | Mesa 3.2, BeBook: openGL kit, BGLView | Working as is | BWindow/non-direct mode | None | None
GLQuake for BeOS R4.5 running on R5/Dano | http://www.aixplosive.de/projects.html | Partially working as is | BDirectWindow/direct mode | Yet unknown | This application doesn't work correctly for a number of reasons:
  • It sometimes renders directly to the frontbuffer, which the driver doesn't really support yet;
  • It crashes sometimes: this seems to be a Mesa 3.2 fault, as Be's lib works OK while the full software Mesa 3.2 fails as well;
  • Some bitmaps are rendered in wrong colors: this seems to be a driver fault, as rendering triangles requires more state-checking (for getting color info, among other things probably);
  • The picture 'flickers': half the time Swapbuffers() is executed while still rendering, so sometimes only partially rendered pictures are shown. This seems to be a Mesa 3.2 fault as well: Be's lib is OK, while the full software Mesa 3.2 fails as well.
Quake II V3.20 | http://www.bebits.com/app/1712 | Working as is, with some display errors caused by Mesa 3.2 (full software Mesa 3.2 has the same errors) | BDirectWindow/direct mode | Doesn't call LockGL() | The driver calls LockGL() in the BGLView constructor if gl_get_current_context() returns NULL. This probably needs to be done elsewhere ('later') in the driver, so that the application still gets the chance to make a context current by calling LockGL() itself at some point.
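The 'driver or library failsafe' column for 3Dlife and Quake II boils down to two lazy checks. The sketch below models them with a hypothetical, heavily simplified context struct: set_viewport() stands in for glViewport() (which, per the table, also creates the backbuffer), and make_current() and current_context stand in for LockGL() and gl_get_current_context(). None of these names are the driver's real internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified GL context; not the driver's real struct. */
typedef struct {
    bool has_backbuffer;
    int  viewport_w, viewport_h;
} gl_context;

/* Stands in for the state behind gl_get_current_context(). */
static gl_context *current_context = NULL;

/* Stands in for LockGL(): makes the given context current. */
void make_current(gl_context *ctx) { current_context = ctx; }

/* Stands in for glViewport(): per the table, it also creates the backbuffer. */
void set_viewport(gl_context *ctx, int w, int h)
{
    ctx->viewport_w = w;
    ctx->viewport_h = h;
    ctx->has_backbuffer = true;
}

/* Failsafe for 3Dlife: a clear without a prior glViewport() call
   first sets the viewport (and thus creates the backbuffer). */
void clear_backbuffer(gl_context *ctx, int win_w, int win_h)
{
    assert(ctx != NULL);
    if (!ctx->has_backbuffer)
        set_viewport(ctx, win_w, win_h);
    /* ...the real driver would issue the hardware clear here... */
}

/* Failsafe for Quake II: if nobody made a context current yet,
   make this view's context current. */
void on_view_created(gl_context *ctx)
{
    if (current_context == NULL)
        make_current(ctx);
}
```

A second clear leaves the viewport alone because the backbuffer already exists. As the Quake II row notes, doing the make-current check at construction time is probably too early; the sketch only shows the check itself.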


As you can see from all the info listed above, the general idea for a 3D library apparently is to make it work even if the application developer forgets to do some things that are officially needed. At least this is the road Be took. The downside of that is, of course, that certain mistakes that would otherwise be easily located might never be found now...

Having said that, I'll try to be as compatible as possible with the Be (software) libraries (BGLView). Oh, and I'll post some more technical info here about things I encountered during this last phase of developing alpha 1-final, as promised earlier. Talk to you later!


24 April 2005: While trying to finish up for a first alpha release, I decided to test two other demo apps for BeOS: 3Dlife (optional sample code) and the app described in the BGLView section of the BeBook as a sample for openGL use. Up to now I only tried Quake2 and the GLteapot, which both work just fine. (Quake2 and the GLteapot are both 'direct mode' apps, while these other two demo apps aren't, it turns out.) Well, I bumped into something I did not expect: again, I am having trouble with BGLView. Of course, I got some warnings in the past from an openGL user on the BeOS relating to this trouble, but back then I did not recognize it for what it was. As before, I can tell you there's no better way (for me) to find out about some stuff than by experiencing it first hand...


BGLView and BDirectWindow/BWindow (or: direct mode, yes or no)

You can use BGLView in a normal BWindow or in a BDirectWindow, as you might know. Using it in a BDirectWindow is called direct mode. When you want to use this mode, you have to explicitly tell BGLView that it is used in a BDirectWindow by calling its member function EnableDirectMode(bool enabled). From my point of view the theoretical difference between those two modes is:
Well, from an application's perspective, these two modes are nice to choose from, I guess. But here's the problem I am having (as a driver writer): there's no such thing as a non-direct mode from the driver's point of view! The driver always needs to do the hardware blitting into the 'destination' window (openGL: the front colorbuffer) itself. It makes absolutely no sense not to let the driver do that: after all, we want to accelerate. Besides, I have learned that if you feed the app_server a faked BBitmap (faked in that it's a class derived from BBitmap which provides the app_server with an address in graphics RAM instead of in main memory, where normal BBitmaps reside), the app_server won't draw anything at all. I guess I can understand that: you could see it as a failsafe precaution from the app_server's point of view. (BTW: getting such a 'faked' bitmap to work wouldn't help in single-buffered mode, of course.)

Anyway, the conclusion is that, the way BGLView is set up, acceleration cannot officially be done for non-direct mode. I have a way around this though: the BGLView workaround code I mentioned earlier can help us out here too. I already confirmed that the 3Dlife demo runs OK, although I have to fix one other problem I am having with it. Of course, even if the BGLView clipping_rects error (in DirectConnected()) were solved, we would still need this workaround hack for accelerated non-direct mode. Hence, I have some recommendations to make:
Well, back to work I guess: I need to finalize the driver's code so it works OK in non-direct mode. And there are a few other things left to fix for the non-direct demo apps mentioned above: both problems are probably not class-related. You might already have guessed, BTW, that I may have to postpone the first release a bit: as soon as I have a final fix for those last issues I'll inform you. Oh, the nVidia 2D driver needs one more flag as well, to account for the info that is missing for 3D acceleration in BGLView's non-direct mode. Good thing I did not release that driver yet ;-)


20 April 2005: I just added some more benchmarks to the 12 April post (below) today. This time I tested a Pentium 2 running at 350 Mhz. It's nice to see what a card's acceleration engine can do for you ;-)


19 April 2005: Sorry it's been so long since I posted here. I have to confess Email responses are a bit slow as well.. It's for a good cause, so I hope you'll bear with me for say a week more... :-)

That's right: I'll be releasing the first alpha version of the 3D driver within a week now. I completed the remaining HW functions (which BTW doesn't speed things up further; lines and points are just accelerated as well now). Also, I am perfecting the driver's behaviour for things like mode switches from the Screen prefs panel or from within an app (Quake2), and workspace switches. Works well now. The BGLView workaround code can make minor errors after all if an app renders quickly, but for normal use it works very well. I have to say I am very pleased with what I got going in this short amount of time!

OK, more info will follow, including more writing/testing/technical info. For now time's up again. I'll finish by saying something about the first alpha release coming up: Talk to you later...


14 April 2005: I still have not done any more development on the driver, but I have been testing a lot more. In the 12 April post below you'll now find more extensive benchmark results, for both a Dano and an R5 system. All tested setups are rock solid, and both R5 and Dano behave the same with my 'workaround' setup for the BGLView clipping trouble I encountered. It's interesting to see how much a relatively slow system speeds up with hardware accelerated openGL, BTW...

Well, some people were apparently (more or less) shocked by the low framerates on GeForce4 MX cards. Personally, I was wondering too what could cause this behaviour. Anyway, it's nice to get feedback: it got me searching a bit more for a cause. I cold-started the MX4000 card to see what CORE and RAM clocks it gets: those seem to be OK (core = 275Mhz, memory = 265Mhz: not top-notch, but high enough). After looking at nVidia's site for more precise specs, I think it may be the LMA2 (Lightspeed Memory Architecture) messing things up here. I am assuming that feature isn't enabled 'by default', which possibly kills the largest part of the memory bandwidth these cards effectively have via LMA (LMA's Z-buffer compression feature seems interesting, for instance). Of course, we unfortunately have no specs, so we have to live with it for now. Sorry about that. :-/
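To put the MX4000's clocks in perspective: theoretical memory bandwidth is bus width (in bytes) × memory clock × transfers per clock. The 64-bit bus and the 265 MHz memory clock come from these posts; whether the card's RAM is DDR (2 transfers per clock) is an assumption, so the sketch takes it as a parameter.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in MB/s (10^6 bytes/s).
   transfers_per_clock is 1 for SDR memory and 2 for DDR memory. */
double mem_bandwidth_mb_per_s(int bus_bits, double mem_clock_mhz,
                              int transfers_per_clock)
{
    assert(bus_bits > 0 && bus_bits % 8 == 0);
    assert(transfers_per_clock == 1 || transfers_per_clock == 2);
    return (bus_bits / 8) * mem_clock_mhz * transfers_per_clock;
}
```

mem_bandwidth_mb_per_s(64, 265.0, 2) gives 4240 MB/s (2120 MB/s if the RAM were single data rate). A hypothetical 128-bit card at the same clock would have exactly twice that, which fits the DMA tables above, where the 64-bit-bus cards fall furthest behind in the 32 bit and high-resolution Quake2 modes.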

OK, back to (coding) work now I guess: I should finish up on the remaining acceleration commands that exist in the driver. Until next time ;-)


12 April 2005 (extended 14 April and 20 April): Quake2 and GLTeapot run accelerated!!!!

Hi there! After doing some more testing and benchmarking since yesterday, I thought I'd give you a much more precise 'forecast' of what we have here. I can now tell you, for instance, that the preliminary results I gave you were in fact from an NV18 (GeForce4MX440). Well, let's just say: don't buy that card! Anyway, read the new results and buy another 'new' card.. ;-)


Benchmarks:

The benchmarks were done on three systems. Cards tested are AGP unless otherwise noted. The 3D driver is still not completed (or further developed): we only have 'base' line and triangle HW rendering in place. Preliminary benchmarking indicated that no other functions are used (much) by the GLTeapot and Quake2.

OK, before I give you the benchmarks, I want to share some observations I find interesting:
Table: accelerated openGL 3D speeds on a Pentium 2 @ 350Mhz, FSB at 100Mhz, Dano (gcc 2.95.3).

Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit
Mesa 3.2 software, no AGP FW | 35-40 fps / 1.0x | 30-35 fps / 1.0x | 0.3 fps / 1.0x | 0.3 fps / 1.0x
TNT2-M64, 32Mb (NV05M64) | 85-90 fps / 2.3x | 70-75 fps / 2.2x | 19.2 fps / 64.0x | 15.1 fps / 50.3x
TNT2 Ultra, 32Mb (NV05) | 85-90 fps / 2.3x | 85-90 fps / 2.7x | 23.2 fps / 77.3x | 21.9 fps / 73.0x
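The '/ 2.3x'-style factors in these tables appear to be the midpoint of a card's fps range divided by the midpoint of the Mesa 3.2 software reference at the same color depth. That method is inferred from the numbers, not stated in the post, but it reproduces every factor checked. A minimal sketch (function names are illustrative):

```c
#include <assert.h>

/* Midpoint of a measured range such as "85-90 fps";
   pass the same value twice for a single reading like "19.2 fps". */
double range_mid(double lo, double hi)
{
    assert(lo <= hi);
    return (lo + hi) / 2.0;
}

/* Speed-up of a card over the software reference at the same color depth,
   both given as fps ranges (inferred method, see above). */
double speedup(double card_lo, double card_hi, double sw_lo, double sw_hi)
{
    double sw = range_mid(sw_lo, sw_hi);
    assert(sw > 0.0);
    return range_mid(card_lo, card_hi) / sw;
}
```

For example, the TNT2-M64 teapot at 16 bit gives 87.5 / 37.5 ≈ 2.3, and its Quake2 16 bit figure gives 19.2 / 0.3 = 64.0, both matching the table. In the Pentium 4 table, the 32 bit hardware factors only match when the 'AGP4x + FW' software row is used as the reference (e.g. 137.5 / 200 ≈ 0.7 for the TNT1 PCI).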




Table: accelerated openGL 3D speeds on a 'dual' Pentium 3 @ 500Mhz, FSB at 100Mhz, R5.0.1pro (gcc 2.95.3).

Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit
Mesa 3.2 software, no AGP FW | 45 fps / 1.0x | 50 fps / 1.0x | 0.6 fps / 1.0x | 0.6 fps / 1.0x
TNT1 PCI, 16Mb (NV04) | 110-115 fps / 2.5x | 100-105 fps / 2.1x | 23.1 fps / 38.5x | 19.5 fps / 32.5x
TNT1, 16Mb (NV04) | 115-125 fps / 2.7x | 105-115 fps / 2.2x | 23.5 fps / 39.2x | 19.8 fps / 33.0x
TNT2 Ultra, 32Mb (NV05) | 115-125 fps / 2.7x | 115-125 fps / 2.4x | 29.6 fps / 49.3x | 26.8 fps / 44.7x
Note: results should not be much lower on a single-CPU system with the same setup, as openGL is currently single-threaded AFAIK.



Table: accelerated openGL 3D speeds on a Pentium 4 @ 2.8Ghz, FSB at 533Mhz, Dano (gcc 2.95.3).

Card under test | GLteapot @ 16bit | GLteapot @ 32bit | Quake2 @ 16bit | Quake2 @ 32bit
Mesa 3.2 software, no AGP FW | 190-210 fps / 1.0x | 150-160 fps / 0.78x | 2.8 fps / 1.0x | 2.8 fps / 1.0x
Mesa 3.2 software, AGP4x + FW | 190-210 fps / 1.0x | 190-210 fps / 1.0x | 2.8 fps / 1.0x | 2.8 fps / 1.0x
TNT1 PCI, 16Mb (NV04) | 145-160 fps / 0.8x | 130-145 fps / 0.7x | 28.5 fps / 10.2x | 22.8 fps / 8.1x
TNT2, original, 32Mb (NV05) | 195-205 fps / 1.0x | 175-185 fps / 0.9x | 40.4 fps / 14.4x | 31.4 fps / 11.2x
TNT2 M64, 32Mb (NV05M64) | 175-185 fps / 0.9x | 135-145 fps / 0.7x | 30.9 fps / 11.0x | 20.7 fps / 7.4x
GeForce2 MX400, 32Mb (NV11) | 200-220 fps / 1.1x | 190-210 fps / 1.0x | 38.2 fps / 13.6x | 25.5 fps / 9.1x
GeForce4 MX440, 64Mb (NV18) | 90-100 fps / 0.5x | 90-100 fps / 0.5x | 10.5 fps / 3.8x | 10.4 fps / 3.7x
GeForce4 MX4000, 128Mb (NV18) | 90-100 fps / 0.5x | 90-95 fps / 0.5x | 10.3 fps / 3.7x | 10.2 fps / 3.6x


Note please (for all three tables):
Note also:
NV20 and later cards currently don't work: tested NV28 (GeForce4 Ti4200), NV34 (GeForce FX5200) and NV34Go (GeForce FX5200 in a laptop). It remains to be seen if I can get these up and running.

Well, I have to say that personally I am very pleased with these results (don't use an NV18 for 3D ;-). And on top of that, it also turns out the system remains rock-solid as usual: I have not encountered any problems yet... :-)

OK, that's it for now. Talk to you later!


11 April 2005: Relocating the texture memory onto the graphics card turned out to be a breeze: the code worked instantaneously. Which is not too surprising, because this code does not depend on any BeOS-specific feature.
Anyway: I just set up some logging stuff again, told the driver to go to active rendering state, and disabled the three actual acceleration hooks: points, lines, and triangles. So still the same amount of hardware rendering is used as before; only now, the textures are being placed in and used from the graphics card's memory.

I benchmarked GLteapot and Quake2 again: this time there's no further speed decay. For GLTeapot one could expect that, as it doesn't use textures. For Quake2, which does use textures, it's interesting to see the framerate remaining as it is. Apparently the size of those textures is relatively small, so it doesn't have a noticeable influence on rendering speed. One other interesting thing I saw is that sometimes textures are requested from the driver which were never allocated before: this must be the Mesa 3.2 problem behind the missing texturing on the Quake2 room floors (I mentioned that problem earlier).


Texturing details:

So what more can I tell you about texturing? Well, as it worked instantaneously, not that much, really. Let's just list what I did see:

Further steps to take:

From the looks of it all, it seems setting up a 3D driver can indeed be done nicely by working step by step. And, very important: each step can be tested on its own. Because of that, the 'variables' that can mess up each step are kept to a limited number, which makes it all doable. Adding texture support turns out to be one of those steps: it's no problem to map textures to the graphics memory while keeping their use purely software-based inside Mesa.

The nVidia driver is now in hardware rendering state, which means that in theory the hardware rendering functions and the card-based textures are in use. Luckily, not setting the hardware rendering functions lets Mesa use its software fallbacks perfectly instead, making the 'pure' texture stuff a separate step to take. As I already mentioned before, the GLteapot uses lines to draw its FPS display, while it uses triangles to draw the teapot. The logical next step to take is setting up hardware rendering for lines only. After that, the points and triangles functions can be set up one by one: completing the entire basic driver as I planned to do before trying to switch to other Mesa versions and add DMA support (and such). I'd advise doing the line function first, as its rendering results are relatively easy to interpret: they are (sort of) 2D after all ;-)

One interesting detail is probably the fact that (in the nVidia driver) points and lines actually have two different rendering functions: there are AA (anti-aliased) and normal (base) versions (triangles only have a base version in the nVidia driver, apparently). The teapot uses the base version, so that will be my next target. Or better yet: it was my next target, because I already completed it.
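The 'leave a hook unset and Mesa falls back to software' approach described above is essentially a table of optional function pointers. Here is a minimal sketch of that pattern; driver_hooks, mesa_sw_line() and nv_hw_line() are hypothetical names (the two line functions only count calls), and Mesa 3.2's real driver interface is organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical line primitive and hook table (not Mesa 3.2's real API). */
typedef struct { float x0, y0, x1, y1; } line_t;
typedef void (*line_func)(const line_t *l);

typedef struct {
    line_func hw_line;   /* NULL until a hardware version exists */
} driver_hooks;

/* Call counters so the chosen path can be observed. */
static int sw_lines_drawn = 0;
static int hw_lines_drawn = 0;

/* Hypothetical stand-ins for Mesa's software path and the nVidia HW path. */
void mesa_sw_line(const line_t *l) { (void)l; sw_lines_drawn++; }
void nv_hw_line(const line_t *l)   { (void)l; hw_lines_drawn++; }

/* Use the driver hook when it is set, otherwise fall back to software. */
void render_line(const driver_hooks *hooks, const line_t *l)
{
    assert(hooks != NULL && l != NULL);
    if (hooks->hw_line != NULL)
        hooks->hw_line(l);
    else
        mesa_sw_line(l);
}
```

Filling in hooks one at a time (lines, then points, then triangles) is what makes each step testable on its own: anything not yet hooked keeps rendering through the software path.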


The line-rendering (base) function:

Next up was doing the first real hardware rendering function. As I already told you, it uses the same engine command as, for instance, the clear buffer hardware command, so this should not be very hard to do. Well, this turns out to be true indeed. Initially I hit two problems. I am listing them here along with their solutions: Well, after fixing these two errors the teapot is finally rendering its FPS readout using a hardware command! Of course, the rendering speed didn't go up: the real work is done with all those triangles. Oh, and Quake2 also still renders at 0.4fps over here.


nVidia hardware and coordinates:

As I already mentioned just now, nVidia's Z-buffer apparantly has inverted depth coordinates compared to Mesa's internal one: 'zero' is closeby with nVidia, as opposed to far-away with Mesa.
It's probably also interesting to know that the nVidia color buffer has its 2D reference (so 0,0) coordinate at the left-top of the screen (or window), while inside the Z-buffer the 2D reference is at the left-bottom corner. This means that for Z-buffer access the Y-coordinate is inverted inside the driver, just like the Z-coordinate...
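The two inversions described above can be sketched as plain reflections. This is only an illustration under assumptions: depth is taken to be an integer in [0, depth_max], and the helper names flip_depth and flip_row are hypothetical, not functions from the driver.

```c
#include <stdint.h>

/* Hypothetical helper: reflect a depth value across the depth range.
 * Per the text, one side treats 0 as near and the other treats 0 as far,
 * so converting between them is depth_max - z (assumes z <= depth_max). */
static uint32_t flip_depth(uint32_t z, uint32_t depth_max)
{
    return depth_max - z;
}

/* Hypothetical helper: convert a row index between a top-left origin
 * (color buffer) and a bottom-left origin (Z-buffer) for a buffer that
 * is `height` rows tall. */
static int flip_row(int y, int height)
{
    return height - 1 - y;
}
```

Both conversions are their own inverse, so the same helpers work for reading and for writing.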


Next up:

Now 'all' that's left to do is implement the rest of the 3D rendering functions (points, lines and triangles) to more or less complete the entire UtahGLX driver. That will complete 'step 1' of the roadmap as described in the 3 March 2005 post below. Apart from a major code cleanup, that is...


Well, that's it for now. Talk to you later. Looks like we will have acceleration going real soon now!


7 April 2005: OK, here's the (final) update I promised about backbuffer and clipping. For the purpose of creating a 3D driver that only supports double-buffered contexts, I seem to be done with both subjects now. The results: a perfectly working function for blitting the backbuffer to the frontbuffer (so inside BGLView), and much better personal insight into the clipping details needed inside the driver. Before I fill you in on the details, let's look at the current framerates: So why does the framerate keep on dropping depending on buffer sizes and the complexity of the frames to be rendered? Well, that's just logical, as both Z-buffer and backbuffer access have become a lot slower due to the bottleneck now sitting between the CPU and the memory used: the 'graphics' bus. Rest assured this will be 'over' once we let the GPU (the card's acceleration engine) do the accesses instead of the CPU. After all, the GPU 'no longer' has this bottleneck, as these buffers now sit at its end of the 'graphics' bus...


Clipping trouble:

Unfortunately, I hit some errors in Be's implementation of BGLView. Because I now need to do accelerated back-to-front blits, I needed to set up manual clipping for them so I won't overwrite, for instance, the Teapot's menubar and its dropdown lists, or overwrite windows that happen to be (partly) on top of 'my' output window. Well: BGLView has a function called DirectConnected() which should provide a list of the so-called 'clipping rectangles'. Unfortunately, it does not work correctly. There are two errors: In order to overcome these problems, I created workarounds for both errors. The first problem is overcome by using BView's GetClippingRegion() function, and the second by comparing BGLView's initially given clipping rect to the view's size to find out whether a menu exists. You see, while the menu's offset is missing, the clipping rect's size is actually correct!
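Manual clipping for the back-to-front copy boils down to intersecting the blit rectangle with each visible clipping rectangle and issuing one accelerated blit per non-empty piece. The sketch below assumes the clipping rectangles are non-overlapping and use inclusive left/top/right/bottom edges, loosely modeled on BeOS clip lists; rect_t, rect_intersect and clip_blit are hypothetical names, and the actual blit call is left out.

```c
/* Hypothetical rectangle with inclusive edges. */
typedef struct {
    int left, top, right, bottom;
} rect_t;

/* Intersect r with clip. Returns 1 and stores the result in *out if the
 * intersection is non-empty, 0 otherwise. */
static int rect_intersect(rect_t r, rect_t clip, rect_t *out)
{
    rect_t i;
    i.left   = r.left   > clip.left   ? r.left   : clip.left;
    i.top    = r.top    > clip.top    ? r.top    : clip.top;
    i.right  = r.right  < clip.right  ? r.right  : clip.right;
    i.bottom = r.bottom < clip.bottom ? r.bottom : clip.bottom;
    if (i.left > i.right || i.top > i.bottom)
        return 0;
    *out = i;
    return 1;
}

/* Split one back-to-front blit into the pieces that fall inside the
 * visible clipping rectangles. Writes up to max_out pieces to out[] and
 * returns how many were written. In a driver, each piece would become one
 * accelerated blit. */
static int clip_blit(rect_t dst, const rect_t *clips, int num_clips,
                     rect_t *out, int max_out)
{
    int n = 0;
    for (int c = 0; c < num_clips && n < max_out; c++) {
        if (rect_intersect(dst, clips[c], &out[n]))
            n++;
    }
    return n;
}
```

A menubar covering the top 20 rows of the view simply never appears in the visible clip list, so no blit piece touches it.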

Well, suffice it to say that the resulting code actually works surprisingly well: I can't get it to malfunction currently (although in theory it could). The good news, by the way, is that this error should be easy to fix inside Haiku (without losing compatibility) should they decide to use BGLView in the end. Remember: this is just a 'personal' attempt...

If you want more details on the workarounds, check out Haiku's app_server mailing list: I just posted my findings there.


Clipping impact:

We should distinguish two different setups for a 3D driver here: single- and double-buffered contexts. Let's first consider double-buffered contexts, as they are simpler than single-buffered ones.

Double buffering
For double buffering, the actual 3D rendering takes place in the backbuffer. This backbuffer is never shown on screen (for windowed apps), so we don't have to think about clipping around menus and such there. Only the part of the driver that copies the backbuffer into the on-screen window has to deal with it: this is the only function accessing the frontbuffer, where (system) menus (etc.) might be shown.

If we have a fullscreen app, we can (later on) simply flip buffers rather than doing a copy. This is only possible here, as no other items are shown on screen: the app has total control over the visible buffer. No clipping has to be done for this setup either, not even for the frontbuffer. Flipping buffers will of course further speed up framerates, as a flip costs far less time than a copy, even if that copy is accelerated. As flipping should be simple to implement, I will add it to the driver later on.

Having flipping in place means that the backbuffer in fact becomes the frontbuffer, and vice versa. A side effect of that property is that the 'frontbuffer' now needs to be set up using 3D granularity (see below for more info about granularity).
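Flipping only swaps the roles of the two buffers instead of moving pixels. A minimal sketch of the bookkeeping, assuming both buffers live in graphics RAM and are identified by byte offsets; flip_state_t and flip_buffers are hypothetical, and the hardware step (presumably programming the new display start address, ideally during vertical retrace) is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical page-flip bookkeeping: two buffers in graphics RAM,
 * identified by their byte offsets. The card scans out `front`;
 * 3D rendering goes to `back`. */
typedef struct {
    uint32_t front;  /* offset of the buffer currently displayed */
    uint32_t back;   /* offset of the buffer being rendered into */
} flip_state_t;

/* Swap roles instead of copying pixels. A real driver would also point the
 * display hardware at the new front offset; that part is omitted. */
static void flip_buffers(flip_state_t *s)
{
    uint32_t tmp = s->front;
    s->front = s->back;
    s->back = tmp;
}
```

Because each buffer takes turns being rendered into, both offsets (and both pitches) must satisfy the 3D constraints, which is the side effect mentioned above.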

Single buffering
Ah: now we are in trouble! Remember hardware (back)buffer clearing? Well, now we want to do that on the frontbuffer! Which means, suddenly the clearing function has to take clipping into account. And that's not all: every 3D rendering function needs to take clipping into account as well!


Granularities and speeds:

While we are on the subject of single-buffered rendering, let's talk about buffer granularities. Here's the thing: 2D acceleration functions have a certain granularity we have to abide by. The 2D driver takes this granularity into account and relays it to the app_server via the frame_buffer_config struct. That way we can use resolutions that are not natively supported by the engine.

Well, here's the interesting part: 3D functions (might) need a larger granularity! I already confirmed this on nVidia cards. So, if we are doing double buffering, I can leave the 2D driver setup as it is: after all, copying back-to-front is a 2D function! But if we are going to do single buffering, the engine will simply crash: now 3D functions have to access the frontbuffer directly, which is set up using 2D constraints only (the backbuffer and such are already set up using 3D constraints).

Of course, it's quite easy to patch the 2D driver to set up its buffer using 3D constraints instead of 2D constraints: you can look forward to that in one of the upcoming versions of the 2D driver.

So: what's the use of granularities anyway? Well, this has to do with bus widths in the GPU, and to/from the graphics memory. By using a large width, the GPU (and RAM) can process commands (much) faster than with narrower buses. Generally speaking: the newer the card, the wider its buses.
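In practice, respecting a granularity means padding each buffer row (the pitch, in bytes) up to the next multiple of the engine's requirement. A minimal sketch, assuming the granularity is a power of two in bytes; the 25 March 2005 entry further down reports 64 bytes for 3D on NV11 and NV18, while other architectures may need more. The function name align_pitch is hypothetical.

```c
#include <stdint.h>

/* Round a buffer's bytes-per-row up to a multiple of the engine's
 * granularity. `granularity` is in bytes and must be a power of two
 * (e.g. 64 for 3D on NV11/NV18 according to the measurements reported). */
static uint32_t align_pitch(uint32_t width_px, uint32_t bytes_per_px,
                            uint32_t granularity)
{
    uint32_t pitch = width_px * bytes_per_px;
    return (pitch + granularity - 1) & ~(granularity - 1);
}
```

For example, a 1000-pixel-wide 16-bit buffer needs 2000 bytes per row, which a 64-byte granularity pads to 2048 bytes.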


OK, that about wraps it up for today. As you might have guessed, I will now implement moving the textures to graphics RAM. Talk to you again when I have news about that! Bye! :-)


4 April 2005: Well, BG has been fun, as usual. Only, it gets better each time around! It's also very nice to see the improvements for both Zeta and Haiku each time. Furthermore, two people donated their (older) graphics cards to me so I can finally see why these are still not cooperating as they should: RAM-related bandwidth trouble on a TNT2-M64 and a GeForce2 Ti. Now that I have those cards to test with, I can hopefully finally solve or minimize this problem.

Anyway, back to 3D-related news. First, I'd like to point out that I updated the Mesa 3.2 source and Dano 'executable' downloads so they now run Quake2. I found myself looking for a HW rendering problem I thought I had introduced, so I had to look at my starting point again to see if that was indeed the case. Luckily for me, it was not. The 'numbers' in the lower-middle border of the screen are missing here as well, which means I don't have to look into that, as it will be solved automatically when we switch Mesa versions later on. Pfeww! In order to test for this problem, I had to further 'update' Mesa 3.2, which is why I uploaded the source to this site as well. Download it here if you want (although it is still just software rendering of course):

The backbuffer is up and running!

I have the backbuffer up and running nicely. This buffer now uses the same color space (depth) as the frontbuffer, as opposed to 'normal' software rendering, which always uses 32-bit space. I have set up a sparse version of accelerated blits for back-to-front buffer rendering: works nicely. This is synced to the (Direct)window via the DirectConnected() function. I need to set up manual clipping to let you see the menus, and I need to rewrite the frontbuffer write/read bits stuff. I have already rewritten the write/read bits functions for backbuffer access: they only work in 16 and 32-bit space at the moment. The rewrite is needed because, while the buffer's space is in 'frontbuffer' depth, the colors handed to those access functions are always 32-bit.
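The core of such a rewrite is converting the always-32-bit colors to the buffer's depth on the way in. A minimal sketch for a 16-bit buffer, assuming the colors arrive packed as 0xAARRGGBB and the target layout is RGB565; both layouts, and the names argb8888_to_rgb565 and write_span_rgb565, are assumptions for illustration rather than the driver's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a 32-bit color (assumed 0xAARRGGBB) into a 16-bit RGB565 pixel.
 * Alpha is dropped; the low bits of each channel are truncated. */
static uint16_t argb8888_to_rgb565(uint32_t argb)
{
    uint32_t r = (argb >> 16) & 0xFF;
    uint32_t g = (argb >> 8) & 0xFF;
    uint32_t b = argb & 0xFF;
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Write a horizontal span of n 32-bit colors into one row of a 16-bit
 * buffer, starting at column x. If mask is non-NULL, only pixels whose
 * mask byte is non-zero are written. */
static void write_span_rgb565(uint16_t *row, int x, int n,
                              const uint32_t *colors, const uint8_t *mask)
{
    for (int i = 0; i < n; i++) {
        if (mask == NULL || mask[i])
            row[x + i] = argb8888_to_rgb565(colors[i]);
    }
}
```

The read direction would do the reverse expansion, replicating the top bits of each channel into the missing low bits.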

All in all, getting the backbuffer fully up and running is a lot of work, as this is a BDirectWindow-like setup. And I still have to fix a few things:
That's it for now. An extra update will be here soon explaining more about frontbuffer clipping and how far this influences 3D rendering. But first: Back to work! 8-)


25 March 2005: I have some very good news for you today: the hardware Z-buffer clear command is now actually working! This means that both the NV10_CONTEXT_SURFACES_ARGB_ZS and NV10_DX5_TEXTURE_TRIANGLE commands are up and running. Which in turn means I will actually be able to get this driver going, unless I am very much mistaken! Of course it remains to be seen which cards will work with it, and whether I can get DMA up and running...

OK, here's the story. After I wrote the previous 'mixed emotions' message here, I started thinking. I realized I had developed the original 2D driver (PIO mode) using XFree 4.3.0 as a reference for specs. Of course the UtahGLX driver never(?) worked with that! So I cross-checked the XFree 4.3.0 driver against the 4.2.0 driver, which did work with UtahGLX. After doing this, I saw I had everything in place, and I realized something else: the UtahGLX nVidia driver is still being worked on after all. It seems someone is trying to make it less dependent on XFree, and to make it work alongside newer XFree versions. The 3D init code I have in the 2D driver came from the UtahGLX CVS checkout I did late last year. This stuff wasn't in XFree 4.3.0, but it was in 4.2.0, which means that someone 'moved' it into the UtahGLX driver after the XFree 4.2.0 release. Interesting to see the past 'unravel' :-)

Well, so there was no problem here, and my 2D driver should do enough initialization for the 3D driver. The problem had to be somewhere else. Of course, when I looked at my 3D test code again, I immediately saw a fault: I had initialized a command pointer to the wrong command. Don't know how I could have missed that; I was just plain tired, I guess. This fault wasn't the only thing causing trouble though: the card didn't work as expected either. After I switched back to an NV11 (GeForce2 MX400) and removed the pointer fault, it all worked at once. Or, to be more exact, I had already switched back the day before in an effort to rule out the card as a troublemaker, so I just had to remove the pointer fault this time around. And it works with an unmodified copy of the nVidia 2D driver: V0.41.


HW clearing explained:

So, now that this clear command using 3D vertices and triangles worked, I wanted to know how that could be. I mean, I still thought in terms of doing 2D blit commands to fill a rect (the previous attempt ;-). But when you think about it, using a 'fixed' set of triangles is much smarter! Let me tell you how I think it works. So, how does this clear the Z-buffer? It's simple, actually. By rendering anything, you not only write to the visible buffer (called 'colorbuffer' in OpenGL terms), but you also write to the Z-buffer. After all, it somehow has to be determined whether the next thing you render lies in front of or behind the previous thing you did. Well, by drawing a rectangle the size of the total screen in the utter background, we make sure anything rendered next will be closer by and will actually update the colorbuffer. Hence, we cleared the Z-buffer.

And how does this clear the colorbuffer? Well, the rectangle we drew lies exactly parallel to our 'viewing position' (the monitor's screen), as we specified the same Z-coordinate for all vertices, hence resulting in a rectangle on screen. On top of that, the function gets a 'clearcolor' specified that is used to fill that rectangle. For the Teapot, that's blackness. For Quake2, it's a bright red. Want proof? Well, just specify to draw at Z position 'zero': so in the utter front of the 'world' we look at. The application won't render anything after that: nothing can be in front of our rectangle this time...
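The clear described above amounts to two triangles covering the target rectangle, all at one depth and all carrying the clear color. A minimal sketch of building that vertex list, assuming a simplified vertex (position, depth, packed color); clear_vertex_t and build_clear_quad are hypothetical and do not reproduce the NV10_DX5_TEXTURE_TRIANGLE vertex format. Passing the window's rectangle instead of the whole screen is what keeps the clear from costing more than needed.

```c
#include <stdint.h>

/* Hypothetical vertex for the clear: position, depth and packed color. */
typedef struct {
    float x, y, z;
    uint32_t color;
} clear_vertex_t;

/* Build two triangles covering (x0,y0)-(x1,y1), all at depth far_z and all
 * filled with clear_color. Drawn at the 'far' depth, this resets both the
 * Z-buffer and the color buffer for that rectangle in one pass. */
static void build_clear_quad(float x0, float y0, float x1, float y1,
                             float far_z, uint32_t clear_color,
                             clear_vertex_t out[6])
{
    clear_vertex_t tl = { x0, y0, far_z, clear_color };
    clear_vertex_t tr = { x1, y0, far_z, clear_color };
    clear_vertex_t bl = { x0, y1, far_z, clear_color };
    clear_vertex_t br = { x1, y1, far_z, clear_color };

    out[0] = tl; out[1] = tr; out[2] = bl;  /* upper-left triangle  */
    out[3] = tr; out[4] = br; out[5] = bl;  /* lower-right triangle */
}
```

Which numeric value counts as 'far' depends on the depth convention in use; as noted elsewhere in this blog, nVidia treats zero as close by.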


Benchmark results and remaining problems:

I benchmarked a bit with HW clearing of the Z-buffer in place. After all, the speed should go up again, right? Well: right! Moving the Z-buffer to gfxRAM alone dropped GLteapot's framerate to about 35-40fps. Adding 'correct' HW clearing of the Z-buffer increased speed to about 40-45fps again.

Hmm, why do you say 'correct', you might ask? Well, the driver currently clears a buffer the size of the total screen, instead of one the size of the Teapot's window. With that full-screen clear, rendering runs at about 17fps. I will correct this asap; the problem did not exist in the UtahGLX driver before. Of course, for fullscreen apps the rendering speed will not be influenced by this. In order to force the driver to clear just the correct size (more or less, for now), I 'enhanced' the NV10_CONTEXT_SURFACES_ARGB_ZS command compared to the UtahGLX/XFree 4.2.0 driver's version: I am explicitly setting the pitches for the color and Z-buffer (they used to be 'just' preconfigured by the 2D driver's acc init code). Testing explicit setting turned out to be rather interesting: the engine's granularity for 3D activities is larger than it is for 'just' 2D operations! For the NV11 and NV18 this turns out to be 64 bytes (not pixels, mind you!), while I think (or rather, I hope) it's even larger for NV28 and some other architectures. I say this because of two things: Anyway, this granularity thing is one of the first things I'll test now, as I'd like to be able to work on my laptop as well :).


The next steps:

Well, after reading the previous stuff, you might have guessed the next 'bigger' steps already:

Interesting side-effects of (testing) HW 3D rendering:

The question of synchronization between the app_server/2D driver and the 3D driver came up once or twice. I can now tell a bit about this, with evidence. As you might know, 'clients' using the 2D driver are required to ACQUIRE_ENGINE before they want to do something accelerated, and RELEASE_ENGINE asap after that. This is how, for instance, the app_server and BWindowScreen are serialized when they both want to accelerate (2D) drawing. Well, nothing makes more sense than keeping that system for 3D as well. Hence, as far as the 2D driver is concerned (more or less), the 3D driver is 'just' another clone of the accelerant (just like the one BWindowScreen uses). This means sync between 3D and 2D drawing (i.e. moving the GLTeapot window (== 2D) while it's spinning the teapot (== 3D)) is also done via ACQUIRE_ENGINE/RELEASE_ENGINE (a benaphore actually).
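A benaphore is an atomic counter placed in front of a semaphore, so an uncontended acquire or release never touches the (slower) semaphore at all. The sketch below shows the idea with C11 atomics and POSIX semaphores standing in for the BeOS atomic_add()/acquire_sem()/release_sem() calls; benaphore_t and its functions are hypothetical names, not the accelerant's actual implementation.

```c
#include <errno.h>
#include <semaphore.h>
#include <stdatomic.h>

/* Counter plus semaphore: count is the number of threads holding or
 * waiting for the lock. */
typedef struct {
    atomic_int count;
    sem_t sem;
} benaphore_t;

static int benaphore_init(benaphore_t *b)
{
    atomic_init(&b->count, 0);
    return sem_init(&b->sem, 0, 0);  /* semaphore starts with no tokens */
}

static void benaphore_lock(benaphore_t *b)
{
    /* If someone already holds the lock, block until it is handed over. */
    if (atomic_fetch_add(&b->count, 1) > 0) {
        while (sem_wait(&b->sem) == -1 && errno == EINTR)
            ;  /* retry if interrupted by a signal */
    }
}

static void benaphore_unlock(benaphore_t *b)
{
    /* If anyone queued up behind us, wake exactly one waiter. */
    if (atomic_fetch_sub(&b->count, 1) > 1)
        sem_post(&b->sem);
}

static void benaphore_destroy(benaphore_t *b)
{
    sem_destroy(&b->sem);
}
```

The lesson from the next paragraph carries over directly: everything that forms one engine command must sit between a single lock and unlock pair, or another client can slip its commands in between the parts.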

Well, I tested this and was punished right at the start. It turned out I had split up something that should be one 3D engine command into three different parts, each doing the ACQUIRE/RELEASE stuff. It turns out the engine really gets confused if you insert another (2D) command in between those parts. This is something I can understand, however: I was just on the wrong path because in the UtahGLX code these three subparts exist just like that, while they are in fact just one engine command.
So I combined them into one, and then it worked flawlessly. Of course, dragging the window is less fluent when the engine is working on 3D as well, but it works perfectly. I'll optimize the process along the way BTW.

OK: say you are having such difficulties, but you don't recognize them. How can you test for this? That's quite simple: just modify the 2D driver to not export the acceleration hooks, but keep initializing the engine. If you now run a 3D app, that app will use the engine, but the 2D driver won't. Problem solved? There you go. Worked for me... ;-)

I'll finish for today with an interesting side effect I encountered while testing all this: I can hear (yes indeed: hear) the engine using power. If I drag the Teapot window (accelerated), I hear the system's power supply (350W I think, testing a passively cooled AGP NV18) complaining (a bit) about the strong fluctuations in power drain: you can hear some noises coming out of it as it regulates the output voltages, in sync with drawing :-). I heard it with just 2D acc as well (scrolling in source code), but it's definitely louder now. While this doesn't matter (it's its job to regulate after all :), it proves the card actually has to work for us now ;-).


22 March 2005: Well, I've got both good news and bad news, I guess.

The good news is that I have set up the low-level engine commands again, namely NV10_CONTEXT_SURFACES_ARGB_ZS and NV10_DX5_TEXTURE_TRIANGLE. I confirmed having access to the engine and its FIFO, and I confirmed I can issue a 2D-related command that apparently gets executed.

The bad news, however, is that issuing one of those 3D-related commands locks up the acceleration engine. I can see the FIFO receiving the commands, but as they won't execute, the FIFO fills up and never gets emptied again.
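When the FIFO never drains, a driver that waits for free FIFO space in an unbounded loop simply hangs. One way to make such a lockup visible during testing is a bounded wait that reports failure instead. This is a hypothetical sketch, not code from the 2D or 3D driver: the free-space read is abstracted as a callback so the logic can be tried against a simulated FIFO rather than real registers.

```c
#include <stdint.h>

/* Hypothetical accessor returning how many FIFO entries are free. In a
 * driver this would be a register read; here it is a callback. */
typedef uint32_t (*fifo_free_fn)(void *ctx);

/* Wait until at least `needed` entries are free, but give up after
 * `max_polls` reads. Returns 0 on success, -1 if the engine appears stuck
 * (the FIFO never drains). */
static int wait_fifo_space(fifo_free_fn read_free, void *ctx,
                           uint32_t needed, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (read_free(ctx) >= needed)
            return 0;
    }
    return -1;
}

/* Simulated FIFO for trying the logic without hardware: each poll frees
 * `drain` more entries; drain == 0 models a locked-up engine. */
typedef struct {
    uint32_t free_entries;
    uint32_t drain;
} sim_fifo_t;

static uint32_t sim_fifo_free(void *ctx)
{
    sim_fifo_t *f = ctx;
    uint32_t now = f->free_entries;
    f->free_entries += f->drain;
    return now;
}
```

With a locked-up engine the wait returns -1 after max_polls reads, so the test code can log which command was last issued instead of freezing the machine.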

I've now reached the point where I have to figure out what's wrong by trial and error, combined with painstaking bitwise comparison of my 2D and 3D drivers with the Linux 2D and 3D drivers. I've been at this sort of point twice before with nVidia cards: when I set up PIO mode acceleration in the early phase of development on the 2D driver, and just now, when adding DMA mode acceleration to that driver. This, combined with the fact that the 3D commands mentioned actually worked on Linux, means that in theory I should be able to find the problem.

I have to admit it tires me a lot though, having to do this: it costs a lot of energy. Sometimes during this I get the urge to throw out my computers and never look at them again. Luckily, up to now, after a day or two of doing nothing, I also get the urge to find out what's wrong. It's in my nature to dig deeper and deeper... :-) On the other hand, it gets more difficult every time around to overcome the natural resistance I feel toward such things.

Anyway: no promises (as usual). I'll do my best and hope I can nail this one. I've come to 'hate' not having documentation however!


20 March 2005: Quake2 is now running (unmodified) on Mesa 3.2. While trying to stabilize the GLteapot I: Well, the Teapot is more stable now (it's still not perfect, but we'll switch to Mesa 6.2 later on anyway). And Quake2 works, which is nice to have for more extensive testing of my current efforts. It's interesting to see that Mesa 3.2 apparently offers a lot fewer options than version 6.2: the floors in the Quake rooms are perfectly clean, for example (no textures there). I benchmarked Quake2 again just for fun: As a final note for today I have to say the Be debugger (bdb) is proving itself to be invaluable! It's a very handy tool: can't live without it in user space! Although I never needed it for my 2D drivers: yet.


18 March 2005: Yesterday was spent trying to get the Z (so depth) buffer relocated to the space already reserved in graphics RAM. In order to do that, I had to hook the driver into the Mesa driver-interface function 'AllocDepthBuffer'. Well, that didn't go as easily as I hoped. After a lot of searching, it turned out you also have to hook in two other functions: 'DepthTestSpan' and 'DepthTestPixels'. Mesa has its own internal 'fallback' versions of them (like it has for all the others as well), but in the case of these two it only uses them as long as AllocDepthBuffer is managed by Mesa internally as well. It thinks tha