linux - Program SIGNIFICANTLY slower when used from the TTY

Thursday, 26 September 2019


So I have a program written in C++.

It does a lot of quite heavy, multi-threaded calculation, and it can tell me how long all the calculations took.

I've just noticed that if I run the program on the exact same machine, it takes around 20-21 seconds to do all the calculations if started from the TTY, and only about 0.2 seconds if I start it from GNOME terminal.

What is causing that? It's literally the exact same file on the same machine.

Answer

Some background theory

Well, both what you get after pressing Ctrl+Alt+F1 and GNOME Terminal are different implementations of the same concept: emulating a so-called full-screen terminal.

The former is called a virtual terminal (VT) in Linux, or usually just "the console". It uses a special "text-only" video mode still provided by hardware video cards on x86-compatible platforms (those of "IBM PC" heritage, that is). The latter is a GUI application.

Both provide the applications running in them with the set of facilities such applications expect from "a terminal device".

The problem at hand

OK, now let's move to perceived slowness.

I'm sure the crux of your problem is that your program does so-called "blocking" I/O. That is, each time you do something like

std::cout << "Hello, world" << std::endl;

in your code, the C++ standard library linked into your application first kicks in and handles the output of whatever was sent to that stream.

After some processing (and usually some buffering), this data has to leave your program's process and actually reach whatever medium your program sends its output to. On Linux (and other Unix-like systems), this requires calling into the kernel via a dedicated system call (syscall for short) named write().

So the C++ stdlib eventually makes that write() syscall and then waits for it to complete, that is, for the kernel to report back "OK, the receiver has accepted the data".

As you can deduce, the receiver of the data your program outputs is the terminal (emulator) running your program: either a Linux VT or an instance of GNOME Terminal in your tests. (The full picture is more complicated, since the kernel doesn't hand the data directly to a running terminal emulator, but let's keep the description simple.)

And so the speed with which that write() syscall completes depends heavily on how fast the receiver handles the data! In your case, GNOME Terminal simply does it much faster.

My take on the difference is that the VT driver dutifully renders all the data sent to it, scrolls it, etc., while GNOME Terminal optimizes bursts of incoming data by rendering only the tail portion of it (whatever fits the terminal's screen size) and puts the rest into the so-called scrollback buffer that most GUI terminal emulators have.

The takeaways

The crucial thing to take away from this is that as soon as your program performs any I/O along with its calculations, and you measure the program's calculation speed with a "wall clock" timer, you may well be measuring the speed of that I/O rather than the speed of the calculations.

Note that I/O is tricky: your process can be preempted (stopped, with its resources handed over to another process) by the OS any time it is about to wait for some I/O resource, such as a hard disk drive, to become available for writing.

So the surest way to measure the "raw" performance of your calculations is to have some facility in your program to disable all I/O. If that's not possible, or would be too ugly to implement, at least try directing all the output to the so-called "null device", /dev/null, by running your program like

$ ./program >/dev/null

The null device simply discards all the data passed to it. So yes, each I/O round performed by the C++ stdlib will still hit the kernel, but at least writing will be almost constant (and almost instant) in speed.

If you need both the measurements and the generated data, consider creating a so-called RAM disk and redirecting the output to a file located there.

One more note on measuring: even on a seemingly idle system running a commodity OS (such as your Ubuntu or whatever), the CPU never sleeps; there are always some tasks doing stuff in the background. This means that measuring computation performance, even without any I/O or with I/O "disabled" as explained above, will still produce different results on each run.

To compensate for this, good benchmarking means running your calculation with the same input data several thousand times and averaging the results over the number of runs.
