Issue with Hardware Load

Hello,

We recently bought two HP servers, each with two CPU sockets (20 cores / 40 threads) and 500 GB of RAM.

All users are having issues with Google Chrome: it is very slow, and the load on the servers is quite high.


The number of users is exactly the same as on the previous day:

It seems I have underestimated the resource needs.

Does anyone have an idea of how to improve this? Or do we need to rethink the topology and add more servers with less memory?

@s0p4L1n hard to know for sure; it depends on your usage patterns and available hardware. As a guess: if users are doing things in the browser which require graphics acceleration and there is no GPU available, this could result in high CPU usage and poor performance.

From within your ThinLinc session, run:

glxinfo | grep "OpenGL renderer"

Open Chrome and browse to chrome://gpu. Check the value for GL_RENDERER.

If llvmpipe shows up in both of these places, then the CPU is being used for graphics rendering.
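For reference, when the CPU is doing the rendering, that glxinfo line will look something like this (the LLVM version string will differ per system):

OpenGL renderer string: llvmpipe (LLVM 11.0.1, 256 bits)

A hardware renderer would show the GPU model instead.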


Like @aaron said, it depends on what your users are doing, but to me, it seems like you should have sufficient resources for typical use.

Is it only Chrome that runs slow in the sessions?

Are users on the tl-beta-d11 node having a better experience (that machine looks to have a lower load in your screenshot)? If not, is it possible that other factors are playing a part here? I’m thinking of stuff like network latency or bandwidth issues.

What desktop environment are you using? Desktops such as GNOME are typically not designed to run well in remote desktop scenarios and could very well be a reason for the high CPU load you’re seeing.
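If you’re unsure which desktop environment the sessions run, something like this from inside a session should tell you (assuming your session startup sets the variable, which most desktops do):

echo "$XDG_CURRENT_DESKTOP"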


We had a CPU usage peak lasting 45 minutes; the load reached 215 at its highest on tl-beta-d11, and the load on tl-alpha-d11 was almost the same, with only 35 users (load balanced between the two nodes).

The ThinLinc client accepted user logins, but would not launch Chrome, as everything was very slow.

I rebooted tl-alpha-d11 (it was not working anyway, so the reboot had no extra impact); after this, the load returned to its normal value.

During the afternoon there were no more issues, and the load average stayed between 5 and 15.

Both nodes have redundant switch connections, with 2x10 Gbps QSFP+ cables.

I have installed XFCE as the desktop environment.

For now we have not identified the issue. Maybe I will discuss GPU acceleration with my manager; I did not think about it when we decided to buy dedicated hardware servers.

All employees using ThinLinc for now are graphic designers, who need YouTube for reference material, and that could be the cause.

I’m trying to find an alternative, such as a YouTube client without upload features, so they can watch videos directly on their local workstations instead of inside ThinLinc, avoiding the high CPU peaks.

We only allow Google Chrome inside ThinLinc sessions; when an employee connects, Google Chrome opens directly.

We do not have a GPU in the HP servers.

Which graphics card do you recommend for ThinLinc? A consumer RTX card, or an enterprise Quadro?

If I’m not mistaken, decoding video in the browser / YouTube is always CPU-bound, and having a real GPU won’t make any real difference in this case? @aaron @samuel what’s your take on this?

/Martin

Chrome is capable of offloading certain graphical operations to the GPU via WebGL, as well as decoding video. But again, whether this is what’s causing the increase in load depends on usage patterns. It may be something else.

VirtualGL has just received an EGL front-end, which allows it to be used with recent versions of Chrome and Firefox for GPU offload.
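As a rough sketch, the EGL back end is selected by pointing VirtualGL at a DRM device instead of an X display; the device path below is an assumption and varies per system, and the browser may need additional flags depending on version:

vglrun -d /dev/dri/card0 google-chrome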

As far as GPU selection is concerned, this is really outside the scope of ThinLinc. Some investigation would be required to determine which is best suited for your particular use case.

We just had the issue again today; the load went as high as 900!

We killed all Chrome processes with:

ps ax|grep /opt/google/chrome/chrome | awk '{print $1}' | xargs kill -9
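Side note: a grep pipeline like that can end up matching its own grep process; pkill does the PID lookup in one step and avoids that:

pkill -9 -f /opt/google/chrome/chrome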

The load decreased from 900 to 200 instantly, but as users logged in again, it came back…

I rebooted the tl-alpha node, and the load on tl-beta is now 20.
But the load came back up to 150 on tl-alpha.

As Google Chrome is the only app our users run in ThinLinc, we tried to identify what is consuming everything, but we cannot tell which Chrome tabs are playing YouTube.
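Something like this at least ranks the Chrome processes by CPU, even if it does not map them to individual tabs:

ps -eo pid,user,pcpu,args --sort=-pcpu | grep '[c]hrome' | head

(Chrome’s built-in task manager, opened with Shift+Esc, does show per-tab CPU and process IDs, but it would have to be checked inside each user’s session.)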

I’m pretty sure this is the bottleneck, but I cannot confirm it 100%.

One hour later it is back to its normal state: loads of 60 and 45 on the two nodes. But maybe not for long.

I understand this isn’t a solution, but have you tested switching to a different browser like Firefox? It might be worth investigating a bit.

We do not intend to use Firefox; we decided internally to use Google Chrome as the default browser. Also, in our past experience, the resource consumption of the two is about the same.

What I do not understand is that we tested the solution on virtual machines, with two servers in the same configuration, before deploying to production.

The two VMs each had 48 GB of RAM and 8 vCPUs allocated, and there were 15-20 people on the solution.

Now, on two physical servers, each with 500 GB of RAM and two sockets of 20 cores / 40 threads, it cannot handle 40 users.

Yesterday the same number of users was connected, and no issues were reported.

Also, a speedtest run from inside Chrome shows a download bottleneck,

while a speedtest from the server’s command line reports 800/800 Mbps down/up.

I understand the issue is not ThinLinc itself; we will try to find what is causing this, and add two more servers if needed. But even when the load is high, the CPUs in top appear almost idle, only 5-6% used.

I have monitored resource consumption:

I found the script here:

#!/bin/sh
#
# print total CPU usage in percent of currently logged in users.
#
# 1st column: user
# 2nd column: aggregated CPU usage
# 3rd column: normalized CPU usage according to the number of cores
#
# to sort by CPU usage, pipe the output to 'sort -k2 -nr'
#

set -e

own=$(id -nu)
cpus=$(lscpu | grep "^CPU(s):" | awk '{print $2}')

for user in $(who | awk '{print $1}' | sort -u)
do
    # print other user's CPU usage in parallel but skip own one because
    # spawning many processes will increase our CPU usage significantly
    if [ "$user" = "$own" ]; then continue; fi
    (top -b -n 1 -u "$user" | awk -v user="$user" -v CPUS="$cpus" 'NR>7 { sum += $9; } END { print user, sum, sum/CPUS; }') &
    # don't spawn too many processes in parallel
    sleep 0.05
done
wait

# print own CPU usage after all spawned processes completed
top -b -n 1 -u "$own" | awk -v user="$own" -v CPUS="$cpus" 'NR>7 { sum += $9; } END { print user, sum, sum/CPUS; }'
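As the header comment says, sorting happens outside the script; assuming it is saved as cpu-per-user.sh (the name is just an example), that would be:

sh cpu-per-user.sh | sort -k2 -nr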

Only 5 active users:
[screenshot: per-user CPU usage from the script]

But the load is 166:

It is messed up! ^^


Average CPU usage (2 x 20C/40T): 1.9%
Load average: 111.

And we cannot even click on a tab or navigate.

Pretty high CPU wait % there… you could investigate a bit further whether it might be related to disk I/O, perhaps?
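If you want to dig into that: iostat from the sysstat package shows per-device utilization and wait times, and ps can list tasks stuck in uninterruptible sleep (state D), which is what drives the load average up while the CPUs stay idle:

iostat -x 2
ps -eo pid,stat,wchan:32,args | awk '$2 ~ /^D/'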

Yes, we just “found”, or rather now suspect, that it may be because of the NFS server hosting the central /home directory.

It is a QNAP TS-469L with only 1 GB of memory, and it seems to be the bottleneck.
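To confirm it, we will look at the per-mount NFS latency; if I understand correctly, nfsiostat from nfs-utils reports read/write round-trip times per mount (the package name may differ per distribution):

nfsiostat 5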

I will keep you updated after some testing tonight, when employees are out of the office.

And I will offer my apologies for bothering the forum if this turns out to be the problem, because it would be our own fault: we bought powerful hardware for the servers, but not for /home.


Hello @s0p4L1n

No need to apologize at all. This thread can certainly be useful to someone in the future, especially for the thought process of ruling things out when debugging such issues.

Please keep us updated, and thank you for sharing your case!

Kind regards,
Martin

Hello, the central NFS home directory has been migrated to another server.
It has now been 2 hours with 40 users load balanced between the two nodes, and the load is very low.

I will wait 1-2 more days, but to me it is clear that the QNAP NAS was the bottleneck.

Thank you for your patience and your help.
