Welcome back to our semiweekly series from Senior Systems Engineer Jim Taylor, bringing you some quick IT administration tips. This week, Jim helps you understand load averages and how to check resource consumption on a Linux server using the top, htop, or uptime commands.
While you might have some fancy monitoring tools in place for your servers, checking CPU utilization can also be as simple as running a quick command.
The top command is similar to Task Manager in Windows or Activity Monitor in macOS. It shows a continuously updating view of processor utilization, including the tasks and applications that are currently consuming the most CPU cycles. The command can also sort running tasks by memory consumption or run time.
You can add a variety of options to the command line to influence the displayed results (the top man page lists them), but I find it is often easier to read the output of htop.
To install the htop command in Debian, just enter sudo apt-get install htop. After this, you can run the htop command from your terminal command line. The result is a text graph showing total consumption, as well as a list of current tasks. You can select tasks using the up and down keys and kill them with F9. F7 raises the priority of the selected task (it lowers the nice value, which usually requires root) and F8 lowers the priority, allowing you to give more CPU time to an important app. Press F6 to choose which column sorts the overall list.
Both top and htop will also display the load average. The load average consists of three averages, displayed as decimal numbers like this:
Load average: 0.05, 0.07, 0.02
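If you want to use these numbers in a script, they are easy to pull apart. The following is a minimal Python sketch that parses a line in the format shown above and, where the platform supports it, reads the live values with os.getloadavg(). The function name parse_load_average is only an illustrative choice.

```python
import os

def parse_load_average(line):
    """Return the 1-, 5- and 15-minute values from a 'Load average: a, b, c' line."""
    _, _, values = line.partition(":")  # keep only the text after the colon
    one, five, fifteen = (float(v) for v in values.split(","))
    return one, five, fifteen

print(parse_load_average("Load average: 0.05, 0.07, 0.02"))

# os.getloadavg() is part of the standard library on Unix-like systems;
# it raises OSError if the load average cannot be obtained.
try:
    print(os.getloadavg())
except (AttributeError, OSError):
    print("load average not available on this platform")
```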
One other common command, available on most Linux distributions, will also display the load average: uptime. This is used to find out how long your system has been running (a nice bragging point if you’ve had a server on for several years without any outages, which at Green House Data is of course our constant goal). Uptime displays the current time, how long the system has been running, the number of users logged into the system, and the load average.
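As a rough illustration of what uptime reports, here is a Python sketch that picks the user count and load average out of one line of its output. The SAMPLE line is made up for the example, and the exact layout of uptime output varies between systems, so treat the regular expressions as assumptions rather than a complete parser.

```python
import re

# Hypothetical uptime output, invented for this example.
SAMPLE = " 14:02:11 up 412 days,  3:17,  2 users,  load average: 0.05, 0.07, 0.02"

def parse_uptime(line):
    """Extract the logged-in user count and the three load averages, if present."""
    users = re.search(r"(\d+)\s+users?", line)
    load = re.search(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    return {
        "users": int(users.group(1)) if users else None,
        "load": tuple(float(x) for x in load.groups()) if load else None,
    }

print(parse_uptime(SAMPLE))
```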
Load averages are like traffic over a bridge, where your CPU is the bridge and the cars are tasks waiting to be completed.
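Those three numbers in the load average are the average load on the CPU over the last 1 minute, 5 minutes, and 15 minutes. Linux measures this as the average number of tasks running or waiting to run (it also counts tasks in uninterruptible sleep, typically waiting on disk I/O), so on a single core it reads much like a utilization figure until it passes 1.0.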
The decimal represents how much of the CPU's capacity is in demand over that time period. So if it is sitting at 0.0, no applications are using or waiting for the CPU (this is highly unlikely). If it is at 1.0, your CPU is at capacity. This isn’t necessarily a problem, but if any other tasks are launched, performance is going to suffer.
If your load average is over 1.0, you have a run queue. The amount past 1.0 is roughly your run-queue length, or the number of tasks or processes that are waiting to run on your CPU. If the average is 2.0 or above, your machine might have trouble even responding to the top or htop command, as it has run out of available CPU resources and has many tasks waiting to run.
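The arithmetic is simple enough to sketch in a few lines of Python. The helper name waiting_tasks and its capacity parameter are illustrative assumptions; the default of 1.0 matches the single-CPU case described above.

```python
def waiting_tasks(load_average, capacity=1.0):
    """Estimate the run-queue length: the part of the load beyond CPU capacity."""
    return max(0.0, load_average - capacity)

for load in (0.5, 1.0, 2.5):
    print(load, "->", waiting_tasks(load))
```

A load of 2.5 on a single CPU works out to roughly one and a half tasks waiting at any given moment, while anything at or below 1.0 leaves no queue.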
If you have a consistent average of over 0.7 for your CPU loads, you probably need to check out your system performance. Something is using too many resources for you to maintain stability.
One wrinkle in all this monitoring advice is servers with multiple cores. If you have 2 cores, the capacity increases to a 2.0 load average, because your total available CPU cycles have doubled. With a quad-core processor, the point at which to start worrying is therefore closer to 2.8, which is our previous 0.7 warning average multiplied by 4 cores.
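The scaling can be written out as a small Python sketch. The function name warning_threshold is hypothetical, and the 0.7 default is simply the rule of thumb used in this post, not a fixed standard.

```python
def warning_threshold(cores, per_core=0.7):
    """Scale the per-core warning level by the number of cores."""
    return per_core * cores

for cores in (1, 2, 4):
    print(cores, "core(s): capacity", float(cores),
          "warning at", round(warning_threshold(cores), 2))
```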
A simple command to see the number of cores is grep -c 'model name' /proc/cpuinfo, which counts the logical cores the kernel reports. If you want to know what is under the hood, you can also use less /proc/cpuinfo. Running htop will also show you the number of CPUs at the top of your display.
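The same count can be reproduced in Python. The sketch below mirrors what grep -c does by counting lines that contain 'model name', using a made-up two-core excerpt (SAMPLE) in place of the real file, and then prints os.cpu_count() for comparison. The function name count_model_names is an illustrative choice.

```python
import os

def count_model_names(cpuinfo_text):
    """Count lines containing 'model name', as grep -c 'model name' would."""
    return sum(1 for line in cpuinfo_text.splitlines() if "model name" in line)

# Invented excerpt in the style of /proc/cpuinfo, for illustration only.
SAMPLE = """processor\t: 0
model name\t: Example CPU
processor\t: 1
model name\t: Example CPU
"""

print(count_model_names(SAMPLE))
print(os.cpu_count())  # logical CPUs seen by Python; may be None if unknown
```

To run it against a real system, pass the contents of /proc/cpuinfo to count_model_names instead of SAMPLE.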
Armed with the number of cores in your processor, you’ll have a better idea of whether your load averages are inching into dangerous territory. Focus more on the last two numbers of the load average (the 5-minute and 15-minute figures), as brief 1-minute spikes in consumption are not as big an issue.
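Putting the pieces together, a check along these lines could be scripted. This Python sketch ignores the 1-minute figure and flags a problem only when both the 5-minute and 15-minute averages exceed 0.7 per core. Requiring both is one possible reading of "consistent", and the name sustained_overload is hypothetical.

```python
def sustained_overload(load, cores, per_core=0.7):
    """True when both the 5- and 15-minute averages exceed the per-core warning level."""
    _one, five, fifteen = load  # the 1-minute value is deliberately ignored
    limit = per_core * cores
    return five > limit and fifteen > limit

print(sustained_overload((3.9, 0.8, 0.6), cores=4))  # brief 1-minute spike only
print(sustained_overload((3.1, 3.0, 2.9), cores=4))  # high across all three windows
```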
Posted by: Systems Engineer Jim Taylor