As summer arrives, outdoor temperatures rise significantly, and while data centers are designed to maintain a controlled environment, outdoor heat can still affect the internal temperatures of servers. This can lead to various issues that, if not handled properly, can cause server slowdowns or even unexpected reboots. In this article, we will explore how summer temperatures can unearth latent problems in server cooling systems and how to address these issues.
Impact of high temperatures in data centers
Data centers are equipped with advanced cooling systems to maintain a stable and safe temperature for servers. However, during the summer, external heat adds to the heat load and puts pressure on these systems, especially in small-business server rooms that fall short of industry standards. Even small increases in temperature can have a significant impact on server components, especially CPUs, which generate a lot of heat during operation.
Common problems caused by heat
- Fan failure (FAN): Fans are essential for dissipating heat from CPUs and other components. Over time, fans can wear out and stop working properly, reducing cooling effectiveness.
- Degraded thermal paste: Thermal paste is used to improve heat transfer between the CPU and the heatsink. If the paste has dried out or no longer makes good contact, cooling efficiency decreases, causing the CPU to overheat.
- Reaching the temperature threshold: Many servers are configured to shut down automatically when the CPU temperature exceeds a set threshold, to prevent damage. If summer temperatures push CPUs beyond this limit, the result can be sudden shutdowns or reboots.
- CPU throttling: When a CPU reaches high temperatures, it may lower its clock speed to cut the heat it generates, a process known as throttling. This can cause significant slowdowns in server performance.
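One way to check whether throttling has actually occurred is to read the counters that the Linux kernel exposes on some hardware. The sketch below assumes an Intel-style system that provides the sysfs file cpuN/thermal_throttle/core_throttle_count; other platforms may not have it, in which case the function simply returns an empty result. The function names are illustrative and not part of any tool.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_name: throttle_event_count} for CPUs exposing a counter.

    Assumption: the Intel-style sysfs file
    cpuN/thermal_throttle/core_throttle_count is present.
    CPUs without it are skipped.
    """
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/core_throttle_count"
    for counter in Path(cpu_root).glob(pattern):
        try:
            # counter.parent.parent is the cpuN directory
            counts[counter.parent.parent.name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed counter: ignore this CPU
    return counts


def newly_throttled(before, after):
    """List CPUs whose counter grew between two snapshots."""
    return sorted(cpu for cpu, n in after.items() if n > before.get(cpu, 0))
```

Taking one snapshot before a hot afternoon and one after, then passing both to newly_throttled, gives a rough indication of which cores were slowed down by heat in between.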
Diagnosing heat-related problems
Diagnosing heat-related problems can be relatively simple in person by directly observing the server's physical components. However, for an inexperienced user or system administrator, it can be more difficult to identify these problems without the proper tools. This is where software tools like lm_sensors prove useful.
What is lm_sensors?
lm_sensors is an essential software tool for monitoring temperature, voltage and fan speed on Linux systems. This tool allows you to obtain real-time data from sensors integrated into server hardware components, making it easier to diagnose overheating and cooling problems. lm_sensors is especially useful for system administrators who want to keep their hardware in optimal condition, preventing failures due to overheating or fan malfunctions.
Installing lm_sensors
Installing lm_sensors varies depending on the Linux distribution you use. Below, we provide instructions for the main families of distributions: Red Hat derivatives (such as CentOS and Fedora) and Debian derivatives (such as Ubuntu).
Red Hat-derived distributions
To install lm_sensors on Red Hat-based distributions, such as CentOS, Fedora, or RHEL, you can use the package manager yum or dnf. The package is named lm_sensors.
Debian-derived distributions
To install lm_sensors on Debian-based distributions, such as Ubuntu and Debian itself, you can use the package manager apt. The package is named lm-sensors.
Functions of lm_sensors
- Temperature monitoring: Provides temperature readings for components such as the CPU, GPU and motherboard.
- Voltage monitoring: Reports supply voltages so you can verify they are within safe operating limits.
- Fan speed monitoring: Measures the speed of your fans to make sure they are working properly.
- Threshold configuration: Allows you to set temperature and voltage thresholds to activate alarms in the event of abnormal values.
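The readings listed above can also be collected by a script. The following is a minimal sketch, assuming lm-sensors 3.5.0 or newer, where the sensors command accepts the -j option for JSON output. The function names and the shape of the returned dictionary are choices made for this example, not something lm_sensors defines.

```python
import json
import subprocess


def flatten_readings(data):
    """Flatten `sensors -j` style JSON into {(chip, feature): {kind: value}}.

    kind is the subfeature suffix, e.g. "input", "max", "crit".
    An extra "type" key holds the sensor family: "temp", "fan" or "in".
    """
    readings = {}
    for chip, features in data.items():
        for feature, subfeatures in features.items():
            if not isinstance(subfeatures, dict):
                continue  # skip the "Adapter" description string
            values = {}
            for name, value in subfeatures.items():
                # "temp1_input" -> prefix "temp1", kind "input"
                prefix, _, kind = name.partition("_")
                values["type"] = prefix.rstrip("0123456789")
                values[kind] = value
            readings[(chip, feature)] = values
    return readings


def read_sensors():
    """Run `sensors -j` and return the flattened readings."""
    out = subprocess.run(
        ["sensors", "-j"], capture_output=True, text=True, check=True
    )
    return flatten_readings(json.loads(out.stdout))
```

Keeping the parsing separate from the subprocess call makes it possible to test the logic on a saved JSON sample without needing the sensor hardware.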
Case study: Analysis of the uploaded image
The image below shows an example of the output of the sensors command on a Linux system. This system had rebooted itself twice in one morning. We analyze the data to identify the problem.
Detailed analysis
- CPU temperature: One of the first indicators of overheating problems is the temperature of the CPU. In the image, we see that the CPU temperature (CPUIN) is extremely high, reaching 90.0°C. This value far exceeds the alarm threshold set at 80.0°C. The alarm threshold is a predefined limit that, if exceeded, indicates that the CPU is operating at a dangerously high temperature. Exceeding this limit not only reduces server performance but can also permanently damage hardware components. Such significant overheating suggests that the cooling system is not working properly.
- Fans (FAN): Another crucial aspect to consider is the operation of the fans. Fans are responsible for maintaining a safe operating temperature for the CPU and other components by dissipating heat generated during operation. In the output, we notice that all fans (fan1, fan2, …, fan7) show a speed of 0 RPM. This is a clear sign that the fans are not working. With no fans spinning, there is not enough air circulation to cool the server's internal components, which quickly leads to overheating.
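The two checks described above (temperature against its alarm threshold, and fans reporting 0 RPM) can be expressed as a short script. This is only a sketch: the sample values are re-typed from the description in this case study rather than captured from the actual system, and the function name and data layout are assumptions made for the example.

```python
def find_problems(temps, fans):
    """Return a list of human-readable findings.

    temps: {label: (current_celsius, alarm_threshold_celsius)}
    fans:  {label: rpm}
    """
    findings = []
    for label, (current, limit) in temps.items():
        if current >= limit:
            findings.append(
                f"{label}: {current:.1f} C is at or above the {limit:.1f} C threshold"
            )
    # A fan reporting 0 RPM is treated as stopped
    stopped = [label for label, rpm in fans.items() if rpm == 0]
    if stopped:
        findings.append("stopped fans: " + ", ".join(stopped))
    return findings


# Illustrative values taken from the case study text, not a real capture
temps = {"CPUIN": (90.0, 80.0)}
fans = {f"fan{i}": 0 for i in range(1, 8)}
for line in find_problems(temps, fans):
    print(line)
```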
Diagnosis
The main problem in this case is the broken fans, which led to the CPU overheating. With all fans stopped, the heat generated by the CPU is not effectively dissipated, causing the temperature to rapidly rise to critical levels. This triggered the server's automatic shutdown mechanism to prevent permanent damage, leading to sudden reboots.
Solutions and recommendations
- Replacing the fans: The immediate solution is to replace the failed fans to restore adequate airflow and cooling.
- Checking the thermal paste: Check the condition of the thermal paste and replace it if necessary to improve heat dissipation.
- Continuous monitoring: Use tools like lm_sensors to constantly monitor temperatures and fan speeds, setting alarms to prevent future overheating problems.
- Power inspection: Check the power supply voltages to make sure there are no problems with the power supply or power distribution.
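The continuous monitoring recommended above could be sketched as a simple polling loop. The example below is a minimal sketch, not a replacement for a full monitoring system. The read_temp and on_alarm callables are hypothetical hooks supplied by the caller (for instance, a function that reads the CPU temperature through lm_sensors and one that sends a notification), and the default limit of 80.0 °C is simply the alarm threshold from the case study.

```python
import time


def monitor(read_temp, on_alarm, limit_c=80.0, interval_s=30.0,
            max_polls=None, sleep=time.sleep):
    """Poll read_temp() and call on_alarm(temp) whenever temp >= limit_c.

    max_polls=None loops forever; a number stops after that many polls.
    Returns the number of alarms raised.
    """
    alarms = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        temp = read_temp()
        if temp >= limit_c:
            on_alarm(temp)
            alarms += 1
        polls += 1
        # Wait between polls, but not after the final one
        if max_polls is None or polls < max_polls:
            sleep(interval_s)
    return alarms
```

Passing the sleep function as a parameter is a design choice that lets the loop be exercised quickly in a test; in normal use the default time.sleep applies.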
Conclusion
Summer temperatures can take a toll on servers, even in the best-equipped data centers. Issues like broken fans and spent thermal paste can go unnoticed until external heat brings them to light, causing sudden slowdowns and restarts. Using tools like lm_sensors, it is possible to monitor the condition of hardware components in real time and intervene promptly to avoid damage and service interruptions. Preventive maintenance and continuous monitoring are essential to ensure servers run smoothly even in the most extreme conditions.