There are outages that are remembered for a few hours. Then there are those that become an internal issue, an operational reminder, a lesson never to be forgotten. And then there are the episodes that deserve to be recounted publicly, because they stop being a simple technical incident and become a vivid demonstration of a much broader problem: the fragility of certain cloud services when commercial theory clashes with practice, and when support stops being support and turns into a referral switchboard.
VPS Downtime in Cloud.it
This story begins on Saturday, April 4, 2026, at 2:49 PM. At that moment the cloud instance on the Aruba Cloud.it service, with the node name “MANAGEDSERVER.IT”, becomes completely unreachable. We're not talking about a degraded service, abnormal latency, or a few lost packets. We're talking about a total blackout. No reachability, no operation, no real control from the control panel. The ticket is opened at 3:01 PM. From there begins a sequence of hours that turn into days, until service is finally restored on April 7, 2026, at 10:30 AM. In total, 67 hours and 41 minutes of downtime.
Sixty-seven hours. SIXTY-SEVEN! More than two and a half days. In 2026. On a cloud service.
This alone should be enough to raise more than a few eyebrows.
The non-existent management of the problem
But the point isn't just the failure. Failures happen. They happen to everyone; they happen to Google, Microsoft, AWS. They happen. A physical node problem happens, a storage fault happens, an event triggers a serious outage. No one who actually works in infrastructure is shocked by the abstract idea that something could break. The point is something else: what happens next.
And this is where the matter stops being a banal interruption of service and becomes an exemplary misadventure.
Because in the hours after opening the ticket, and then in the dozens of hours that followed, the overwhelming feeling wasn't that of a supplier dealing with a serious but managed problem. The feeling was that of being plunged into a gray area of automated formulas, empty reassurances, and an almost total absence of useful technical information. "We're checking." "We've contacted you." "We'll contact you." Phrases that serve to occupy space, not fill it. Phrases that serve more to buy time than to restore visibility to the customer. And meanwhile, time passed, the system remained frozen, and no concrete explanation arrived from the other end regarding the cause of the outage, the true status of operations, the critical issues encountered, or even a minimally reliable estimate of restoration times.
This is what irritates the most about the outage. Infrastructure can fail. But the management of that failure can't afford to fail in return.
After more than 18 hours of blackout, a certified email is sent with a formal warning and notice of default. Then, after 33 hours, another detailed reminder. Then, after 50 hours, another certified email reiterating the obvious: it's not just the resolution that's missing, but communication as well. What's lacking is the bare minimum expected when dealing with a provider selling a cloud service intended, at least in its presentation, for professional use.
At one point, two alternative routes were even requested, both reasonable: either restore the node, or deliver a snapshot of the machine in OpenStack format, allowing for an autonomous migration. Nothing. Neither. Just that toxic limbo where the customer doesn't have the machine, doesn't have the data, doesn't have an ETA, doesn't have a technical contact to tell them what's happening, but keeps hearing that "they'll get back to them."
It's in moments like these that the cloud shows its worst side. When it works, everything is convenient: scalability, flexibility, rapid provisioning, dashboards, automation, marketing. But when something really breaks, and it really breaks, the customer suddenly discovers the most uncomfortable part of the deal: the infrastructure isn't theirs, control isn't theirs, access to the physical layer isn't theirs, and the speed of response depends entirely on the provider's expertise, transparency, and reliability. If these three things are missing, the cloud stops being a convenience and becomes a cage.
In our case, the difference between a major outage and much more serious damage came from a factor external to the provider. After an initial downtime of about forty minutes, having noted the lack of any useful feedback and sensing that there would be no quick recovery, we activated our ZFS replication on Hetzner. That is what saved us. Not the service's native resilience. Not the support. Not the phantom protective umbrella evoked by the term "cloud." Salvation came from an independent replica, geographically separate, outside of that infrastructure, ready for use. Thanks to that replica, the institutional website came back online after the first forty minutes of downtime, while the provider had still given no indication of a timely resolution.
And this is where we need to stop for a moment, because the lesson is important.
The cloud isn't backup. The cloud isn't disaster recovery. The cloud isn't geographic replication. The cloud, by itself, is simply someone else's computer. And when that computer disappears for nearly three days, suddenly all the glossy talk about business continuity begins to show its cracks.
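For context, the replica that made the difference is nothing exotic. What follows is a minimal sketch of an off-site ZFS replication cycle of the kind described above, wrapped in Python; the pool, dataset, and host names are hypothetical, since this article does not disclose the actual setup, so treat it as an illustration rather than our exact configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch of an off-site ZFS replication cycle (illustrative only)."""
import subprocess
from datetime import datetime, timezone

DATASET = "tank/vhosts"                 # hypothetical local dataset
TARGET = "backup@replica.example.net"   # hypothetical off-site receiver
TARGET_DATASET = "backup/vhosts"        # hypothetical remote dataset

def latest_snapshot(dataset: str) -> str:
    """Return the most recent snapshot of a dataset, or "" if none exists."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
         "-s", "creation", "-d", "1", dataset],
        capture_output=True, text=True, check=True).stdout.split()
    return out[-1] if out else ""

def replicate() -> None:
    # Take a new snapshot named after the current UTC time.
    prev = latest_snapshot(DATASET)
    snap = f"{DATASET}@repl-{datetime.now(timezone.utc):%Y%m%d%H%M%S}"
    subprocess.run(["zfs", "snapshot", snap], check=True)

    # Incremental send if an earlier snapshot exists (and is also present on
    # the receiver), full send otherwise.
    send_cmd = ["zfs", "send", "-i", prev, snap] if prev else ["zfs", "send", snap]
    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    subprocess.run(["ssh", TARGET, "zfs", "receive", "-F", TARGET_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    replicate()
```

Run from a scheduler every few minutes, a loop like this keeps an independent, up-to-date copy of the data on hardware that shares nothing with the primary provider, which is exactly the property that mattered here.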
The SLA adds insult to injury.
At this point the question of the SLA comes into play, because this is where many companies delude themselves into thinking they're protected. The SLA published by Cloud.it is the document “Aruba Cloud Computing Service Service Level Agreement (SLA)”, public version 1.3, effective from July 31, 2025. The document distinguishes between several scenarios. For cloud services in general, Aruba declares 99.95% uptime on an annual basis for internet accessibility to the Data Center infrastructure and a further 99.95% on an annual basis for the availability of the physical nodes that host the customer's virtual infrastructure. For some specific types, such as VPS OpenStack Starter and VMware VPS, the level drops to 99.8% on an annual basis, both for accessibility and for the availability of physical nodes. Furthermore, scheduled maintenance is not counted and must be communicated with at least 48 hours' notice.
Translated into concrete terms, 99.95% annual uptime means a theoretically tolerated downtime of approximately 4 hours and 23 minutes per year. 99.8% instead means approximately 17 hours and 31 minutes per year. Now, let's put these numbers next to what actually happened: 67 hours and 41 minutes of downtime. Even assuming the most benign scenario for the provider, that is, the broader 99.8% threshold, we're not dealing with a marginal overshoot here. We're dealing with an incident that blows past the promised limit several times over. If the service fell within the standard 99.95% scenario, the overshoot would be even more dramatic. In either case, we're not talking about a small statistical deviation: we're talking about a chasm.
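As a back-of-the-envelope check, and assuming the annual percentages translate linearly into a yearly downtime budget over 365 days, a few lines of Python make the gap explicit:

```python
from datetime import datetime

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

def downtime_budget(uptime_pct: float) -> float:
    """Annual downtime allowed by a given uptime percentage, in hours."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)

# Outage window taken from the timestamps above.
outage = datetime(2026, 4, 7, 10, 30) - datetime(2026, 4, 4, 14, 49)
outage_hours = outage.total_seconds() / 3600

print(f"99.95% budget: {downtime_budget(99.95):.2f} h/year")   # ~4.38 h  (4 h 23 m)
print(f"99.80% budget: {downtime_budget(99.80):.2f} h/year")   # ~17.52 h (17 h 31 m)
print(f"actual outage: {outage_hours:.2f} h")                  # ~67.68 h (67 h 41 m)
print(f"vs 99.8% budget: {outage_hours / downtime_budget(99.80):.1f}x")  # ~3.9x
```

In other words, this single incident consumed roughly four times the entire annual downtime budget of the most permissive tier, and more than fifteen times the standard one.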
But then, concretely, what should one have expected from an SLA of this kind?
First of all, one would have expected the uptime promise not to be a decorative number placed there to decorate the price list, but the expression of a coherent operating model: effective monitoring, serious incident management, clear communications, real escalation channels, and the ability to offer workarounds. An SLA isn't just a percentage. It's the practical implication that, if a serious outage occurs, the provider has a machine ready to handle it. In other words: one doesn't expect the impossible, but one does expect expertise, transparency, and timeframes commensurate with the severity of the event. After a few hours, at the very least, one expects a probable cause. After dozens of hours, one expects a plan. After more than two days, one expects a solution or at least a concrete alternative to allow the customer to overcome the outage. Here, however, what emerged was the opposite: an SLA that, faced with a real incident, prevented neither the prolonged blackout nor the communication gap.
The “non-existent” reimbursement for the damage suffered.
Then there is the theme of the reimbursement, which is perhaps the most ironic part of the whole story. For virtual infrastructure created and allocated by the customer, Article 6.2 provides a credit equal to 5% of the total expenditure generated in the 30 days preceding the outage, or of the previous month for monthly paid services, for each complete 15-minute interval of service disruption beyond the limits set by the SLA, up to a maximum of 300 minutes. Since 300 minutes is equivalent to 20 complete 15-minute blocks, the maximum credit effectively reaches 100% of the expenditure for the affected portion of the virtual infrastructure within the reference period. The credit request must be submitted within 10 days of the end of the outage via ticket, and only service disruptions confirmed by Aruba's monitoring system are considered valid. For monthly paid cloud services, the document also specifies that no further reimbursement is due for the period of inactivity, except for the credit provided for in Article 6.2. The text also adds that, during the period of inactivity, the cloud service does not generate any costs, and that any amounts erroneously charged must be credited back via the management panel.
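To see how quickly that cap is reached, here is a minimal sketch of the Article 6.2 mechanism as summarized above; the 20 euro monthly spend is an assumption for illustration, consistent with the figure mentioned below:

```python
def sla_credit(outage_minutes: float, sla_limit_minutes: float, monthly_spend: float) -> float:
    """Credit per the Article 6.2 mechanism described above: 5% of the previous
    30 days' spend for each complete 15-minute block beyond the SLA limit,
    capped at 300 minutes (i.e. 20 blocks, i.e. 100% of the reference spend)."""
    excess = max(0.0, outage_minutes - sla_limit_minutes)
    billable = min(excess, 300)        # ceiling of 300 creditable minutes
    blocks = int(billable // 15)       # only complete 15-minute blocks count
    return monthly_spend * 0.05 * blocks

outage_minutes = 67 * 60 + 41               # 4,061 minutes of downtime
limit_99_8 = 0.002 * 365 * 24 * 60          # ~1,051 minutes/year allowed at 99.8%
print(sla_credit(outage_minutes, limit_99_8, monthly_spend=20.0))  # 20.0 euros, the cap
```

Twenty complete blocks at 5% each is exactly 100% of the reference spend, and the 300-minute ceiling guarantees the credit can never exceed it, no matter how many days the outage drags on.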
Here the paradox is almost offensive in its clarity.
Since the outage lasted considerably longer than the limits tolerated by the SLA, the credit mechanism almost certainly hits its theoretical maximum: in our case, 20 euros. TWENTY.
In other words, barring exceptions applied by the provider or disputes over how the service is classified, the financial compensation to which one would be entitled would likely reach 100% of the impacted infrastructure spending in the previous 30 days, or 100% of the last monthly fee if it's a monthly service. No more. Not 200%. Not the cost of the damage. Not the lost revenue. Not the wasted man-hours. Not the damage to reputation. Not the operational anxiety. Not the cost of emergency migration. Not the potential loss of positioning. Simply, at most, the equivalent of one month's fee, or the usage generated by the affected block in the reference period.
And it is here that the SLA, as a protection tool, shows all its practical inadequacy.
Because faced with nearly three days of unavailability, with open tickets, certified emails, phone calls, and a total lack of any real technical support visible to the customer, the maximum financial remedy available is essentially a credit that offsets the cost of the service during the reference period: twenty euros, the price of a pizza and a Coke at a suburban pizzeria. A sort of "we'll refund your fee" disguised as contractual protection. From a strictly contractual perspective, the provider will claim to have clearly defined the scope. From an operational perspective, the customer realizes that there is a huge gap between the value of the actual damage and the value of the SLA credit.
Not only that. The SLA itself contains a series of exclusions: force majeure, urgent extraordinary interventions for security or stability, problems attributable to the customer, third-party software malfunctions, failures external to the Aruba network, and so on. These are fairly typical clauses, and there is nothing surprising about them in themselves. But when the financial compensation is already modest, the presence of a broad list of exclusions makes it even clearer that this type of document primarily protects the supplier, and much less the customer.
Conclusions from a bad experience.
Ultimately, then, the real question isn't even whether the disruption was serious. It undoubtedly was. The real question is something else: Does it make sense to use such a service as the sole pillar of an infrastructure, if it is not supported by independent backups, geographical replication and an autonomous disaster recovery strategy?
As far as we are concerned, the answer after this experience is brutal in its simplicity: No. Or rather, not alone. Not without a serious plan B. Not without the awareness that, when the provider fails and support takes refuge behind prefabricated formulas, the only thing that truly saves you is what you've built outside of them.
In our case, it was the ZFS replication on Hetzner that got the institutional website back up and running after the first forty minutes of downtime, even as no meaningful feedback arrived from the other side. It's hard to imagine a harsher verdict than that. For nearly three days, the promise of the cloud evaporated, and business continuity survived only thanks to what had been built outside of it.
Which brings us to the most honest, and also most uncomfortable, conclusion: an SLA may offer you some credit. But it doesn't make up for the lost time, the eroded reputation, the burned trust, and the absurdity of having to chase, for nearly sixty-eight hours, a service that should have simply stayed on.