
Amazon apologises to customers impacted by huge AWS outage

Amazon Web Services (AWS) has apologized to customers affected by Monday’s massive outage, which took some of the world’s largest platforms offline.

Snapchat, Reddit and Lloyds Bank were among more than 1,000 sites and services reported to be down on October 20 as a result of issues at the heart of the cloud computing giant’s operations in Northern Virginia, US.

In a detailed summary of the cause, Amazon said the outage resulted from errors that left its internal systems unable to connect websites to the IP addresses computers use to find them.

“We apologize for the impact this incident has had on our customers,” the company said.

“We know how critical our services are to our customers, their applications, their end users and their businesses.

“We know this incident impacted many customers in significant ways.”

While many platforms, such as online games Roblox and Fortnite, were operational again within a few hours, some services experienced longer outages.

These included Lloyds Bank, where some customers experienced problems into the afternoon, as well as US payments app Venmo and social media site Reddit.

The outage had a far-reaching impact; it was even reported to have disrupted the sleep of some smart bed owners.

Eight Sleep, which makes sleep “capsules” with temperature and height options that require an internet connection, said it would work to make its beds “interrupt-proof” after they overheated slightly and even got stuck in an inclined position.

Many experts said the outage showed how reliant technology is on Amazon, which dominates the cloud computing industry, a market largely cornered by AWS and Microsoft Azure.

The company also said it would “do everything we can” to learn from the event and improve its availability.

In a long summary of Monday’s outage, Amazon said the problem occurred in US-East-1, its largest data center cluster, which powers much of the internet.

Critical processes in the database that stores and manages the region’s Domain Name System (DNS) records, which allow computers to translate website URLs into addresses, had effectively fallen out of sync.

According to Amazon, this triggered a “hidden race condition”; in other words, it exposed a dormant bug that surfaces only under an unexpected sequence of events.

The delay in the process, which Amazon said occurred in the early hours of Monday morning, had a knock-on effect that caused systems to stop working properly.

Most of this process is automatic, meaning it is done without human intervention.
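To make the idea of a race condition concrete, here is a minimal sketch in Python. It is not Amazon's system: the record store, the hostname `db.example.internal`, the version numbers and both `apply_*` functions are hypothetical, chosen only to show how a delayed automated update can overwrite a newer one when nothing checks the order of arrival.

```python
from dataclasses import dataclass, field


@dataclass
class RecordStore:
    """Toy stand-in for a DNS record store (hypothetical, not Amazon's design)."""
    records: dict = field(default_factory=dict)  # hostname -> (version, ip)


def apply_unchecked(store, host, version, ip):
    # Bug: whichever update arrives last wins, even if it is older.
    store.records[host] = (version, ip)


def apply_checked(store, host, version, ip):
    # Guard: refuse to overwrite a newer version with an older one.
    current = store.records.get(host)
    if current is None or version > current[0]:
        store.records[host] = (version, ip)


def replay(apply, arrival_order, host="db.example.internal"):
    """Apply updates in the given arrival order and return the final record."""
    store = RecordStore()
    for version, ip in arrival_order:
        apply(store, host, version, ip)
    return store.records[host]


# Usual case: updates arrive in the order they were produced.
normal_order = [(1, "10.0.0.1"), (2, "10.0.0.2")]
# Rare case: the worker carrying version 1 is delayed, so version 2 lands first.
delayed_order = [(2, "10.0.0.2"), (1, "10.0.0.1")]

print(replay(apply_unchecked, normal_order))   # (2, '10.0.0.2')
print(replay(apply_unchecked, delayed_order))  # (1, '10.0.0.1'), a stale record
print(replay(apply_checked, delayed_order))    # (2, '10.0.0.2')
```

The unchecked version behaves correctly for as long as updates arrive in order, which is why a bug of this kind can lie dormant until an unusual delay occurs.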

Dr. Junade Ali, a software engineer and researcher from the Institution of Engineering and Technology, told the BBC that “faulty automation” was at the root of Amazon’s problems.

“The specific technical reason is that faulty automation corrupted the trusted internal ‘address book’ systems in that area,” he said.

“So they couldn’t find one of the other key systems.”

Like others, Dr. Ali believes the outage underscores the need for companies to be more resilient and to diversify their cloud service providers, so they can “fail over when other data centers and providers are not available.”

“In this example, those with a single point of failure in this Amazon region were in danger of being taken offline,” he said.
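The failover idea Dr. Ali describes can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production pattern: the provider names, the `ProviderUnavailable` exception and the two stub functions are all hypothetical, standing in for real calls to separate regions or cloud providers.

```python
class ProviderUnavailable(Exception):
    """Raised by a (hypothetical) provider call when it cannot serve a request."""


def call_with_failover(providers, request):
    """Try each (name, callable) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")  # remember why this one failed
    raise ProviderUnavailable("all providers failed: " + "; ".join(errors))


def primary_region(request):
    # Stub simulating the affected region being down.
    raise ProviderUnavailable("cannot resolve internal endpoint")


def secondary_provider(request):
    # Stub simulating a healthy backup in another region or provider.
    return f"served {request} from backup"


providers = [("primary", primary_region), ("secondary", secondary_provider)]
print(call_with_failover(providers, "/balance"))
# ('secondary', 'served /balance from backup')
```

A service configured with only the first entry in `providers` has a single point of failure: when that call fails, there is nothing left to try.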
