
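Today, as an Azure user, I received an email from Azure Service in my personal inbox, explaining and apologizing for last week's large-scale outage of the Azure cloud platform (http://www.cnbeta.com/articles/347413.htm). The email linked to a blog post written by a CVP that gives a full account of the incident: http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/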

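The outage had a large impact: it hit 12 of Azure's 15 data centers (all except Brazil, Australia, and China). The affected data centers were hit hard. Everything from virtual machines to websites and many other higher-level services was fully or partially unavailable, and even the dashboard that customers use to check Azure's health did not work properly. Only now have the lingering problems from the incident gradually been cleaned up.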



  • “Why would you then attempt the upgrade across data centres at the same time? A very important service of ours failed even though we have both north and west Europe deployments. We pay a premium for this and it is now shown to be wasted money.”
  • “Why was the dashboard marked all green for Azure West despite our TAM and TAs informing us that the issue was ongoing? We rely on the dashboard to know what the health of Azure is, it seems like its more of a PR stunt now.”
    “There was an Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard. As a mitigation, we leveraged Twitter and other social media forums.”

  • “The platform & services are really ingenious and I still trust in this, but you’ll have to improve your behavior much more than a 100% instead of the performance of services.”
  • “For comparison to this anemice blog here is Amazon’s response to a smaller outage they had.
    Its like a 8th grade English paper vs a doctorate thesis.”
  • “During the outage yesterday it wasn’t even possible to kill a server, which means it wouldn’t even be possible to pack up our things and move to a different provider.”
  • “It seems unless you pay them for a ‘subscription’, you cant report anything to anyone about something not working, because of a fault on MS/Azure side.
    I’m quite competent, and setup and run my own servers for many years, and I dont think I should have to ‘pay’ to let MS know about problems. I’ve talked to their support in ‘India’, and they seem to only have a very basic understanding of computing or anything, and really couldnt grasp what I’ve been trying to tell them about.
    I’m only running web small sites on Azure, but I would be a bit hesitant of running anything more, as it could just stop working, and then you have no one to talk to or help. 🙁 I feel helpless using Azure when there are problems.”

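    This commenter is wary of using Azure because, once something goes wrong, you find yourself helpless and do not know whom to ask for help. (I don't quite understand what he means by pay them for a ‘subscription’; is customer support billed separately?) Either way, judging from my own experience calling customer service the last time the Office 365 service had a problem, the sense of helplessness is very real.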
  • “the fact that this blog was NOT posted on the Azure twitter account, and the complete lack of interaction with customers on the twitter is shameful. We understand that shit happens, but the human side of this could have been handled much better. How hard is it to get a staff person to sit on twitter and answer peoples questions, provide updates, or at least some sign of life. We were hit at a TERRIBLE time and lost huge amounts of traffic. I really like Azure but the total lack of support and communication here really makes me question why I am paying for this service.”
    “This is a press-conference level event, and you should have staffed PIO positions to keep us in the loop. On a side note – You might want to apologize on twitter, and fire your marketing team who started in on promo tweets without even sending out a link to this blog (at the time of this comment…).”

  • “Actually the problem is down to centralization. There are plusses and minuses to that. When many elements in a system are dependent on one thing, it offers a lot of advantages in terms of shared resource, but puts that central thing on a very critical path. This is essentially why the central planning method of the soviet socialist era became too expensive to sustain. The original idea of the internet was decentralization. Instead the state-corporate machine has co-opted that into a centrally planned economy. Ironic really.”


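  1. The engineers do have a watchdog, but by the time they saw the watchdog alert and worked out what was going on, it was already too late. This issue was a nasty one: once storage went down, many VMs simply hung, and even rolling storage back afterwards could not save them.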
