A look at what Azure users are complaining about

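Today, as an Azure user, I received an email from Azure Service in my personal inbox explaining and apologizing for last week's large-scale outage of the Azure cloud platform (http://www.cnbeta.com/articles/347413.htm). The email linked to a blog post written by a CVP that gives a full account of the incident: http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/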

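The outage had a fairly large impact: it hit 12 of Azure's 15 data centers (all except Brazil, Australia, and China). The affected data centers were hit hard. Virtual machines, websites, and many other higher-level services were fully or partially unavailable, and even the dashboard that customers use to check Azure's health stopped working properly. Only now have the various aftereffects of the incident finally been cleaned up.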

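The cause is simple. The storage service shipped a fix to improve performance. After several weeks of small-scale testing with no problems, the upgrade was deployed to all data centers at the same time. The upgrade changed a configuration setting on the storage service's front-end servers, which triggered a bug that left the servers unable to respond. The painful part is that the failing component was the storage service, which almost every other service depends on, so they all went down with it.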
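The difference between deploying to every data center at once and rolling out in stages can be sketched in a few lines. This is only a minimal illustration, not Microsoft's actual deployment tooling: the function `staged_rollout`, the `deploy` and `is_healthy` callbacks, and the region names are all hypothetical.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    waves: Iterable[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy wave by wave and stop after the first unhealthy wave.

    Returns the regions that received the change.
    """
    touched: List[str] = []
    for wave in waves:
        for region in wave:
            deploy(region)
            touched.append(region)
        # Check health before moving on, so a bad change stays inside one wave.
        if not all(is_healthy(region) for region in wave):
            break
    return touched


if __name__ == "__main__":
    # Hypothetical scenario: the change breaks every region it reaches.
    waves = [["region-a"], ["region-b", "region-c"], ["region-d", "region-e"]]
    broken = set()
    touched = staged_rollout(
        waves,
        deploy=broken.add,
        is_healthy=lambda region: region not in broken,
    )
    print(touched)  # only the first wave was affected
```

With a single wave containing every region, the same function degenerates into the "deploy everywhere at once" case described above, and the health check only runs after every region has already been changed.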

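I read the blog post above carefully, along with the many complaints from Azure customers in the comments below it, and it left a deep impression. Here are some of the ones that stood out. I believe every company's cloud service has similar problems to some degree, which is also why many people and companies remain conservative and hesitant about the cloud. Reading the complaints below may give a better sense of what customers actually think and what better service looks like.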

  • Why would you then attempt the upgrade across data centres at the same time? A very important service of ours failed even though we have both north and west Europe deployments. We pay a premium for this and it is now shown to be wasted money.
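    Many people asked why the upgrade was rolled out to multiple data centers at the same time. Customers had already paid a premium to deploy across several data centers for geo-redundant disaster recovery, yet they were still hit. The geo-redundancy did nothing at all; it was money wasted.
    -- Someone else chimed in: hold on, weren't 3 of the 15 data centers still fine? Is Microsoft suggesting we deploy in all 15 locations? Are they that desperate for money?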
  • “Why was the dashboard marked all green for Azure West despite our TAM and TAs informing us that the issue was ongoing? We rely on the dashboard to know what the health of Azure is, it seems like its more of a PR stunt now.”
    “There was an Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard. As a mitigation, we leveraged Twitter and other social media forums.”

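    Why did the dashboard still show everything as normal during the outage? Many people rely on it to tell whether Azure is having problems, but it failed them, and they only found out on Twitter.
    -- Because the dashboard service also depends on the storage service, it went down when storage went down. Microsoft then had to fall back on Twitter and other channels to publish updates.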
  • “The platform & services are really ingenious and I still trust in this, but you’ll have to improve your behavior much more than a 100% instead of the performance of services.”
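    Some customers still trust Microsoft's services, but... 100% reliability matters far more than things like performance improvements. (The engineer's mindset has to change.)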
  • “For comparison to this anemice blog here is Amazon’s response to a smaller outage they had.
    http://aws.amazon.com/message/…
    Its like a 8th grade English paper vs a doctorate thesis.”
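    Quite a few people also thought the blog post was poorly written, and that the statement Amazon published after a partial outage was far more professional. (Read it if you are interested; it really is much more professional.)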
  • “During the outage yesterday it wasn’t even possible to kill a server, which means it wouldn’t even be possible to pack up our things and move to a different provider.”
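    During the outage customers could not even kill their servers and switch to another provider... (All they could do was wait two or three days for recovery, which must have been agonizing.)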
  • It seems unless you pay them for a ‘subscription’, you cant report anything to anyone about something not working, because of a fault on MS/Azure side.
    I’m quite competent, and setup and run my own servers for many years, and I dont think I should have to ‘pay’ to let MS know about problems. I’ve talked to their support in ‘India’, and they seem to only have a very basic understanding of computing or anything, and really couldnt grasp what I’ve been trying to tell them about.
    I’m only running web small sites on Azure, but I would be a bit hesitant of running anything more, as it could just stop working, and then you have no one to talk to or help. 🙁 I feel helpless using Azure when there are problems.

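    People are afraid to use Azure because when something goes wrong they feel helpless and do not know whom to ask for help. (I do not quite understand what this person means by pay them for a ‘subscription’; is customer support charged separately?) Either way, judging from my own experience calling support the last time my Office 365 service had a problem, the sense of helplessness is very real.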
  • “the fact that this blog was NOT posted on the Azure twitter account, and the complete lack of interaction with customers on the twitter is shameful. We understand that shit happens, but the human side of this could have been handled much better. How hard is it to get a staff person to sit on twitter and answer peoples questions, provide updates, or at least some sign of life. We were hit at a TERRIBLE time and lost huge amounts of traffic. I really like Azure but the total lack of support and communication here really makes me question why I am paying for this service.”
    “This is a press-conference level event, and you should have staffed PIO positions to keep us in the loop. On a side note – You might want to apologize on twitter, and fire your marketing team who started in on promo tweets without even sending out a link to this blog (at the time of this comment…).”

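    These two complaints fall more under PR. When the incident happened, a message went out on Twitter and then nothing followed: no interaction with customers, no answers to questions, no timely status updates. (I also agree that something like this should be handled by a company-wide PIO, a public information officer; I do not know who Azure had posting on Twitter...) When people feel helpless, simply having someone there to talk to makes things a little better, and running a service is no different.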
  • “Actually the problem is down to centralization. There are plusses and minuses to that. When many elements in a system are dependent on one thing, it offers a lot of advantages in terms of shared resource, but puts that central thing on a very critical path. This is essentially why the central planning method of the soviet socialist era became too expensive to sustain. The original idea of the internet was decentralization. Instead the state-corporate machine has co-opted that into a centrally planned economy. Ironic really.”
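    This last one is more of a philosophical discussion, for anyone interested. The core and essence of the internet is equality (decentralization), yet cloud computing does exactly the opposite: it concentrates huge amounts of resources in a small number of data centers. That brings benefits such as lower cost and higher resource utilization, but it also makes the internet fragile, because once a central node has a problem, everything goes down. (One line from this commenter is rather interesting: the famous example of centralization going wrong is the failure of Soviet socialism...)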
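The point about a central component sitting on the critical path, and the reason the health dashboard went down together with storage, can be illustrated with a small dependency graph. The sketch below is only an illustration: the service names and the dependency edges are assumptions based on the description in this post, not Azure's real architecture, and `blast_radius` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each service, record who depends on it.
    dependents: Dict[str, List[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Hypothetical dependency map (service -> services it depends on).
depends_on = {
    "storage": [],
    "virtual_machines": ["storage"],
    "websites": ["virtual_machines"],
    "health_dashboard": ["storage"],
}

print(sorted(blast_radius(depends_on, "storage")))
# ['health_dashboard', 'virtual_machines', 'websites']
```

In this toy map a failure in `websites` affects nothing else, while a failure in `storage` takes down everything, including the dashboard that is supposed to report the failure.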

2 thoughts on “A look at what Azure users are complaining about”

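  1. The engineers do have a watchdog, but by the time they saw the watchdog alert and worked out what was going on, it was already too late. This issue was a nasty one: once storage went down, many VMs simply hung, and even rolling storage back afterwards could not save them.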
