Reddit had a six-hour outage caused by running a database on Amazon's cloud disk storage product, EBS (Elastic Block Store). EBS is both unreliable and doesn't flush writes to disk when told to. This led to corrupted data and disagreement between the master database and its slaves, leaving the slaves unusable while the master was down.
Modern SQL databases are not written to work correctly on hardware that lacks a reliable flush-to-disk operation. Correct functioning of a modern database absolutely requires that write-back caching can be turned off, so that when a disk reports a write succeeded, the write actually succeeded. See for example http://www.postgresql.org/docs/current/static/wal-reliability.html ; this is not at all unique to Postgres. I believe every SQL database requires committed writes to actually be committed to disk.
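To make the dependency concrete, here's a minimal sketch of the kind of durable write a database's write-ahead log code performs (the function name and file name are made up for illustration; real databases do this in C with more options, as the Postgres page above describes):

```python
import os

# A durable append, the way WAL code does it: write the record, then
# fsync so the OS -- and, if write-back caching is off, the disk --
# confirms the bytes are on stable storage BEFORE the commit is
# acknowledged to the client.
def durable_append(path, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)  # the entire crash-safety guarantee hinges on this not lying
    finally:
        os.close(fd)

durable_append("wal.log", b"commit txn 42\n")
```

If the storage layer acknowledges the fsync without actually persisting the data, every recovery assumption built on top of this pattern silently breaks, which is exactly the failure mode being described here.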
The law of leaky abstractions says that as we virtualize more, we can do things that appear to work but don't actually work exactly the way we think they do, and Murphy's law says the difference will eventually hit us with public downtime.
Row ID #770 - Bob submits an article about puppies. The master says it committed, so the data is sent to the slaves. The master lied: the data was actually sitting in a cache somewhere, and the write later fails on the master even though it succeeded on the slaves.
Row ID #770 - John submits an article about kittens. The master now has kittens, the slaves have puppies, dogs and cats living together, mass hysteria: Reddit is down for six hours migrating the master's data to new hardware and manually hacking rebuilt slave tables back together.
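The scenario above can be modeled in a few lines. This is a toy sketch with invented names, just to show how a write-back cache plus a crash produces divergent replicas:

```python
# Toy model: the master acks a commit from its volatile cache and
# streams it to the slave, then crashes before the write reaches disk.
class Node:
    def __init__(self):
        self.disk = {}    # what actually survives a crash
        self.cache = {}   # write-back cache: lost on crash

    def write(self, row_id, value):
        self.cache[row_id] = value   # "committed" -- but only cached

    def flush(self):
        self.disk.update(self.cache)
        self.cache.clear()

    def crash_and_recover(self):
        self.cache.clear()           # cached writes vanish

master, slave = Node(), Node()

master.write(770, "puppies article")   # master acks the commit...
slave.write(770, "puppies article")    # ...and replicates it
slave.flush()                          # slave's disk really has it

master.crash_and_recover()             # master's cached write is lost

master.write(770, "kittens article")   # row ID 770 gets reused
master.flush()

print(master.disk[770])  # kittens article
print(slave.disk[770])   # puppies article -- divergence
```

Once the replicas disagree about what row 770 contains, replication can't be trusted and someone has to reconcile the data by hand, which is the six hours of downtime.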
I don't know how this plays out with NoSQL systems and eventual consistency. Is Cassandra OK to run on EBS disks while Postgres is not at all? Reddit says their solution is to move to local EC2 disks; is that actually a solution, or does it just make the problem less likely because EC2 local disks are more reliable than EBS? Do they still do write-back caching?
Meanwhile, Netflix has pointed out that they have moved most of their functionality into the cloud. Specifically, nearly everything that scales with customers and streaming usage is now served from the cloud (although movies come from CDNs, not Amazon's EC2).
Netflix has posted some really interesting information about the testing they did on Amazon EBS: http://perfcap.blogspot.com/2011/03/understanding-and-using-amazon-ebs.html
And their lessons learned are a great place to start when considering working in the cloud at scale: Netflix's "5 Lessons We've Learned Using AWS".
The upshot is that scaling by working in the cloud brings a whole new set of challenges. You have to invest more in writing your software to handle hardware failure; you have to test failure scenarios more; you may have to go so far as redesigning network protocols to be less chatty, because shared systems give you unpredictable latency; and you have to expect problems when abstractions leak as layers of complexity are added to what used to be a simple operation like "write this to disk". If you're at the point where your hardware costs from scaling exceed your software development costs, or if you truly need to handle rapid customer growth faster than you can expand a traditional data center, it can make a lot of sense to tackle these challenges. But it's not a no-brainer, no-effort proposition: development and testing get harder as you switch to a larger quantity of less reliable resources and have to handle the new failure scenarios.
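One small, concrete example of "writing your software to handle hardware failure" is wrapping operations against unreliable resources in retries with jittered backoff rather than assuming a call succeeds. This is a generic sketch, not anything Netflix or Reddit has published; the operation and exception type are placeholders:

```python
import random
import time

# Retry a transient-failure-prone operation with jittered exponential
# backoff. The jitter spreads retries out so a fleet of instances
# doesn't hammer a recovering service in lockstep.
def with_retries(op, attempts=5, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return op()
        except IOError:
            if attempt == attempts - 1:
                raise          # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

In a traditional data center you might get away without this; on shared cloud infrastructure, transient failures are frequent enough that this sort of defensive code (and tests that inject those failures) becomes part of the baseline cost.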
Netflix uses SimpleDB, Hadoop, and Cassandra.