Top 10 Reasons to Avoid the SimpleDB Hype

There is a ton of chatter on the Internet about Amazon SimpleDB, Apache CouchDB, Google App Engine’s Datastore API, and other distributed key-value data stores. Their biggest perceived advantage is scalability: they can help eliminate the bottleneck imposed by single-server databases.

But the hype around these new databases is growing frantic. This morning I read an article by Todd Hoff which fawned over SimpleDB’s unconventional rules to such an extent that I thought it might be satire. There are some significant drawbacks to developing in this new database paradigm. In fact, many of Mr. Hoff’s supposed advantages are actually serious disadvantages to the paradigm. Before designing your architecture around a database engine like SimpleDB, it’s important to consider the reasons not to do so.

Most of my points are directed at the Amazon SimpleDB service, but many also apply to other databases like CouchDB and the Google Datastore.

1. Data integrity is not guaranteed.

Data stores like SimpleDB don’t support the same rigorous constraints that RDBMSes do. Some of these databases support single-row constraints, like requiring data in certain fields, but it is nearly impossible for these systems to enforce UNIQUE constraints and foreign keys.

Programmers can work around this by issuing extra queries to confirm an update is valid, but this requires a lot of extra work, and the check will never be perfectly accurate: it may be impossible to avoid race conditions when two clients simultaneously attempt conflicting updates. It’s especially difficult with SimpleDB because SimpleDB doesn’t guarantee that a client sees all the recent updates to the data.
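To make the race condition concrete, here is a minimal sketch in Python. The KeyValueStore class, its query_by_email and put methods, and the email_is_free helper are hypothetical stand-ins written for this illustration, not any real SimpleDB client API. The two clients are interleaved by hand so the outcome is deterministic.

```python
class KeyValueStore:
    """Toy in-memory stand-in for a key-value data store (hypothetical API)."""

    def __init__(self):
        self.items = {}

    def query_by_email(self, email):
        # Return the IDs of every item whose 'email' attribute matches.
        return [item_id for item_id, attrs in self.items.items()
                if attrs.get("email") == email]

    def put(self, item_id, attrs):
        self.items[item_id] = attrs


def email_is_free(store, email):
    # The "extra query" an application issues in place of a UNIQUE constraint.
    return not store.query_by_email(email)


store = KeyValueStore()
email = "pat@example.com"

# Both clients run their uniqueness check before either one writes.
client_a_ok = email_is_free(store, email)
client_b_ok = email_is_free(store, email)

# Each client saw no conflict, so each one goes ahead and writes.
if client_a_ok:
    store.put("user-a", {"email": email})
if client_b_ok:
    store.put("user-b", {"email": email})

duplicates = store.query_by_email(email)
print(duplicates)  # both items now carry the same "unique" email
```

With a UNIQUE constraint, the database itself would reject the second write; here nothing does, and the application only finds out afterward, if at all.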

2. Inconsistency will provide a terrible user experience.

Speaking of inconsistency, it’s critical to shield users from this property of SimpleDB.

SimpleDB is optimized for fast writes. Your API calls return as soon as the data is written to the SimpleDB service, but before it’s replicated across all of the SimpleDB servers. If you issue any queries before the data is propagated, you won’t necessarily see your most recent change.

When I save my changes in your web application, I expect that your system will show me a consistent view of those changes. If you show me the data that’s in SimpleDB, my changes might not appear, and I’ll probably get confused. In fact, I will probably freak out, thinking that you lost my data. You can try to inform me about how this works (“It will take a few minutes for your changes to be visible…”) but that’s not easy for users to grasp.
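The replication lag described above can be imitated with a toy model. This is only a sketch under simplifying assumptions: the EventuallyConsistentStore class and its put, get, and propagate methods are invented for illustration, and real replication is neither this simple nor triggered by hand.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def put(self, key, value):
        # The call returns as soon as the first replica has the data.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def get(self, key, replica):
        # A read may be served by any replica, including a stale one.
        return self.replicas[replica].get(key)

    def propagate(self):
        # Stand-in for background replication catching up.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.put("profile:42", "new display name")

read_before = store.get("profile:42", replica=2)  # replica 2 has not caught up
store.propagate()
read_after = store.get("profile:42", replica=2)   # now the change is visible

print(read_before, read_after)
```

The first read is the one that makes a user think the save was lost.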

3. Aggregate operations will require more coding.

SimpleDB does not support joins, sorting, or aggregate operations like GROUP BY and SUM/AVERAGE functions. You will need to implement these yourself.

Todd Hoff argues that this “suckiness” is a fair tradeoff:

SimpleDB shifts work out of the database and onto programmers which is why the SimpleDB programming model sucks: it requires a lot more programming to do simple things. I’ll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. We don’t even care how twisted and inelegant the code is because we can make it work. And as long as we can make it work we are happy.

I disagree. More boilerplate code distracts you from actually solving real users’ needs. Why reinvent the GROUP BY wheel when MySQL, PostgreSQL and Oracle have already perfected it?
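As a small illustration of the difference, the sketch below computes per-customer totals twice: once with a single GROUP BY statement, and once by hand in application code. SQLite (via Python's standard sqlite3 module) stands in for the RDBMS, and the orders table and its columns are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (customer, amount)
orders = [("alice", 30.0), ("bob", 12.5), ("alice", 20.0),
          ("carol", 7.5), ("bob", 2.5)]

# Declarative version: the database does the grouping and summing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()


# Imperative version: the application re-implements GROUP BY, SUM and ORDER BY.
def group_totals(rows):
    totals = defaultdict(float)
    for customer, amount in rows:
        totals[customer] += amount
    return sorted(totals.items())


manual_totals = group_totals(orders)
print(sql_totals == manual_totals)
```

This case is trivial, but every extra grouping column, filter, or join adds more hand-written code on the imperative side and only another clause on the SQL side.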

4. Complicated reports, and ad hoc queries, will require a lot more coding.

In my experience, database use falls into three broad patterns: (1) standard queries and updates performed by your application’s users; (2) more complicated reports for users and internal staff; and (3) ad hoc queries for troubleshooting and system monitoring. SimpleDB may be optimized for category 1, but categories 2 and 3 will be much more difficult without SQL.

Complicated reports are probably the best application of the SQL language. Because SQL is a declarative language, it’s incredibly easy to generate aggregate information about your data. In my previous jobs, our reports often required hundreds of lines of SQL to get the right information out of the database. This is a lot of code, but it was required to generate the data for our customers. Without access to SQL, your programmers will need to implement reports through imperative statements, which will dramatically increase the development time.

Ad hoc queries are even worse: they’re usually simpler, but they’re always changing. An RDBMS expert can often write an ad hoc SQL query as fast as the marketing department can explain what they need. Using an imperative programming language to write these queries would destroy your developers’ productivity.

5. Aggregate operations will be much slower if you don’t use an RDBMS.

RDBMSes are highly optimized for performing aggregate operations across huge volumes of data. Fast algorithms like the hash join, merge join, and indexed binary search have been around for 20 years or more. SimpleDB and the Google Datastore return datasets which are more like objects than traditional database rows. It’s unlikely that you’ll be able to process this data with anything other than nested loops, especially if your programmers aren’t database algorithm experts. Nested loop algorithms are considerably slower than the others.
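For readers who haven't seen the two approaches side by side, here is a rough sketch of a nested loop join and a hash join over in-memory lists. The users and orders data and both function names are invented for the example; a real database engine's implementation is far more sophisticated.

```python
# Made-up sample data
users = [(1, "alice"), (2, "bob"), (3, "carol")]          # (user_id, name)
orders = [(2, 15.0), (1, 30.0), (1, 20.0), (4, 9.0)]      # (user_id, total)


def nested_loop_join(users, orders):
    # Compares every user with every order: len(users) * len(orders) checks.
    joined = []
    for user_id, name in users:
        for order_user_id, total in orders:
            if user_id == order_user_id:
                joined.append((name, total))
    return joined


def hash_join(users, orders):
    # Build a hash table on one input, then probe it once per row of the other:
    # roughly len(users) + len(orders) steps.
    names_by_id = {user_id: name for user_id, name in users}
    return [(names_by_id[user_id], total)
            for user_id, total in orders
            if user_id in names_by_id]


slow = nested_loop_join(users, orders)
fast = hash_join(users, orders)
print(sorted(slow) == sorted(fast))  # same rows, possibly in a different order
```

With three users and four orders the difference is invisible. With a million rows on each side, the nested loop performs on the order of a trillion comparisons, while the hash join performs on the order of two million steps.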

Even if you’re the 31337est database expert and enjoy writing these operations in your business objects, there’s another performance factor to consider. In order for your application server to handle aggregate operations, you will need a copy of all the relevant data on the application server. Rather than downloading a single SUM function result from the database, your application server will need to download all the data required to calculate the sum. This extra data transfer will add considerable latency when you’re dealing with thousands or millions of records.

6. Data import, export, and backup will be slow and difficult.

Oracle, MySQL and other RDBMSes include advanced tools to perform large-scale data import and export operations. These tools have also been refined for 20 years or so, and can process millions of rows per minute. There are no such tools for key-value data stores, because these products are so new.

When you’re processing millions of records, network latency makes a big impact. Most of these services perform a remote procedure call for each record inserted; some even limit you to querying one record per remote call. On the Internet, round-trip latency is usually 20-40ms, which limits a single-threaded client making one call per record to roughly 1,500 to 3,000 rows per minute. (You can process more quickly via multi-threading, but again, that requires you to write a lot more infrastructure code.)
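The arithmetic behind that estimate is easy to check. The sketch below assumes strictly sequential calls, one record per call, and no server-side processing time, so real throughput would be somewhat lower. The rows_per_minute helper is just for illustration.

```python
def rows_per_minute(round_trip_ms, rows_per_call=1):
    # One minute is 60,000 ms; each call costs one full network round trip.
    calls_per_minute = 60_000 / round_trip_ms
    return calls_per_minute * rows_per_call


for round_trip_ms in (20, 30, 40):
    print(round_trip_ms, "ms ->", rows_per_minute(round_trip_ms), "rows/minute")

# Time to load one million rows at a 30 ms round trip, in hours.
hours_for_a_million = 1_000_000 / rows_per_minute(30) / 60
print(round(hours_for_a_million, 1), "hours")
```

At that pace a million-row import takes most of a working day, compared with a minute or so for a bulk loader that processes millions of rows per minute.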

7. SimpleDB isn’t that fast.

Todd Hoff’s article referenced a SimpleDB performance test which found that 10 record IDs could be retrieved in 141ms from a 1,000-record table; in 266ms from a 100,000-record table; and in 433ms from a 1,000,000-record table.

Compared to relational databases, this is pretty slow.

If you want your web application to be responsive, you need your database queries to operate much faster than this. Responses of around 20ms would be more in line with conventional databases. If you perform 3 SimpleDB queries in series against the million-record table, your web app will spend about 1.3 seconds (3 × 433ms) on that operation, and users will notice when the app is that slow. Many web applications actually make dozens of queries per request.
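Here is a quick check of the serial-latency arithmetic, using the three timings quoted above and the 20ms figure as a rough stand-in for a conventional database. It assumes the queries run one after another with no overlap, and serial_latency_ms is a helper invented for the example.

```python
# Timings quoted from the SimpleDB performance test: table size -> milliseconds
measured_ms = {1_000: 141, 100_000: 266, 1_000_000: 433}


def serial_latency_ms(per_query_ms, query_count):
    # Queries issued in series simply add up.
    return per_query_ms * query_count


three_queries = {rows: serial_latency_ms(ms, 3)
                 for rows, ms in measured_ms.items()}
conventional = serial_latency_ms(20, 3)

print(three_queries)
print(conventional)
```

Three queries cost between roughly 0.4 and 1.3 seconds depending on table size, against about 60ms at conventional-database speeds.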

Further, tables with a million records aren’t large enough to need significant scalability. A million-record table is probably small enough to fit entirely in RAM; surely its indexes could fit in RAM. The real test of SimpleDB scalability is its performance on a table with 100 million or 1 billion records.

8. Relational databases are scalable, even with massive data sets.

The world’s largest companies all use giant relational databases, and they’ve been able to make this work. The world’s largest websites use relational databases, and they’ve also been able to scale successfully. Facebook and LiveJournal use MySQL; MySpace uses Microsoft SQL Server; Salesforce.com uses Oracle. When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.

We all expect Oracle to scale if we pay them enough money, but even free databases have made significant advances to prevent the database server from becoming a bottleneck. The first line of defense is caching: eliminating repetitive queries can offload a massive amount of processing. Beyond caching, there are free clustering engines which let you balance your database requests around a few servers in a cluster.
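The caching idea can be sketched in a few lines. This example uses an in-process dictionary and an in-memory SQLite database purely for illustration. A production setup would more likely use a shared cache and would need an invalidation strategy, neither of which is shown. The articles table, get_title function, and stats counter are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO articles VALUES (1, 'Hello')")

cache = {}
stats = {"db_queries": 0}  # counts how often we actually hit the database


def get_title(article_id):
    # Serve repeat requests from the cache; only a miss reaches the database.
    if article_id in cache:
        return cache[article_id]
    stats["db_queries"] += 1
    row = conn.execute(
        "SELECT title FROM articles WHERE id = ?", (article_id,)
    ).fetchone()
    title = row[0] if row else None
    cache[article_id] = title
    return title


titles = [get_title(1) for _ in range(100)]
print(stats["db_queries"])  # 100 page views, one database query
```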

Even without a complicated clustering setup, your data can usually be partitioned across multiple servers to eliminate the single-server bottleneck. Lest you think I’m ragging on Todd Hoff, he’s written a nice overview of sharding, one way of designing a federated database to get around the bottleneck.
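One simple way to partition data is to route each record to a server based on a hash of its key. The sketch below assumes a fixed list of shards. The host names are hypothetical, and shard_for is an illustrative helper rather than part of any particular sharding library.

```python
import hashlib

# Hypothetical database hosts, one per shard
SHARDS = [
    "db0.example.internal",
    "db1.example.internal",
    "db2.example.internal",
    "db3.example.internal",
]


def shard_for(user_id, shards=SHARDS):
    # Hash the key so users spread evenly, then pick a shard by modulo.
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Every query for a given user is always sent to the same server.
placement = {user_id: shard_for(user_id) for user_id in range(1000)}
print(shard_for(42) == shard_for(42))
```

The catch with plain modulo routing is that changing the number of shards moves most keys to a different server, so real deployments often use a lookup table or consistent hashing instead. Queries that span users also have to be fanned out to every shard and merged in the application.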

9. Super-scalability is overrated. Slowing the pace of your product development is even worse.

Time-to-market is a critical factor for most software products. If you’re writing internal software for a business, budgetary concerns are equally critical. You can work around most of the drawbacks I’ve identified above, but it will cost you time and money.

More importantly, all these technical workarounds distract you from addressing the real needs of your customers. If you don’t focus on making something people want, it doesn’t matter how scalable your database is, because you won’t have any customers to fill up the database.

The hype around the new data stores seems to be a case of premature optimization, yet we all know Donald Knuth’s famous quote, “Premature optimization is the root of all evil.” Why not wait and address super-scalability once you’ve created a super product and have generated super cash flow?

10. SimpleDB is useful, but only in certain contexts.

Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers. I don’t think it was ever intended for that purpose. SimpleDB’s architecture is similar to Dynamo, Amazon’s internal “highly-available key-value store.” One of its main distinguishing features is the flexible schema: the ability to add custom fields to individual records, and to store multiple values in each field.

If you’re working with “semi-structured” data, then this is actually incredibly useful. For example, it’s an awesome way to persist web application sessions. You can avoid the overhead of marshaling the object-oriented session data into columns and rows, and many of the drawbacks above don’t apply because you don’t generally query sessions like you query more typical relational data.
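To show what the flexible schema looks like in practice, here is a toy model of session storage in which each item has its own set of attributes and an attribute can hold several values. The ItemStore class and its add_values and get methods are invented for this sketch and are not Amazon's client API.

```python
from collections import defaultdict


class ItemStore:
    """Toy model: item name -> attribute name -> set of string values."""

    def __init__(self):
        self.items = defaultdict(lambda: defaultdict(set))

    def add_values(self, item_name, attributes):
        # Any item can carry any attribute; no schema is declared up front.
        for name, value in attributes:
            self.items[item_name][name].add(value)

    def get(self, item_name):
        item = self.items.get(item_name, {})
        return {name: sorted(values) for name, values in item.items()}


sessions = ItemStore()

# One session holds a multi-valued attribute...
sessions.add_values("session-abc", [
    ("user_id", "42"),
    ("cart_item", "sku-1"),
    ("cart_item", "sku-2"),
])
# ...while another has a completely different set of fields.
sessions.add_values("session-xyz", [("user_id", "7"), ("locale", "en-US")])

print(sessions.get("session-abc"))
print(sessions.get("session-xyz"))
```

In a relational schema, the cart items would need their own table and a join, and the extra locale field would need a schema change. Here each session is simply looked up by its key.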

Amazon SimpleDB, Apache CouchDB, and the Google Datastore API aren’t bad products. But we do them a disservice when we construe them to be replacements for general-purpose databases. Used carefully, they can help your organization. But if you use them indiscriminately, you’ll create a lot more work for your programmers and you’ll make your application perform even worse.

70 responses to “Top 10 Reasons to Avoid the SimpleDB Hype”

  1. You nailed it.

  2. Pingback: Shattering news: non-RDBMS not a RDBMS at Pensieri di un lunatico minore

  3. I would suspect that the scalability solutions for e.g. MySQL are for mostly-read scenarios. I don’t think anybody can do linear horizontal scaling of heavy-write scenarios. Except Real Application Clusters.

    That said, Oracle does have an in-memory, key-value pair based system which is highly (5000-node clusters in production) and linearly scalable, can do aggregations over the entire grid, and works on objects.

    Oracle Coherence.

    Costs an arm and a leg, but solves all the issues raised in this article (yes you can do aggregations and SQL-like queries, and they are automatically run “in parallel” across the entire grid).

  4. We heard similar things during the MySQL vs. RDBMS saga, about how the sky would soon fall on everyone using MySQL instead of a real database. Still, MySQL is going stronger than ever. There is a lot of FUD on the net, as always. What’s missing is a few real-life stories of someone losing business because they chose a non-RDBMS solution where an RDBMS would have been preferred.

  5. >This morning I read an article by Todd Hoff which fawned over SimpleDB’s unconventional rules to such an extent that I thought it might be satire.

    You don’t miss much, do you, slick?

  6. So funny:
    “We all expect Oracle to scale if we pay them enough money…”

  7. Excellent points, all 10 of them! Thanks for the write up.

  8. > I’ll argue however that this is the kind of suckiness programmers like. Programmers like problems they can solve with more programming. (By Mr.Todd Hoff)

    I also strongly disagree with this. Yes, programmers do like solving problems but only new problems and challenges not those problems for which the easy solutions already exist.

  9. I thought that the scalability proposed by this model is not in the “number of records in the table” but in the “number of simultaneous reads”. As in having a lot of simultaneous users searching for an item, not as in having a couple of users searching through a lot of items.

    That said, I would’ve just stated one reason not to use this (#10: You almost surely don’t need it).

  10. How can you support shards while not supporting SimpleDB?
    Surely you can’t do a group by if you use shards?

  11. Anonymous Coward

    MapReduce and parallel execution solve a fair number of the arguments above.

  12. Using a Real Database (c)(tm) will solve ALL of the above problems.

    The polished turds from Amazon and Google are still turds, though shiny.

  13. Pingback: links for 2008-04-22 « Brent Sordyl’s Blog

  14. I love how you rdbms fanboys are scared out of your pants by the new wave of stuff. I love it.

    Here’s why you /should/ be scared. Google (and to a lesser degree Amazon) /already/ run on this new breed of DB. And their apps /work/. And their apps /scale/, massively. No stupid sharding or rdbms babysitting required.

    Sorry, but if you think traditional rdbms scale without problems, you either have 0 experience with large systems or you are being disingenuous.

  15. ha, and one more thing. Once you shard (as all the big-boy rdbms shops eventually have to do, e.g. youtube), most of your arguments go away, too: you no longer get automatic integrity, consistency, aggregate ops, etc. Once you shard, YOU HAVE TO DO A LOT OF CODING to make up for the INHERENT NON-SCALABILITY OF RDBMs.

    Sorry, you can’t get around it. MUCH BETTER to go in with the assumption that you’ll have these problems (because, hey, you are creating something that’s going to be successful, right?) and plan from day 1 to deal with them.

    That’s one of the beauties of couch/simple/bigtable — you can’t hide behind some empty promise from some big RDBM vendor. You have to face the truth from the start.

    And you know what? The truth isn’t so bad. It’s quite elegant, actually.

  16. You seem to have missed the point of these pieces of software. By structuring your application to use these new datastores, you don’t need to worry about fork-lift upgrades and all the other scaling problems of a traditional RDBMS as you get bigger and take on more load. It scales up just like the rest of your app: by adding more machines.

    Yahoo!, eBay, Facebook, etc scale their RDBMSs by doing the same thing that SimpleDB or BigTable do internally: by sharding the data down to finer and finer levels of keyspace as the number of machines grows. Except, with an RDBMS, this is a manual process. They also use read-only slaves to distribute read load, something also implicit in BigTable/SimpleDB/etc with their use of simple block replication without the concurrent use of erasure coding (e.g. RAID). RAID is also external to the RDBMS, requiring you to manage both disparately.

    Also, Oracle can scale up to 64 nodes at max with a clustered filesystem (this is somewhat old, it might be 128 by now). Google has ~650,000 machines in their clusters. This is 4 orders of magnitude difference. No one has enough money to pay Oracle to scale to this level. Yahoo! and Facebook have gotten MySQL to run on more boxes than this, but not in a cluster, so they (like you and everyone else) are stuck with the manual process of sharding and shard management.

    If you don’t expect to grow, by all means, continue to live in the RDBMS past. However, if your app is subject to possible rapid growth (e.g. a Facebook app, Salesforce app, GAE app, pretty much any Web-facing app) you should definitely be thinking about how to leverage SimpleDB/HBase/Hypertable/CouchDB/etc in your design. I see a lot of posts of this nature lately and they all seem to be coming from the initial shock of what you *can’t* do with these systems. Give them a shot and see what you *CAN* do with some semi-clever design and you might be surprised.

  17. Good entry, Ryan, and quite on mark. I had just come across a quotation of the article you quote in #3, and at the time I thought he was being sarcastic. I’m greatly saddened to think that he was actually being serious.

    I find it most interesting seeing all of the cheerleading for SimpleDB and similes by people who are quite evidently clueless about databases, so they embrace and flaunt their ignorance, using Google and Amazon as a “Big Daddy” of sorts, always ready to reference.

    Guess what, kids – you aren’t Google, Amazon, or Facebook. The chance that your web toy will ever be a fraction as popular as those sites is so vanishingly small that it is creating an underflow condition.

    Google has a very specialized database, and their needs are absolutely nothing like almost anyone else. Amazon likewise. Until the day that you build your own specialized database, an RDBMS is often a suitable choice.

    And the scalability ruse….extraordinary. From the numbers I’ve seen, these “scalable” database technologies need to be scalable because they’re such incredibly poor performers.

    Alas, everything old is new again. Here we have cheerleaders heralding the arrival of basically exactly what people did before real databases were invented. Hurrah for the past!

  18. “Guess what, kids – you aren’t Google, Amazon, or Facebook. The chance that your web toy will ever be a fraction as popular as those sites is so vanishingly small that is creating an underflow condition.”

    That’s exactly right. If you are creating a toy app, you don’t need to scale. If you are creating a toy app, you should use a toy db, i.e., an rdbms.

    Google is not the only one that needs a scalable db. /ANYONE/ who ever hopes to have > 100,000 users is going to start running into scalability problems and eventually face the reality of the SHARDING NIGHTMARE if they use an RDBMs.

    And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?

  19. Okay, it seems to have mangled my last post, so let me format slightly…

    +And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?

    Ho ho ho. Awesome stuff.

    Yeah, I guess making systems managing billions in funds just doesn’t cut into realm of the awesome systems that you make.

    You are simply delusional.

    +And if you are making something for 100,000 user sites do you have, jackson? Care to point a couple out?

    Now I presume you must mean 100,000 simultaneous users, because there are quite a few >100K user sites easily running on some shitty RDBMS (e.g. MySQL) on a low-end desktop PC. Slashdot, for instance, which was pretty much a worst case because they were caching nothing, and generating every request live from the database.

    Clearly you have needs far beyond /. in their heyday.

  20. Better still, you have needs beyond Slashdot in their heyday, and an apparently minuscule budget. My dev database server is a 16-core, 6-disk monster, serving up an unbelievable transaction load.

    Not good enough for jackson’s imaginary success story, though.

  21. troll_wrangler

    @jackson – “And if you are making something for < 100,000 users, you really probably ought to just stop now, shouldn’t you?”

    the answer you troll for is “Nope”. in fact, i’d say just the opposite. if you know before you start that your app will need upwards of a hundred thousand users to be useful to its audience, “you really probably [sic] ought to just stop now.”

  22. You definitely have some good points — but you’re also being unfairly harsh and not scoring any higher marks on presenting a balanced viewpoint.

    One obvious one that caught my eye: #7: “SimpleDB isn’t that fast” — Todd specifically pointed out (right in there with the performance numbers he was quoting…) that tools like SimpleDb are NOT fast. That’s not the point; they exist to address scaling issues.

    Some other lines you apparently considered fawning or possibly satiric:

    “If you have a complex OLAP style database SimpleDB is not for you. But, if you have a simple structure, you want ease of use, and you want it to scale without your ever lifting a finger ever again, then SimpleDB makes sense. The cost is everything you currently know about using databases is useless and all the cool things we take for granted that a database does, SimpleDB does not do.”

    That sounds an awful lot like what you’re saying in #10. But you start that off with “Everyone’s assuming that SimpleDB was designed to be a general-purpose replacement for OLTP database servers.”

    Sorry for the rant; I guess I’m just saying you clearly have some useful input to add to the discussion — just leave the straw man nonsense at home, please.

  23. Can someone remind me why application programmers should care so much about the integrity constraint checking afforded by relational databases? There is only a small set of constraints that can be checked without programming. And, from what I’ve seen, the only way to get decent error reporting in my application is to check all of the constraints myself, anyway.

  24. Last time I checked, Amazon *do not* use SimpleDB to power their online store. Rumours of Postgres abound …

    PS – do you know that your comment filter rejects valid email addresses such as root@localhost.localdomain? (at least, that’s where all my cron jobs send it … 🙂

  25. “When websites like Friendster have scalability issues, it’s not usually because of the RDBMS.”

    @toby has it exactly right. RDBMS _can_ scale, but at *significant* costs in both money and developer/sysadmin/DBA time.

  26. Pingback: warpedvisions.org » Blog Archive » SimpleDB, worth the effort?

  27. The hype has been pretty intense, and people’s perceptions of what SimpleDB et al. are useful for have grown pretty inflated. Most applications’ database components are made up of 3 distinct areas: (1) user-related data, (2) data that structures the user experience within the app, and (3) rapidly growing and dynamic data. I feel that 1 and 2 are best served by an RDBMS while 3 is a good fit for SimpleDB. Take YouTube. User data and user-created scalar data is tiny compared with video data and its associated data. If a video search is not fully optimised, no-one dies. However, a user does want to be sure that their favourites, channels, etc. are stable.

    Regards

    D

  28. First of all I’d like to note that the below comments are not about SimpleDB but rather to prevent FUD about document-based databases.

    1. Data integrity is not guaranteed.
    This could be the case with SimpleDB, but overall nothing prevents document databases from managing data integrity very well.

    Regarding the constraints, there is nothing that prevents defining validations in a document or its related “meta” document (this is pretty much how StrokeDB works — you can define your validations within meta document and they will let your document stay validated)

    More interesting are the concerns about conflicts. I’d say that this problem is hardly addressed in a common RDBMS approach. All you usually get is either user A’s or user B’s most recent update; there seems to be no easy way to do graceful conflict resolution. On the contrary, since the document database approach is rather novel, there is certainly enough room to adopt ways to deal with conflicts: for example, with different and configurable algorithms, like merging them slot-by-slot 3-ways, or even some special programmer-defined algorithms. I can hardly imagine how to do this sort of stuff with a traditional RDBMS in a relatively easy manner.

    2. Inconsistency will provide a terrible user experience.
    First of all, it should be noted that the described inconsistencies are also quite possible with distributed RDBMS setups: they too are constrained by a certain lag before the data is propagated through replicas.

    The actual problem is not with lag — it is more about leaving documents in a consistent state.

    This problem could be easily addressed in any kind of database, either relational or document-based.

    3. Aggregate operations will require more coding.
    Again, while this seems to be true for SimpleDB, other document-based databases address this problem pretty well with Views approach (CouchDB, StrokeDB [Views is WIP]) — so you can define any kind of aggregation, even such that are simply not supported by RDBMS.

    More at http://rashkovskii.com/articles/2008/4/26/top-10-reasons-to-avoid-document-databases-fud

  29. Pingback: rascunho » Blog Archive » links for 2008-04-26

  30. You are missing the point completely.

    Databases != RDBMS. RDBMS is but “one” kind of database. Then you have hierarchical, object-based, document-based, etc.

    SimpleDB is but one kind of non-RDBMS database. There are use cases that fit RDBMS, and there are use cases that make RDBMS cry. That’s where SimpleDB or other alternatives get into the game.

    Just as simple as that. When all you know is a hammer, all your problems are nails.

    This article is just wanting more traffic by generating FUD to newbies.

  31. I couldn’t agree more with AkitaOnRails, you are comparing totally wrong stuff, this shows in a lot of situations a lack of experience.

  32. The points in this article are true, but not reasonable or relevant, and more importantly highly imbalanced. Hence I am sorry to say -> FUD.

  33. Google scales because it can afford the SKUs for storing their data and hence can throw machines at resolving scaling problems. However, if you need to run Oracle or any other RDBMS, you will have to empty your wallets to scale up. Most online applications do not require the zillion RDBMS features that are not optimized for the characteristics of typical online apps which are more read heavy. I should also point out that I have unfortunately seen people ridiculously normalize their schemas even for read heavy apps when they could have easily spent more on writing multiple times.

  34. Pingback: Англоязычные ссылки с комментариями. Базы данных | Иван Бегтин

  35. Pingback: ivbeg: Англоязычные ссылки с комментариями. Базы данных

  36. Pingback: Will document databases make an impact? at Thinking Outloud

  37. Pingback: Blue Marble Blog » Blog Archive » Sexy Databases - yeah, that’s what I said…

  38. Pingback: Recent Links Tagged With "simpledb" - JabberTags

  39. This article is pretty much on target.. I tried to use SimpleDB as a persistence layer for C# classes… the idea was to use an attribute to store the xml for the class… Couldn’t do it because of the 1k limit per attribute.

    This is actually a big database limitation. My application has a lot of places where people can leave comments.. and 1k is too small. Think about a long email message.. it could easily go over 1k.
    That means you have to split a field into multiple chunks…. 😦

  40. Thanks for the info. May God have mercy on us all.

  41. Carole J Takeshita

    great article!, grats for u site 🙂

  42. Jillian J Monahan

    your blog is really great! 191

  43. Patricia W Olson

    Your blog is excellent! I send you 155 congratulations!

  44. > My application has a lot of places where people can leave comments.. and 1k is too small.

    You didn't spend too much time researching SimpleDB then. You can store pointers to larger data objects stored in S3 if you need more than 1k.

  46. Pingback: Experimenting with SimpleDB (Flagthis.com) « Scalable web architectures

  47. Pingback: sql server cluster

  48. Pingback: Amazon S3, EC2, SimpleDB - Please enlighten me!

  49. Pingback: User links about "simpledb" on iLinkShare

  50. “You didn't spend too much time researching SimpleDB then” Ummm. yeah lets also add the s3 goodness for something thats easily handled by a rdbms. Do you work for amazon?

  51. Pingback: Websites tagged "simpledb" on Postsaver

  52. Very good stuff.

  53. Amazon agrees, they now offer hosted mysql (one could do it before with your own image) with better support than your hosted version.

  54. 10000000% correct, you are the man, i came to those conclusions the hard way, i wish i could have found this post, all my work went in the dustbin when the website was launched, simpledb was the cause of its failure 😦

  55. Pingback: Top 10 Reasons to Avoid the SimpleDB Hype › ec2base

  56. Pingback: SQL and NoSQL - the rant continues | Prajwal Tuladhar's Blog

  57. I think I can agree with your conclusion. Nosql is not a must. If a certain solution is easier to implement in relational and that satisfies your performance needs, use relational. Problems arise when you're trying to square the circle. When your application needs aggregate functions or joins, nosql is not the way to go. When your app doesn't need a lot of personalization in terms of aggregation (say, statistics based on user preferences) you can just reverse the problem, and generate and update aggregated data as they come in.

  58. Pingback: Linktipps Februar 2010 :: Blackflash

  59. #1. RDBMS provides some rather rudimentary capabilities for enforcing integrity. In any application, invariants are enforced through the application logic, not the database. There are definitely ways around it, like triggers that can enforce this, but nothing that a NoSQL DB can't have. CouchDB allows for validation functions that can enforce such integrity. Either way, in most apps invariant enforcement happens on both ends and is superfluous. Invariants should be enforced in a single place, and an RDBMS doesn't have as much power as a Turing-complete language to do so.

    #2. That's completely application dependent. Consistency usually comes with performance tradeoffs. Some apps would much rather accept the write and then reconcile it later (asynchronously) than force the user to wait and block until such a transaction can be completed, given all the lock contention that can happen in a high-traffic environment. Again, it's very application dependent, but what you're describing is called “Eventual Consistency”; you can read more here: http://queue.acm.org/detail.cfm?id=1466448. RDBMSes themselves are not necessarily the bottleneck; rather, it's ACID transactions which are. Every high load/concurrency app eventually trades transactional semantics for performance. It's up to your application to figure out how to reconcile temporal inconsistencies, if at all. If you still think it's that important, just ask eBay. They don't use transactions and they utilize the eventually consistent model. What does that mean? It means that there might be a very minuscule chance of some inconsistency and they might have 1 out of 100K spooked customers, but who cares: they've just achieved 99.999% customer satisfaction, as opposed to many unsatisfied customers if their system is somehow unavailable or slow.

    #3. Probably, but the same is true when you shard a SQL database. Also, through map/reduce operations, aggregates are actually more natural, and some key/value DBs now provide such operations. The great thing is that map/reduce is completely abstracted from distributed database semantics, so you can easily distribute this operation over numerous remote nodes.

    #4. Not sure what you mean. Maybe they'll require non-SQL coding, but not necessarily “more” coding. And the code might actually be shorter, as in some languages you use a Turing-complete language to apply predicates instead of working within the limitations of SQL and its underlying relational model. Again, it's true that RDBMSes are more suited for reporting at this time, mostly because they've accumulated lots of experience over the years. On the other hand, if reporting is not a huge part of your system and you want it in a relational database, just replicate. Data warehouses do this all the time to optimize/denormalize data for reporting, as reporting on highly normalized data is also very inefficient.

    #5. I think that's a side effect of the network. Yes, it's true that RDBMS algorithms might have been more optimized over the last 20 years, but that doesn't stop NoSQL DBs from getting faster. Basically, there is no theoretical reason for a “local” NoSQL query to be any slower than an RDBMS. It's all in the implementation details of that DB: storage structure, search algorithms, etc. Because SimpleDB is distributed, the side effect of aggregate functions is “more latency”. The same side effect would be true in a sharded RDBMS.

    #6. RDBMSes might have more tools now, but again, that's not a reason to disqualify the benefits of a different storage model. Also, have you looked at CouchDB's replication? Talk about fast and easy.

    #7. See the #5 response. Also, SimpleDB wasn't designed to be fast per se, although that wouldn't be a bad feature. It was designed to be linearly scalable, so you might incur some latency, but that latency should stay constant with increasing load.

    #8. Nonsense. There is a sweet spot for RDBMS systems, and I use them in any application which has a requirement for such and/or where the data fits into the relational model. I hate having to fit square pegs into round holes, like storing highly dynamic and hierarchical data in a relational database, either using a convoluted normalized model or using skinny tables, which defeat the purpose of the relational model completely. Also, you mention Facebook, MySpace, etc. Yes, they use an RDBMS engine behind the scenes, but if you look at their storage model, they are using it just like a key/value store. Basically, they're not benefiting from any features that RDBMS systems are good at.

    #9. Agree, you don't want to make that the top priority unless you have to. But I think many people read “don't optimize prematurely” as a ticket to forgo such activity. That's the worst thing you can do. I faced that personally: you only worry about features and not the long-term requirement changes/scalability, and then your system fails in production due to an unexpected load which you never accounted for and/or thought your architecture could handle. What then? Well, besides making up excuses and trying to salvage any of the relationships that still exist with users, you're up for 2 weeks straight, not sleeping, doing what you should have done upfront. That's like building bridges and only being able to handle 5 cars on a bridge at a time, because in this rural community we'll never have traffic. Disastrous results await.

    #10. No shit, everything is useful only in certain contexts.
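
The map/reduce point in comment 59 (#3) can be illustrated with a small sketch. It assumes nothing about any real store: the `shards` list is made-up data standing in for records held on separate nodes, and `map_shard` and `merge_counts` are hypothetical names.

```python
from collections import Counter
from functools import reduce


def map_shard(records):
    """Map step: each node counts its own records per category."""
    return Counter(r["category"] for r in records)


def merge_counts(left, right):
    """Reduce step: combine two partial results; the merge order does not matter."""
    return left + right


# Hypothetical shards standing in for data held on separate nodes.
shards = [
    [{"category": "books"}, {"category": "music"}],
    [{"category": "books"}],
    [{"category": "music"}, {"category": "books"}],
]

# Each partial count could be computed remotely; only the small results travel.
partials = [map_shard(shard) for shard in shards]
totals = reduce(merge_counts, partials, Counter())
```

Because the merge is associative and commutative, the partial counts can be combined in any order, which is what lets the work spread across nodes.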

  60. Valid points you make. These new data stores all remind me of Tandem's Enscribe filesystem from the 1980s. This system is still in use for very limited uses that need low latency (e.g., trading systems), but NonStop SQL took over for more complicated uses. Keeping the DB functionality on the server reduces code complexity and duplication of code. Your most valid point, though, is that fewer bytes of data must be transferred if you push the functionality down to the server. This saves enormous time and is a basic function of a true database server, as opposed to just a datastore.

  61. Interestingly, the most transaction-intensive websites in the world try to avoid consistency checks in their RDBMS to achieve scalability and performance. Check out eBay's architecture documents for reference. Although an RDBMS provides a good way to structure data, it certainly seems to fall short when it comes to reaching a billion or more transactions per day. A better approach to programming is to design for fault tolerance, asynchronicity and redundancy, and SimpleDB fits well in that thought process. Another plus for Todd Hoff’s argument is that this way developers are not tied to the algorithms of a particular database and are free to implement ones that are more appropriate for their individual needs.

  62. When I was an engineer at Yahoo, I lucked out and sat in the same cube block as all the senior engineers. All of them were from database companies: Oracle, Informix, Sybase… One day I asked how to get a production MySQL server provisioned for a project I was working on, and I was immediately denied.

    These guys knew so much about relational databases, that they had an instant awareness of when they *shouldn’t* be used.

    At Yahoo we used a home-brew database that sounds very much like SimpleDB, completely unstructured and unenforced, and it taught me everything I needed to know about choosing the right DB tool for the job.

    But we also had some massive Oracle systems, for stuff like data mining and the directory.

    But when it came to serving user pref data at page time, hitting an SQL query was absolutely verboten.

    This is why you see companies such as Digg run into 5-year scaling nightmares with MySQL. All you need to know about MySQL is that requirements are unique: it can scale some stuff OK, some stuff not at all, and some stuff you can scale, but living with the implementation will make you cry like a little girl.

    My point is that none of these databases are ‘sucky’; they just fulfill different sets of requirements.

    Much more often than someone choosing one of these for simple stuff, you will see developers just try to cram every data problem into MySQL.

  63. Stuff like App Engine and Bigtable offers massive leverage for someone like me.

    I'm prepared to jump through certain hoops in using Bigtable (mostly the hoops associated with not being able to use joins) because, working on my own pet project as the only developer with a minuscule budget, a bit of design/development tomfoolery is something I CAN accomplish.
    Purchasing an RDBMS cluster and learning how to design and maintain it is not something I can accomplish.
    It's a simple equation for anyone in a similar situation.

  64. Good discussion. In general I think it’s true that people get excited by how cool this stuff is, but forget they’re not Amazon or Google. Keep in mind that Google and Amazon didn’t launch on this technology either; they evolved it to deal with their massive scaling problems. If I were in any danger of getting that kind of scaling problem, I would hire the world’s brightest rocket scientists to solve it for me, and have the progress reports sent to me at the swimming pool on board my luxury yacht. Back in the real world, time to market is key, so I’d use an RDBMS for most jobs: a free one if I had no money, or a cheap one (SQL Server) if I didn’t have much. At the point where throwing money at Oracle came into view, I would consider one of these NoSQL options, if appropriate. As noted, I might well solve some of my problems with NoSQL, other problems with an RDBMS, and pay Oracle to solve still others. No reason to enforce a one-size-fits-all policy, because it doesn’t.

  65. You guys all rock!
    I love all the information in the comments!

    My personal conclusion is that we are building businesses here. Constraints on funds create tighter deadlines, and the DB that is used should reflect that constraint.

    Ask yourself: is this a “toy app” (marketing widget, etc.) or potentially the next site handling 100K requests per second or more?

    Ask yourself: if you use SimpleDB, will dev be so heavy that you never get off the ground?

    What are your options? Is it one or the other? Can you devise a plan that uses a central, super-lean RDBMS and then leverages SimpleDB for the parts of the data that do not require the advantages of an RDBMS?

    It's not black and white. It depends on your situation, and this article with its comments has been great for assessing my own needs and limitations.

    thanks guys

  67. OK, after facing multiple problems with SimpleDB for over 3 years, we are migrating back to MySQL. Here is a list of some of the problems we faced:

    1. No backup and restore for SimpleDB: we wrote our own tools, but they are very slow and not scalable.
    2. No easy way to run a SQL-style query, even for a simple COUNT operation. We wrote code to do this and ended up using 4500 CPU hours to run a simple count query over just 500000 records!!
    3. The 10GB limit: always bothering us.
    4. Very frequent errors accessing records: not acceptable, even though it recovers immediately.
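
To show why a hand-written count (item 2 in comment 67) can get expensive, here is a rough sketch of counting by walking paged results. It is a guess at the general shape of such code, not the commenter's implementation and not the SimpleDB API: `PagedStore`, `fetch_page` and `count_all` are hypothetical names, and the store is an in-memory stand-in.

```python
class PagedStore:
    """Hypothetical stand-in for a store that returns results one page at a time."""

    def __init__(self, items, page_size):
        self.items = list(items)
        self.page_size = page_size
        self.requests = 0  # number of simulated round trips

    def fetch_page(self, token=0):
        # Returns (page, next_token); next_token is None after the last page.
        self.requests += 1
        end = token + self.page_size
        page = self.items[token:end]
        next_token = end if end < len(self.items) else None
        return page, next_token


def count_all(store):
    """Counts records by walking every page, one round trip per page."""
    total, token = 0, 0
    while token is not None:
        page, token = store.fetch_page(token)
        total += len(page)
    return total


store = PagedStore(range(1000), page_size=250)
total = count_all(store)
```

In this shape the cost grows with the number of pages fetched, since every record is transferred to the client just to be counted.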
