The DynamoDB Book: Begin Book Club
by Simon MacDonald
@macdonst
We had a great time at our first Begin Book Club meet-up with special guest star Alex DeBrie, the author of our first book selection, The DynamoDB Book. Sadly, our recording had serious audio problems, and our backup recording had no audio. Besides learning a lot more about DynamoDB, we also realized we need a backup to our backup recording.
The transcript of the hour-long book club follows. Also, there are instructions on how to join the Book Club at the bottom of the post.
Enjoy!
Transcript
[00:04:20.13] SIMON: Why don’t we get started. Thank you everyone for joining the first book club that we’re hosting here at Begin, and we have to thank Alex DeBrie, the author of The DynamoDB Book, for joining us. I’ll do the housekeeping before we get started. This is the Architect community, so anything that happens here falls under the Architect community’s code of conduct, and we will be enforcing that. I don’t expect there to be any problems, but I thought I’d say it just in case. As well, if you really like this, let us know on Twitter or reach out to us on Discord. We plan on doing another one of these in a couple of months, so we’re also looking for suggestions for the next one. And without further ado, I’ll turn it over to Alex, who is, I believe, giving us a dramatic reading of a passage from his book.
[00:05:13.14] ALEX: It’s true, I will, on request, give a dramatic reading. I have a rare first edition of the DynamoDB book here that I picked up this morning. This is the only one in existence, unfortunately.
[00:05:27.05] BRIAN: It’s so huge.
[00:05:28.20] ALEX: It is huge, it’s a Robert Caro, The Years of Lyndon Johnson biography. All right, here we go. Dramatic reading. Ahem.
"When modeling for a relational database, there is basically one correct way to do something. Every problem has a straightforward answer. Data modeling with an RDBMS is like a science—if two people are modeling the same application, they should come up with very similar patterns. With DynamoDB, on the other hand, there are multiple ways to approach the problem, and you need to use judgment as to which approach works best for your situation. DynamoDB modeling is more art than science—two people modeling the same application can have vastly different table designs.”
[00:06:12.09] SIMON: Wow.
[00:06:12.23] ALEX: That’s in Chapter 10.
[00:06:14.02] SIMON: That’s some really powerful stuff. Coming straight from the spiritus mundi?
[00:06:18.25] ALEX: It is.
[00:06:20.13] RYAN BLOCK: Do we have some canned applause that we can play now?
[00:06:22.09] SIMON: We’ll fix that in post, Brian.
[00:06:25.14] ALEX: Okay, perfect. Thank you for indulging that.
[00:06:30.29] SIMON: That’s awesome. We’re happy to have you do that. Yeah, so I mean, there’s a lot of people here from Begin, and a lot of the Architect community as well, and I know that the folks from Begin are all just huge fans of the book. It was very helpful when we were writing the @begin/data package. Brian, I was wondering if maybe you could kick things off, get things going, ask Alex a question, or share some comments on the book?
[00:06:59.27] BRIAN: Yeah sure. I can’t gush enough about this book. It was definitely required reading as far as I’m concerned, here in the AWS community. At some point in the past, DynamoDB probably had a high degree of what-the-fucks per minute, and unfortunately the Amazon docs don’t do a whole lot to fix that, so I’m glad you did. I have a bunch of different kinds of questions I wanna dig into with you. I had the Moderna booster yesterday, so I’m not 100%. But one thing that keeps coming up in my Twitter feed these days is that Rick Houlihan now works for Mongo, so should we all be using that? Which is, obviously, I’m being funny. I love Rick Houlihan and think you two have had some of the best re:Invent sessions I’ve actually watched, so for those playing along at home or listening to this in your future internet, definitely check out the re:Invent sessions with Rick Houlihan and Alex, ’cause it’s a master class in data modeling for wide-column stores and there’s just so much there. Rick recently jumped ship and went back to his home planet of MongoDB and the key-value stores. Mongo’s really popular and has stuff going for it, so I’d love to get your impressions on that.
[00:08:55.10] ALEX: Yeah, sure thing. First of all, thanks for the kind words, on both the book and the re:Invent sessions, appreciate that. And yeah, I love Rick Houlihan, and I owe a lot to him, just in terms of learning from him, you know, watching his sessions before messing with Dynamo, but also just being willing to help out with things. And obviously he knows a ton; he basically helped migrate Amazon.com in the US, all these internal things, off Oracle or MySQL or whatever onto DynamoDB, so he knows how a lot of that stuff works at a really big scale. It’s interesting, the switch. I was definitely bummed to see him go. I’d be curious to see how it plays out. You know, I always tell people that Dynamo and Mongo are very similar underlying-technology-wise, but they just have completely different philosophies, which makes them feel very different. So I would say Mongo is libertarian and Dynamo is authoritarian. Mongo will let you do whatever you want, right: it’ll let you sort of use joins, it’ll let you do aggregation queries, it’ll let you do spatial indexing, all sorts of things. And if you use it correctly, or if you’re at a small scale, that’s not going to matter. But I think where people can get in trouble is, they start off early and it feels really fast, they’re doing things easily, and then all of a sudden they’re scaling up and they’re not getting the same sort of scalability I think you get out of Dynamo, because they’re using things that don’t match the underlying technical model of how it works. Whereas Dynamo is just not gonna let you do that; you have to do it this way, you have to match our model exactly, and you get some really nice benefits out of it: you get predictability. Marc Brooker from AWS wrote a nice post on Dynamo today about how he loves its predictability, and that’s, I think, so underrated when you’re first getting started with Dynamo, and then it becomes something where it really makes me nervous to go back to other databases now, because I can’t predict how they’re going to work when, you know, this goes to 100 times the scale that we had initially, whatever. So just, you know, if you want that predictability, know that’s a big thing, and you can’t quite get that with Mongo. I’d say the big downside of Dynamo’s approach is it’s harder on Day 1 to pick it up, or it’s harder if your access patterns are changing and you need a little more flexibility; Dynamo’s going to feel more restrictive that way, but it’s gonna help you in the long run, I think.
[00:11:26.25] BRIAN: Yeah. The familiarity thing comes up so often, and it really is a bit of a bummer, because at a certain point in your career I feel like you stop caring about what’s familiar and start caring about the semantics. Familiar is a nice way of saying “aesthetics I’m used to,” just subjectively something I’ve done in the past or done before. And relational comes up all the time as flexible, which always baffles me, because you need to know the schema there too; it’s the query patterns that are flexible. But I want to dig into that a little bit and talk about the weird familiarity things, like, how did you get over it? I know when I first saw single table design, that was tough, and eventually it was, well, actually, I want to use it and suffer the consequences of it.
[00:12:19.12] ALEX: Yeah, yeah, yeah. I think the nice thing about relational, right, is you sort of learn how to improvise your data model, and as you get new queries, it’s just a matter of writing the new query with the different joins, maybe adding some indexes, and it feels very flexible to do that down the road and make changes. You know, maybe it’s going to cost you more in certain ways, certain hidden ways that often don’t come up till later, but it feels very easy: I have this index and it’s going to work super nicely, that’s nice. Whereas Dynamo, you know, it’s very much, unfortunately, you use your primary key one way, and when you have a new access pattern where that primary key is not gonna work, or the secondary index is not going to work, it makes it more complicated to do that sort of thing. Or even doing simple aggregations or different things like that can be tough with Dynamo, so I think that’s the hard part, the lack-of-flexibility part. And also just really having to focus on some of these new patterns. I think relational is more complicated than people give it credit for; it’s just that most people have learned it already, so they’ve already had that initial learning curve. The nice thing about Dynamo is there are like 5 concepts you need to learn, and you’re getting direct access to the data structures themselves; if you just learn those 5 calls, you can do whatever you need, you just have to learn them and sort of restructure your data from the relational model. You know, I think the Day 1, 2, 3 experience with Dynamo is worse, ’cause you’ve got to learn a bunch of new stuff to work through. The Day 4 to whatever experience is a lot better, because you just get that predictability, maybe fewer operations. Like, sometimes teams ask me, hey, what operational signals should I be watching to make sure I know when an issue’s coming, and it’s like, honestly, you know, not much. Are you close to your throughput? In which case, if you just increase your throughput, good. You know, maybe watch if you have a lot of transaction conflicts, which can happen, but you’re just not gonna have the same sort of leading indicators of problems like you would with PostgreSQL or MySQL.
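For readers who want to see that handful of calls concretely, here is a minimal sketch in Python with boto3; the table name and key attribute names are made up for illustration.

```python
# A minimal sketch of the core DynamoDB calls, using boto3.
# The table name and PK/SK attribute names here are illustrative.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")

# PutItem: write (or overwrite) a single item
table.put_item(Item={"PK": "USER#alice", "SK": "PROFILE", "name": "Alice"})

# GetItem: read a single item by its full primary key
item = table.get_item(Key={"PK": "USER#alice", "SK": "PROFILE"}).get("Item")

# UpdateItem: modify attributes of an existing item in place
table.update_item(
    Key={"PK": "USER#alice", "SK": "PROFILE"},
    UpdateExpression="SET #n = :name",
    ExpressionAttributeNames={"#n": "name"},  # "name" is a reserved word
    ExpressionAttributeValues={":name": "Alice B."},
)

# Query: fetch all items sharing a partition key
items = table.query(KeyConditionExpression=Key("PK").eq("USER#alice"))["Items"]

# DeleteItem: remove a single item
table.delete_item(Key={"PK": "USER#alice", "SK": "PROFILE"})
```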
[00:14:35.03] BRIAN: Yeah, the scale’s definitely really nice once you get those patterns baked. Sometimes single table design makes sense and sometimes it doesn’t. Dynamo doesn’t even care, put anything you want in it, it’s fast. But sessions don’t make sense for single table because–
[00:15:22.22] ALEX: Yeah.
[00:15:23.11] BRIAN: I want sessions to expire and go away. Not be backed up.
[00:15:29.29] ALEX: I totally agree, and I’m not quite as much of a single table purist as Rick is, I would say. I think the big thing is really access-patterns-first design, where you’re thinking about access patterns, and you’re also using sort of the generic primary keys and secondary indexes, things like that. And then once you’ve designed that, you can decide to put it all in one table, or you can decide to put it in different tables. It’s more a decision of, A, do I want to be paying for capacity on a couple different tables, or do I have a set of data, like sessions, that’s totally separate and doesn’t need to be in the same table? Do I have different streaming requirements for different sets of data? You can think about things like that, rather than the data itself. But, you know, it’s all going to work in the same table if you’re using generic primary keys and patterns.
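To make “generic primary keys” concrete, here is a hypothetical item layout; the entity shapes and key conventions are illustrative, not taken from the discussion.

```python
# Hypothetical items using generic PK/SK attributes. Because the keys are
# generic, these entities work equally well in one shared table or split
# across tables.
user = {
    "PK": "USER#alice",           # partition key: entity type + id
    "SK": "PROFILE",              # sort key: item type within the partition
    "email": "alice@example.com",
}
order = {
    "PK": "USER#alice",           # same partition as the user, so
    "SK": "ORDER#2021-12-01#42",  # "user plus their orders" is one Query
    "total": 129.95,
}
session = {
    "PK": "SESSION#abc123",       # sessions are never fetched with other data,
    "SK": "SESSION#abc123",       # so they can live in their own table and
    "ttl": 1700000000,            # expire via DynamoDB's TTL feature
}
```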
[00:16:20.05] SIMON: And as somebody who spent most of their career trying to avoid relational databases, it helped that I didn’t have to unlearn quite so much stuff to get into DynamoDB, so I appreciated that. Fil Maj is also here, experiencing a snowstorm in Toronto. Fil, do you have any questions you want to ask Alex?
[00:16:41.09] FIL: Hey, so far so good, not too bad. We’re staying indoors though, so we haven’t dealt with all that snow. Alex, thanks for joining, really appreciate this. Just want to mirror what everyone else has said, like really incredible book, helped me a ton. I recommend it every chance I get to anyone who’s curious about Dynamo so awesome job on that book.
[00:17:03.29] ALEX: Thanks, I appreciate that.
[00:17:04.29] FIL: Questions, yeah, for sure. I’ve got two kind of going around in my brain. It’s been a while since I’ve reviewed the book, so maybe you mentioned it, but one thing I find helpful in my discussions, trying to get people to embrace learning Dynamo, trying it out, playing with it, seeing how it changes the experience of building all different kinds of apps, is talking about what it’s not good for, the use cases that are difficult to tackle with it. So I don’t know if there are any off the cuff in your head, like, you know what, maybe it’s better to go for this other kind of database technology than Dynamo for these particular cases. Do you have any examples off the top of your mind, or do you use Dynamo for absolutely everything no matter what?
[00:18:08.07] ALEX: No, I would say Dynamo’s good for like 80%, probably even more, of major production use cases, OLTP stuff particularly. And then there are different levels of things it’s not great at. The two I’d say it’s really not great at: one is gonna be full text search, which is almost always going to be Elasticsearch, or maybe Algolia now. And the second would be aggregations; it’s not going to be good at analytical queries in general. If they’re big, flexible analytics queries, those are going to be data warehouse type things, OLAP-type systems, but those are usually slower queries anyway. So those would be the two big ones. The next one I would say, it’s not completely out of the realm, but it can be difficult, would be complex filtering. So if you have a batch of data that you want to be able to filter by, you know, up to 10 different fields, all of which can be optional, right; like if someone goes to a CMS and they have 200,000 people they want to filter by this thing, this thing, this thing, and it can be sort of whatever they want. That can be tricky to do with Dynamo. It wants more predictable access patterns, and predictability means you’re giving me these three fields and I’m giving you the right thing back, right? Arbitrarily filtering on two of the 10 fields, or 8 of the 10 fields, that’s trickier. So that’s a tough one. Those are the big ones, I’d say. Other than that, you know, most things can work pretty well. The only thing I would say is, hey, relational databases are pretty good at performance now, so if you wanna do relational and you have a small amount of data where it’s not going to matter, and you’re just willing to pay for a big enough box, that’s fine with me. But I would say Dynamo can do almost all the things that a relational database can do.
[00:20:14.11] BRIAN: How do you paper over this stuff? Do you just, like, stream to the other data sources, or?
[00:20:24.09] ALEX: Yeah. I would say for search particularly, I mean, honestly for all three of those, Elastic is often what people pair it with; you can really use it for all three of those. I’d say Elastic is at the completely other end of the spectrum in terms of predictability, in my experience. It’s just a complete black box; I have no idea what factors affect indexing speed or query speed, or what will make it just blow up and bomb, it feels like it just happens. So that makes me nervous. But whether you want to use it for internal analytics, or for full text search, or complex filtering, I think it will do all of those. The biggest thing I recommend to people is to really narrow what Elastic is doing. I think a lot of times people stream from Dynamo into Elastic, put all their data in there, and Elastic starts chewing up more and more queries ’cause it feels easier to people; they’re like, “Hey, we have this new access pattern,” and rather than re-index in Dynamo, let’s just use Elastic. And now you have Elastic as a hot path for a lot of different things, and your cluster’s getting huge. Whereas, hey, if you just stream small bits of records in, to allow you to do the complex filtering you want and then go back to Dynamo to get the canonical record, or just to do full text search on those small fields, Elastic’s going to be easier to scale that way. If instead you have a terabyte of data and you’re trying to do all sorts of things, plus your internal BI team is running analytics on it, then you’re just going to have a bad time. So I’d say Elastic is a common one. Another one that’s pretty interesting, I think, is Rockset, which is a SaaS; they hook into all sorts of databases via change data capture. So you hook up your DynamoDB table, they’ll do an initial export and then hook into your streams, and basically re-index your data a couple different ways, like one columnar, one search, one sort of complex-filtering type index, so you can do aggregations and analytics, you can do some search. It’s pretty interesting how it complements Dynamo in that way, right; it’s not for your transactional purposes, ’cause it’s always going to be lagging behind, and you can’t do writes against it, that’s not what they do. But it’s pretty interesting for those secondary, Elastic-type queries; I think that’s a pretty interesting one as well. I had one client that had like a 6 or 7 terabyte table, and they were having trouble doing it in Elasticsearch. They started using Rockset, and it saved them a ton of time, so it’s interesting stuff.
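As a sketch of that “keep Elastic narrow” pattern, a Lambda function subscribed to the table’s DynamoDB Stream might index only the few fields you actually search on. The endpoint, index name, and field choices below are assumptions for illustration.

```python
# Sketch: a DynamoDB Streams handler that mirrors only a few searchable
# fields into Elasticsearch. Readers search the slim index, then go back
# to DynamoDB with the returned keys for the canonical record.
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search.example.internal:9200")  # hypothetical endpoint

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            continue  # a real handler would delete the document here
        image = record["dynamodb"]["NewImage"]
        doc_id = image["PK"]["S"] + "|" + image["SK"]["S"]
        es.index(
            index="items-slim",  # hypothetical index of just the search fields
            id=doc_id,
            document={
                "name": image.get("name", {}).get("S", ""),
                "description": image.get("description", {}).get("S", ""),
            },
        )
```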
[00:23:10.03] BRIAN: Yeah, I’ve seen this one before. I have a poor man’s solution: stream into S3 and use Athena.
[00:23:16.15] ALEX: Yeah, and that’s another thing, yeah. The tricky part of that is if you have a lot of updates on your data, then there’s a mismatch between your OLTP fast-moving stuff and your OLAP side, whether it’s S3 and Athena or whatever; like, how do you handle little updates and stuff? You know, OLAP systems in general are not very good at updating small records. So if you have data that’s not changing a lot, that pattern works really well; if you do have some updates, then you’re gonna need some way to merge them in, and things like that.
[00:23:53.10] BRIAN: Yeah, it’s really only good for a store of data, I guess.
[00:23:57.08] ALEX: Yeah, yeah, yeah.
[00:24:00.09] SIMON: Fil, want to ask a follow-up question there?
[00:24:04.13] FIL: Um. . . different topic. I did have another question, and the topic is data migrations. Reading the book, I definitely took to heart your suggestion right at the start: “sit down and map out the data access patterns that you expect in the application.” That definitely helps a lot to get an initial idea of modeling the data, what kind of queries we’ll end up using, the performance characteristics and all that good stuff. But invariably, at some point requirements will change, and sometimes the schema, or the data access patterns that you came up with, lead to a schema that kind of backs you into a corner. And so, me personally, I’ve dealt with it basically by streaming records to a Lambda, transforming them and then dumping them into a different table, and then once that is done, maybe defining some new indices or whatnot to do a migration. I guess my question, from a very selfish perspective, is: I mean, that approach works, but is there any other way of doing that, that maybe you have tried, or really anyone else in the group here?
[00:25:32.05] SIMON: Yeah, and it’s okay to answer that question with your consulting rate, Alex.
[00:25:35.29] ALEX: No, no, this is great stuff. So, it’s a good question and probably one of the more common ones I get: how do I do migrations? ’Cause I’m telling you to define your access patterns upfront, but stuff changes, for sure. First thing I always say is figure out what type of migration you’re doing. If you’re just adding a new field to an existing entity, but it’s not something you’re accessing by, you can just do that in code; you don’t have to go update all your existing items, your code can just say, hey, if you get an item that doesn’t have this, what’s my default here, and put that in. That’s easy. If you’re adding a new type of item into your table that you didn’t have before, you go through the same steps as before, like, how am I going to access this item, and start writing it in; you don’t have to do anything else. But the hard ones are the ones you’re mentioning, right: I already have some items in there, and now I need to access them in a different way, how do I do that? I always tell people, hey, if that attribute is already on there and you just need to do an in-place thing, you can add a new global secondary index. Usually what you’ll need to do is sort of what you’re saying, an in-place ETL operation where you scan the whole table and identify the items that you need to change. So imagine you have users and organizations in your table and you only need to update users: you scan through your table, you find a user, and then you just do an in-place update on that item, add GSI1PK, GSI1SK, whatever you need to get it ready, and keep going. Once you’ve scanned through your whole table and added those things, you add your secondary index and you’re good to go. You know, sometimes you do have situations where you need to completely flip your table or redesign it in a lot of ways, and that would be more like the one you said, where, hey, we’re gonna scan our table, and instead of doing an in-place update we’re gonna write it to another table, get it ready that way, and then flip over at that point. So that’s definitely doable; it sort of depends on your needs, but yeah, make sure you’re clear on what sort of migration you’re doing and then handle it that way. One thing I wish Dynamo would do: you know, I was describing that 3-step process where you scan your table, you identify your items, you make the update on each item. I wish they had a way you could just do that on a table. Maybe you write a filter expression to say, hey, if I’m doing a scan, here are the items I want to find, and for the ones that match, here’s the function I want to run on them, an update or whatever, and then they just manage parallelizing that scan for you and doing it. Call it a table function or whatever; you could build that as, like, a Step Function that people can use and easily fire up. It seems like Amazon should be able to do that fairly easily, and then just charge you for the compute there.
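For concreteness, here is a sketch of that scan-and-backfill ETL in Python with boto3. The entity type, attribute names, and the GSI1PK/GSI1SK convention are illustrative assumptions; a production version would also batch writes, parallelize the scan, and handle throttling retries.

```python
# Sketch: in-place migration pass. Scan the table, find the "User" items,
# and backfill the attributes a new GSI will key on. Run this to completion
# before creating the index itself.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("app-table")

scan_kwargs = {"FilterExpression": Attr("Type").eq("User")}  # users only
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"PK": item["PK"], "SK": item["SK"]},
            UpdateExpression="SET GSI1PK = :pk, GSI1SK = :sk",
            ExpressionAttributeValues={
                ":pk": f"ORG#{item['OrgName']}",    # hypothetical new access
                ":sk": f"USER#{item['Username']}",  # pattern: users by org
            },
        )
    if "LastEvaluatedKey" not in page:
        break  # finished the whole table
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```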
[00:28:17.13] BRIAN: And you only do this like once or twice, which makes it a bit annoying to write–
[00:28:21.29] ALEX: Yeah!
[00:28:22.07] BRIAN: Do this thing, now you got this code. . .
[00:28:26.06] RYAN BLOCK: Yeah, I have done this at Begin a number of times across a variety of tables where we needed a GSI or something like that, and done exactly what you just said: scan through the database, move through it, and I actually wound up. . . but I have a question here. Because of the sheer terror of doing that, I would always collect all of the objects and then write them to JSON on disk, so I have a backup just in case I screw it up. And so the question is: I followed that pattern, I believe that it is the right way to do it, and I also find it terrifying, for the reasons you’ve already kind of (inaudible) can solve them. What drugs should I take to manage that terror?
[00:29:09.21] ALEX: (laughs) I don’t have any good recommendations on that one. But you should definitely be doing something, for sure. I totally agree, there is some terror, and then, as Brian was saying, the infrequency of it, where, I mean, it feels like someone at AWS should solve this, with error handling and all that sort of stuff, have a debug mode; I want to run it in debug mode and see what my stuff would look like, and make that easier to do. In terms of doing backups, you know, you can trigger a backup right before you start, so at least there’s a backup you can restore from if you wanted to, Brian. You can also do an export to S3, which is gonna dump your entire table to S3 and give you a dump that way if you want. So just a couple things to help you, though I have nothing to take the edge off of actually running the operation, which is a little nerve-wracking.
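Both of those safety nets are near one-liners with boto3; a rough sketch, with placeholder names (note that the S3 export requires point-in-time recovery to be enabled on the table):

```python
# Sketch: two ways to get a safety net before running a migration.
import boto3

client = boto3.client("dynamodb")

# On-demand backup: restorable to a new table if the migration goes wrong
client.create_backup(TableName="app-table", BackupName="pre-migration-2022-01")

# Full export to S3: a dump you can inspect or reload out of band
client.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/app-table",
    S3Bucket="my-dynamo-exports",  # hypothetical bucket
    ExportFormat="DYNAMODB_JSON",
)
```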
[00:30:02.26] RYAN BLOCK: Yeah, it’s true. I mean, you know, the Dynamo backup solution is not amazing, right, because when you restore, you’re restoring the backup into a different table; if we’re talking about production data, there’s guaranteed downtime, and then you have to sync between the tables, we lost track of (inaudible) table. So in my experience, it’s been like, you know, just (inaudible) experience (inaudible) but anyway. I appreciate that feedback.
[00:30:28.18] BRIAN: I know that you’ve done this, Ryan, ’cause this is what I do too. I often will also script this out as a series of unit tests and then play it against DynamoDB local, just to be “sure” sure that this is gonna happen the way that I think it’s gonna happen. But you’re still putting in all these checks, because for every possible operation, there’s a possible throttle. If that throttle happens midway through, you’re fucked, so you have to make sure you’ve got your retries, you’re testing everything; I do it in regular transactions, and they’re in a timely spiky little. . . There’s just so much to worry about there for everyone, so yeah, they could help with the anxiety of the task.
[00:31:07.21] ALEX: Yeah, yeah.
[00:31:12.20] SIMON: Well, Ryan, do you have anything else that you want to ask?
[00:31:15.25] RYAN BLOCK: Yeah, I had another question. So, you know, obviously, maybe not obviously, but we’re pretty big DynamoDB users and fans, for many reasons: it is an amazing tool, the guarantees that it gives, the. . . you touched on all the things that you just don’t have to think about. And certainly, you know, an aspect of that is, once you get over that initial bump, you’re just in this amazing promised land. One of the things I kind of pogo on, this week at least, is DynamoDB. Why has no one cloned, sorry, I mean, people have cloned other very (inaudible) tools and APIs out there too. Why is, you know, I mean, when you think about SQL, SQL is like a generic thing, right; you learn SQL in theory and then you can apply that to a variety of systems. Why do I feel like Dynamo’s semantics, or its guarantees, have not been cloned?
[00:32:32.17] ALEX: Yeah. That is pretty interesting. I would say a couple things. Well, first of all, there is a little bit: ScyllaDB, which is like a C++ version of Cassandra basically, they did implement a DynamoDB API on top of that, so if you wanna use ScyllaDB, you can use the DynamoDB API at least. But you’re right, we haven’t seen a lot of true copies of it. A couple reasons off the top of my head. DynamoDB is still not as popular as I think it should be, right; if you look at DB-Engines it’s 15th or so, you know, Mongo’s going to be 6th or 7th in the ranking, quite a bit more popular, same with Elastic, things like that. And then obviously it’s dominated by relational databases as well. So it might not be quite popular enough to really get some of that, whereas Mongo’s got a couple of copycats, whether it’s DocumentDB from Amazon, or Cosmos on Azure, which has a Mongo API layer you can use on top of it. So Mongo’s just a little more popular, and some of that. The other part is there’s a lot of technical stuff happening under the hood. Dynamo is partly API innovation and things like that, but also the technical guarantees they can give you, right, and part of that is basically guaranteed scaling, and also the fact that they keep these small partitions. So DynamoDB partitions are max 10 gigs behind the scenes, whereas, I dunno, I’ve worked with Mongo deployments where we had Mongo shards around 200 gigs, and it’s just hard to manage all those; you see people with much bigger shards and things like that. So, you know, if it was open source or something like that, now you need people who know how to manage all these different partitions shifting around. And then Dynamo’s also got these hard, strict limits around partition throughput, right, or just the guarantees they can give you around how much provisioned throughput you have or how much your partition can handle, things like that, where if you don’t have the whole package, it’s not as good of a benefit. If you had just the DynamoDB API, but without the hidden guarantees around performance, around capacity and predictability, it wouldn’t be as compelling a package, and teams would have to field the technical expertise to really manage that. So maybe that’s part of it. I dunno. That’d be sort of–
[00:35:14.09] BRIAN: Someone did, they did get a clone, but there’s a weird sequence of events here. First, there was a white paper called the Dynamo Paper. When that came out, a whole bunch of things happened right after; Cassandra and CloudDB were definitely inspired by that paper. Amazon then released the DynamoDB implementation later. The new implementation is not quite the same as the paper, and I don’t know if people know that; that may be part of it too. And the data modeling patterns are portable as a result, but the query libraries are not that portable, as the semantics are wildly different across implementations. The whole benefit of Dynamo is the hosting. And the fear of lock-in is more because of what Ellison has done with Oracle than because of anything Amazon’s gonna do; maybe they will in the future, but they kind of have a good track record of keeping the prices low so far. So, I dunno, I think that’s a big one that’s tripped me up, though. People want it to be open, and it’s not. Full stop, it’s just not. And so that’s a big stumbling block.
[00:36:41.19] ALEX: Yeah, yeah.
[00:36:42.07] RYAN BLOCK: I think my question may have also conflated two concepts, which are the guarantees around Dynamo and the semantics of Dynamo. And I think the semantics are, in some ways, more important than the guarantees. So I’m going to go back to what Brian said, you know, about the technical familiarity. To learn the semantics is challenging, and then to implement them is to re-implement what’s already implemented. So, you know, I think maybe folks might be more amenable to adopting Dynamo if they felt like, well, okay, we can swap Dynamo out afterwards for some other solution, if it was the same (inaudible) just a matter of swapping (inaudible) or (inaudible). So I guess my follow-up question is: do you feel like there’s a mixture of (inaudible) Dynamo semantics (inaudible) such that you could clone the semantics without cloning the guarantees that it makes as well?
[00:37:46.03] ALEX: Yeah, in some sense. The guarantees around Dynamo are interesting, because I think it’s the only database I know that is really that strict about the guarantees of what you can and cannot do, right, and how it’s going to act on more data or less data; it’s just very clear about that, and understandable. And a lot of that is based on how the primary key is implemented, which is similar to how the primary key is implemented in Mongo or Cassandra and the other ones Brian’s mentioning. But then the API restrictiveness, the combination of that technical underpinning with the API restrictiveness and how it sort of forces you into that, I think, is what gives you that predictability, and I don’t know anything else that can really get towards that. So, yeah, that’s what I would say–
[00:38:42.22] BRIAN: It is a small API.
[00:38:45.15] ALEX: It’s very small. Like yeah, right?
[00:38:48.04] BRIAN: That’s the best part!
[00:38:57.25] ALEX: Yeah, yeah. One thing that was interesting, ’cause this is kind of a related story, but I was talking with someone at re:Invent who works with. . . it’s more Cassandra and Amazon Keyspaces, actually. He was talking about how some of the open source Big Data technologies can have a little bit of trouble with Keyspaces, because of how Keyspaces behaves if you try and overload it, especially from a Big Data tool that’s, like, reading a ton out or writing a bunch back into Cassandra to bulk-load the results of a big ML job. Keyspaces, if it’s over its throughput, will just reject the request immediately, back at you, whereas open source Cassandra would basically just kind of hang until (inaudible). All the Big Data tools are just like, hey, throw all the data at it, it’s not gonna reject immediately, it might just go slow, but who cares, it’s a batch process, it’s happening in the background, nothing’s gonna matter here. And now they’re getting all these errors from their job because they’re using Keyspaces, and Keyspaces is just like, hey, if we can’t satisfy it, we’re gonna throw it back at you, and they don’t have the retry mechanism, because that’s just not how Cassandra works. So it is kind of interesting to hear about that different approach, even between Cassandra and Keyspaces, which are so similar to Dynamo but also very different in a few key ways.
[00:40:14.27] BRIAN: That’s funny, ’cause that’s like a direct hangover from the leftover provisioned capacity that we don’t have to worry about, Alex–
[00:40:22.19] ALEX: Yeah.
[00:40:23.20] BRIAN: It’s a fallacy. People think, I have four servers, I’m good. And that’s bullshit. You also don’t know where you’re good, so you just don’t know what the limits are. With Dynamo the quotas and limits are published, you know, along with the SLAs, so you can design for the throttling, which will happen. It’s fully managed, and I guess this opens the longer discussion we’ve had on Twitter about whatever we call deployment on a VPC today, and I personally have no desire to go back to a VPC. Knowing what things cost and what the limits are is kinda nice, almost like engineering or something.
[00:41:05.11] ALEX: It makes me mad about these other things that don’t happen to have those limits published; it’s just like, how do you have any confidence in what happens next? I know it’s gonna be a Saturday night when I get paged on this, because there’s a limit, like you said; there’s always a limit, we just don’t know what it is, and we’re gonna hit it someday. So. Yeah.
[00:41:22.19] BRIAN: Yeah. This is the great joy of the whole moving-to-the-Cloud thing: you know how many transactions are supported by the API, ’cause it’s an SLA I have with Amazon, and I trust they can make it, as opposed to three servers on Black Friday that get blown out of the water, and it turns out you gotta have four.
[00:41:48.07] SIMON: Well I’ve been in a fugue state since somebody mentioned Oracle. Hopefully, I haven’t missed anything important. But I do know that Ross asked a question, so while Alex is working on getting his camera back, Ross, do you want to jump in?
[00:42:01.27] ROSS: Yeah, sure, I have a bunch of random notes I’ve written down here to talk about. First I want to say thanks, Alex, of course, and everybody else. Full disclosure, I actually haven’t read the book; I watched the videos, and I’m very appreciative of those. I learned a bunch through the videos, so thanks for that, very good help. Obviously, I come from a “normalize everything” background, and without the book and the help it would’ve been a hot mess, so super, super helpful. I think, though, probably the top topic I could talk about right now is really Redis versus Dynamo. I’m looking to store, I guess it would be considered temporary data, smaller things, like literally key-value pairs for different serverless things, state for Lambdas, things like that. I mean, Dynamo is so easy to spin up, especially with something like Arc, when you don’t even think about it, you just kind of get it, right, it’s just data. So is there any reason to opt, like, is there any reason to choose Redis?
[00:43:15.09] ALEX: Yeah, so, I mean, I love Redis too, I think it’s super interesting. You know, if you’re super latency-conscious, Redis is going to be faster than Dynamo for some of those things. I generally don’t reach for it as much anymore just because it’s a bit of a hassle with Lambda functions, right, because now you need to have a VPC and your functions need to be in the VPC; you know, they’ve improved some of the VPC stuff, but I think it makes the deployment a little trickier, deployments take longer, sometimes the (inaudible) take longer, it’s just a hassle. So I generally tell people, avoid the VPC, and thus avoid Redis, if you can, with serverless architectures. It’s not that it’s impossible, it just adds some complexity, but that’s what I’d say. I do like some things about Redis; it’s got some really interesting data structures, a few things that Dynamo doesn’t have, which is pretty interesting, and then of course the speed of it is so interesting. And finally, I’ll just mention the persistence story around Redis, you know. . . not that your Redis cluster is going to be blown up all the time and die, but just having a memory-based store, as compared to having Dynamo, means if something does happen you might be losing data. There is MemoryDB from Amazon now that helps with some of that. So, a few different trade-offs there, but I think in most cases, for sure with serverless architectures, I just go with Dynamo for a lot of those things.
[00:44:50.21] ROSS: Same for caching and most things you’d need Redis for. So, I don’t know why I feel weird about this, but I’ll ask it, since I went down this route: why do I get an odd feeling when I have a single item tied to a partition key? Like, I see the partition key, right, or the sort key, and there’s just one item. There’s always gonna be one item. I’m never storing more than one thing in a partition, and something about that bothers me, so I go back and look at my data. Well, I’ve got this 10 gigs. What am I gonna do with it, you know? I don’t know why that bothers me. Do I need to just get out of that mindset, or, like, what is that?
[00:45:31.18] ALEX: Yeah, absolutely. Find peace with that. Honestly, I think one issue I see with people, when they first start seeing single table design and how that works, is “Oh, I’m gonna jam all my related items into a single partition.” And I say, hey, if you never access items together, they should not be in the same partition. If you don’t have a query that fetches those two items together, you know, whether they’re different types or whatever, don’t put them together. Because when you split them up, you’re spreading them across partitions, and you’re allowing Dynamo to help with that scalability; it can chunk up your partitions a lot better and spread that load around so you don’t have to worry about it as much. So, you know, share a partition only when you need it, (I’d say) as a last resort, when you actually need it because you want that query that fetches them all together. But if you don’t have that access pattern, split them up, let Dynamo help you.
[00:46:24.07] ROSS: So if you’re not doing, like, begins_with or contains queries on sort keys to get multiple results, then you’re probably doing it wrong by storing them in there.
[00:46:33.17] ALEX: It doesn’t have to be begins_with, you know, ’cause you can have a query that says, hey, give me everything in this partition; it could be a user and that user’s orders, you can have that happen. But if you don’t have a query that fetches multiple items from the partition, if you don’t have a query that fetches together whatever items you jam into a partition, if you never have a query that does that, then I wouldn’t put them together in the same partition.
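For illustration, that “user and their orders” query is a single call against the partition key; the key names follow the hypothetical layout sketched earlier.

```python
# Sketch: the one access pattern that justifies co-locating a user with
# their orders. A single Query on the partition key returns the profile
# item and every order item, sorted by sort key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app-table")

resp = table.query(KeyConditionExpression=Key("PK").eq("USER#alice"))
for item in resp["Items"]:
    print(item["SK"])  # ORDER#... items first, then PROFILE (lexicographic)
```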
[00:47:01.17] ROSS: Very cool. That’s helpful. I don’t want to take up too much of Alex’s time, so I’ll throw my last one on there. In the videos, everything seemed heavy on GSIs for indexes. I didn’t see much on LSIs at all, and it kind of gave me the perception that they’re bad. Is there some reason? I don’t know; honestly, my breadth of Dynamo knowledge is your videos, and they don’t talk about them, so I don’t know what they are, so I don’t know when or when not to use them, so I just haven’t at all. Can you talk a little about the difference there and why we might care about one over the other?
[00:47:33.14] ALEX: Yeah, I should’ve probably covered that, but I didn’t, so. I know I have a post on how to choose your secondary index on DynamoDB Guide, if you want to check that out. But the big thing there: LSIs are just restrictive in a lot of ways, so I usually tell people, hey, these two exist, but almost always go with GSIs. The downside of GSIs is they have to be created when the tables are created. They also have to use the same partition key as your table, and a lot of times you need to re-partition your data.
[00:48:10.26] RYAN BLOCK: Can I just ask a clarification? It sounded like you just said that GSI need to be created when the tables are created?
[00:48:21.26] ALEX: Yeah, I must’ve misspoken. So yeah, LSIs need to be created when your table’s created. GSIs you can add when you create your table or anytime down the road. So that’s the big downside of LSIs; you also can’t remove an LSI later if you need to, it’s just, you create it and it’s there forever. The bigger difference, (the hard one is) you have to have that same partition key, which is a huge limitation, ’cause a lot of times people need to re-partition their data in a different way for this additional (access pattern, and that really hurts them). An LSI is basically giving you different sorting but not different sharding, right, and that’s pretty limiting. And the last thing that’s tricky with LSIs is they enforce an item collection limit of 10 gigabytes maximum. When we talk about an item collection, that’s all items with the same partition key. If you have an LSI, you cannot have an item collection with more than 10 gigs of data, and that includes items (in the main) table and items in your LSI, which share the same partition. So, you know, if you had a user, Ross, whatever, with all your orders and things like that, and you have 5 gigs in your (main) table and 5 gigs in your (secondary) index, and you tried to write a new item in, you couldn’t do it; the table actually blocks that from happening. That’s not gonna happen with a global secondary index. That’s a huge, scary limitation, so it makes me not want to use LSIs. The main reason you might want to use an LSI is that you can get consistent (reads on an) LSI if you need them, whereas a GSI is asynchronously replicated, so you’re only getting eventually consistent reads, with a delay of probably sub-500 milliseconds to a second, somewhere below that. It’s usually pretty close, but you can get some inconsistent reads there, whereas with an LSI you can actually get strongly consistent reads if you need them.
[00:50:19.24] ROSS: All right. So I guess, still stay away from them unless you really have a consistent-read pattern that you need that isn’t solved by your original PK and SK.
[00:50:32.23] ALEX: Yeah, correct. One other thing that’s sort of nice, but I don’t think it’s that big a factor, is that LSIs share throughput with your main table, whereas a GSI is provisioned separately. It can feel like a little less to manage, in that you’re not managing throughput for your main table and for the LSI separately, right, but it’s not that big of a deal, and if you’re using on-demand it’s pretty much not going to matter. So yeah, the other big one is the consistent reads; if you need those, LSIs can do that.
[00:51:13.11] RYAN BLOCK: You mentioned that you had (inaudible) such a massive (inaudible) LSIs have a 10 gig limit as well, which GSIs do not have?
[00:51:24.00] ALEX: So it’s per item collection, which means per set of items with the same partition key; that’s where the 10 gig limit comes in. But yeah, with LSIs, that’s strictly enforced. So you cannot have more than 10 gigs of data with the same partition key, and that combines your (main table and) secondary index where that partition key’s LSI lives. You can’t exceed that; it’s actually going to block the write if it happens. Whereas if you have a table that doesn’t have any LSIs on it, but has some GSIs, you won’t have that limit at all. Dynamo’s generally, behind the scenes, segregating your data into partitions of 10 gigs, but that’s not really apparent to you or shown to you unless you’re using LSIs, where it’s going to come up.
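A sketch of the timing difference discussed above: an LSI can only be declared when the table is created, while a GSI can be added to a live table later. Table, index, and attribute names below are placeholders.

```python
# Sketch: an LSI is declared at CreateTable (same partition key, alternate
# sort key) and is fixed for the table's lifetime; a GSI can be bolted on
# later with UpdateTable and can use a brand-new partition key.
import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="app-table",
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "CreatedAt", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "ByCreatedAt",
        "KeySchema": [
            {"AttributeName": "PK", "KeyType": "HASH"},         # same partition
            {"AttributeName": "CreatedAt", "KeyType": "RANGE"},  # different sort
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
)

# Later, on the live table: add a GSI that re-partitions the data entirely
client.update_table(
    TableName="app-table",
    AttributeDefinitions=[{"AttributeName": "GSI1PK", "AttributeType": "S"}],
    GlobalSecondaryIndexUpdates=[{"Create": {
        "IndexName": "GSI1",
        "KeySchema": [{"AttributeName": "GSI1PK", "KeyType": "HASH"}],
        "Projection": {"ProjectionType": "ALL"},
    }}],
)
```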
[00:52:17.06] SIMON: So, I know that Bojan wanted to jump in with a comment, they’re having some problems with their camera as well but their audio is okay, Bojan?
[00:52:24.07] BOJAN: Hello, can you hear me?
[00:52:25.13] ALEX: Yeah, yeah, we can hear you.
[00:52:27.07] BOJAN: Yeah, I gave permission to use the camera, but for some reason it doesn’t work. Yeah, thanks everybody, first time in your Discord and everything, and everything is great. So I had a few questions. The first one was related to innovation, but you’ve already asked exactly what I wanted to ask, so it’s already answered. The second one is more like: Alex, do you have any sales or pitching tips for leadership? Because I find it very, very difficult to pitch DynamoDB, to switch my company to it; the default is basically to ask, did somebody else do that, right? While (inaudible), you know, it’s like, Lambda and Dynamo? Oh wait a minute, is it real? Can I ask, you know, people who have ever used it, right? So, you know, basically it boiled down to two months of talk, and it was like, eh, still not enough; basically it’s never enough to tell somebody it’ll just work, right? So, any tips for pitching to companies: why Dynamo, why not something else? How do you approach this?
[00:54:05.26] ALEX: Yeah, yeah, for sure. Good question. First of all, welcome to the Arc Discord, and make sure you check out Arc, great folks. But yeah, it’s a question that I think a lot of folks run into. I think it is tricky for Dynamo because, you know, even with some proof of concept it’s hard to really show, because I think a lot of the benefits of Dynamo are back-loaded: you don’t have that operational burden, you have that predictability, stuff is not going to blow up on you, you know what happens. But the costs are sort of front-loaded. So in my experience, I think the great thing about Dynamo is that it makes its trade-offs very clear, what it’s really good at: it’s good at predictability and consistency, knowability, the fully managed nature, same with a lot of the Lambda-type stuff, because you’re just bringing in serverless technology. Those are going to be the big benefits you have there. But, you know, be honest about the costs as well. Say, hey, Dynamo is going to take a little time to learn, it’s gonna be different from a relational database; it’s not impossible, but we’re gonna have to take some time to do that. Second, it’s gonna have a little less flexibility than relational databases, and that might be frustrating. Honestly, you approach them however you can: you tell them the trade-offs and you say why you advocate for it. . . And, you know, in terms of the serverless piece as a whole, you can make a proof of concept and say, hey, I have this auto-scaling compute and fully managed database, and here’s the thing, I did it in like a day. Just spinning that stuff up is pretty interesting and nice to show. But some of it, especially around Dynamo, is they gotta be willing to wait for those back-loaded benefits and take a little bit of time upfront.
[00:56:10.10] BOJAN: Yeah, and I’m also hoping, like, just last week we had a spike because of COVID changes, 5X, and of course my service, a REST service on Lambda, scaled 5X no problem; but because this service depends on another service, we saw some issues. Basically the downstream is where the infinite-scale story ends.
[00:56:35.12] ALEX: Yeah.
[00:56:36.10] BOJAN: But it’s a good problem, (inaudible) be fine, right? And just a last question, on scaling: do you have any recommendations on at which point to ask Amazon for discounts or some credits? I know there’s some–
[00:56:58.24] BRIAN: You gotta ask them all the time, everywhere. People will throw credits at you. So Twitter, AWS summit, they will give you credits.
[00:57:10.15] BOJAN: We already have one table that is, you know, multiple terabytes, and basically costs like $5000 a month, and it’s only going to get bigger. And, I mean, I’m very hesitant to spin up a copy of it with similar traffic and everything. So, but yeah, I know that at some point someone will ask, why does this cost tens of thousands of dollars when Lambda costs hundreds of dollars, right? The (database) is like 90% of the cost. So what’s the usual take on this, on-demand versus provisioned, I guess is what I’m asking. Any tips for that?
[00:57:56.28] ALEX: In terms of getting credits or negotiating discounts, I don’t have a lot of visibility into that; I don’t really touch the bills too much or specifics like that. The one thing I would say is, if you’re running $5000 a month, it might be worth seeing if it makes sense to move off on-demand, especially if that usage is somewhat predictable. You know, and I say “somewhat predictable”: a lot of people find that setting provisioned capacity at, like, 10 or 15% above their peak traffic and leaving it there actually saves a decent amount of money, even if it’s only using 20% of it during down periods and, you know, maybe 90% during high periods. It depends on how spiky your traffic is. I would take a look at that. I dunno if I’d look into auto-scaling for you; I’m sort of leery about auto-scaling, just ’cause it’s sometimes a little finicky to get right, so that depends on your patterns again: if they’re fairly predictable waves, rather than sort of spiking up, then that could be good for you too. But I would take a look at moving from on-demand to provisioned. You know, at $60,000 a year, it depends, but I would take a first cut at it and see; it’s worth spending some amount of time on it.
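As a hedged back-of-the-envelope illustration of why that switch can pay off: the rates below are the published us-east-1 write prices from around the time of this conversation, and the traffic numbers are made up, so check current pricing before acting on this.

```python
# Sketch: comparing on-demand vs. provisioned write costs for a hypothetical
# workload peaking at 2,000 writes/sec. Rates are illustrative us-east-1
# figures circa this recording; verify against current DynamoDB pricing.
HOURS_PER_MONTH = 730

monthly_writes = 2.6e9                         # ~1,000 writes/sec on average
on_demand = monthly_writes / 1e6 * 1.25        # ~$1.25 per million write units

wcu = 2300                                     # peak 2,000 + ~15% headroom
provisioned = wcu * 0.00065 * HOURS_PER_MONTH  # ~$0.00065 per WCU-hour

print(f"on-demand:   ${on_demand:,.0f}/month")    # ≈ $3,250
print(f"provisioned: ${provisioned:,.0f}/month")  # ≈ $1,091
```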
[00:59:14.20] BOJAN: Yeah, it’s a calculation of your (inaudible) heavy, but there’s no right, so they kind of put it in the paper, it’s like (on demand) kind of see (inaudible) every member setting up all the scaling of this (inaudible) on demand.
[00:59:32.26] ALEX: Yeah, yeah. I really wish they would do a combined provisioned/on-demand: “I wanna have this much provisioned and I’ll pay for it no matter what, but any request that comes in above that, just charge me on demand, double the on-demand rate even, just never reject a request, basically.” That’d be awesome.
[00:59:55.27] BRIAN: I wanted to chime in on this one, just as a hint. Amazon is always looking for positive stories you’re willing to share; they really, really need customer stories, and they will happily trade lots of credits for those stories. Ping your local dev rel or user group, get their face on Twitter, and you can get pretty neat discounts just in the way of credits. The other thing here: if senior leadership is freaking out over how expensive the database is, ask them how much a DBA is, ’cause Dynamo looks expensive until you shard a classic relational database system that needs a full-time DBA to manage; then Dynamo looks pretty cheap.
[01:00:43.07] BOJAN: No, we’re pretty cool about tech expenses. I just assume some people will ask at some point, okay. Thank you so much, Alex, I really appreciate your work.
[01:00:56.19] SIMON: Yeah, thanks for joining the Discord for this, Bojan, and thanks for sharing that story, the proof of concept that didn’t immediately go into production. That was a first-time thing for me. And my colleague, Ryan Bethel’s been waiting so patiently to get a question in here, so let’s just throw it over to Ryan before we have to let Alex go.
[01:01:18.17] RYAN BETHEL: Hi Alex, thanks a lot. What do you use when you’re designing schemas, like, what tools? ’Cause it helps to visualize them, you know, like even the schemas that you have in the book. Do you just do that in a spreadsheet, or what tools do you find helpful?
[01:01:36.22] ALEX: So, I’ve gone sort of around and around on this. You know, I use NoSQL Workbench a lot when I’m demonstrating stuff to other people, like when I’m making slides or need to demonstrate a bigger concept of how something goes together or how I’m relating things; I’m pulling in Workbench images and stuff like that. But mostly what I have is a data modeling doc that I want to release and write a post on how to use, and it basically blocks in these steps. First, describe your application, what it’s doing, the different entities you have, some of the constraints on them. Then, say, here are the different items you’re gonna have in the table, here’s how to model the primary key, here’s the pattern for the GSI to reach those items, all those sorts of things. If you have important attributes that are not in the primary key but are important some other way, like you need to use condition expressions or a counter or something like that, describe those somewhere. And then have an access patterns section, so for every single access pattern, you need to fetch a customer, here’s the query to use, here’s the PK or secondary index to use, how to use conditions or filter expressions, things like that. So, I just have a bunch of ad hoc type things. The biggest thing is, if you use this as a template, you have a design doc at the end, so that if someone wants to know about the application, what’s happening in it, they can see what the items look like, they can look at the access patterns, they can make sure that what they see in the code lines up with the design; they’ve got some built-out design doc on how that works.
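For a flavor of what the access-patterns section of such a doc might contain, here is a hypothetical slice; the entity names and key conventions are invented for illustration.

| Access pattern | Index | Key condition |
| --- | --- | --- |
| Fetch customer | Main table | PK = CUSTOMER#<id>, SK = PROFILE |
| Fetch customer’s orders | Main table | PK = CUSTOMER#<id>, SK begins_with ORDER# |
| Fetch order by id | GSI1 | GSI1PK = ORDER#<id> |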
[01:03:23.24] RYAN BETHEL: Awesome. Awesome, you should sell that, I would buy it. Make a template.
[01:03:30.20] BRIAN: Sounds awesome.
[01:03:35.01] ROSS: I would highly recommend Dynobase. I’ve been using it for a while now, and it’s been awesome. You can describe your models, it works really, really well, and the different views in it, it’s been. . . it was worth buying. I paid money for it, and it’s been great for just working together as a team on a model of the data.
[01:03:57.02] ALEX: Yeah, totally agree. Rafal has done great with Dynobase, and it’s totally awesome; he’s really taken a lot of great feedback on it. One thing I like about it, especially compared to NoSQL Workbench, I would say, is it covers more of the whole life cycle: it can help you with stuff up front, but it can also help you inspect your tables once they’re live. Talking about doing those migration operations, you know, if you have a smaller table you can actually just run your scan-and-update process directly in that space. So yeah, definitely check out Dynobase; he’s doing more, like, code-generation-type stuff, a lot of interesting stuff there.
[01:04:38.18] SIMON: Okay. I know it’s just after the hour here, Roxie had a follow-up question. Why is there a constraint that you have to create LSIs during table creation? Maybe that’s an easy one.
[01:04:52.02] ALEX: That’s a good question. I’m guessing it has something to do with that 10 gigabyte partition; it might be slightly different infrastructure underneath. Since an item collection can’t exceed 10 gigs, if you added an LSI after the fact, how would you handle it if you happened to already have an item collection that exceeds 10 gigs? Do you block the creation of the LSI? Do you do things like that? So I’m guessing it has something to do with that. And the reason they have that 10 gig limit, right, is they want to have consistent reads, so when the write comes in, they want to be able to go to the primary, write it to the main table, and also write it to that secondary index on the same partition, so they can do it pretty quickly, whereas the GSI writes are happening asynchronously. So I’m guessing that’s why. I also wonder if LSIs are just used less. I was actually looking at some of the history of DynamoDB features yesterday, preparing for today. So, first they released Dynamo with no indexes at all; then they released LSIs, I think it was a year and a half later; then GSIs were 6 months after that, so about 2 years after Dynamo launched. And GSIs at that time had to be created with your table, and then later on they allowed you to create GSIs any time you want, things like that. So I’m guessing that 10 gigabyte limit and how it changes the infrastructure is probably the real answer on why you can’t add LSIs after the fact.
[01:06:33.28] SIMON: Thanks for taking the time to answer that. And on behalf of everybody here, I just want to say thank you for spending over an hour with us; the book was amazing, and I’m so happy that we had a chance to talk with you. Folks, let us know how we did, what you think of this format. What should our next book club selection be? We want to know. So I’m going to pop over to my other computer and stop recording; if anybody wants to hang out and chat a bit more, you can do that too.
[01:07:02.24] RYAN BLOCK: Before we do that we should make sure to get the opportunity to have Alex plug whatever he wants.
[01:07:08.23] SIMON: Oh wow, man, my marketing sense has just gone away. I’m so sorry. If you haven’t gone to dynamodbbook.com, I don’t know what’s wrong with you. Alex?
[01:07:20.04] ALEX: That’s it, you know, go to dynamodbbook.com, check out my Twitter, my website. This has been great, I really appreciate it, and for everyone that bought the book, thank you so much. And if you have questions that didn’t get answered, feel free to hit me up on Twitter after the fact; happy to discuss stuff with you, just let me know.
[01:07:45.08] SIMON: Awesome. Thanks so much.
Don’t miss the next book club meeting
Stay in touch by:
- Joining the Architect Discord, where we will be hosting the book club video chat.
- Following the @begin Twitter account, where we will send out polls for future book club selections.
Or, if you prefer email, join the book club newsletter. We promise we will only use this mailing list for book club purposes, like meet-up reminders and book club selections. We do not sell your personal data.