Scaling Dispo to 10 million users

Mark Farnum
5 min read · Feb 27, 2021

Dispo, the new photo-sharing app from David Dobrik, is blowing up.

Last week, after interest surged in Japan, the iOS beta quickly hit TestFlight's limit of 10,000 testers. Soon after, the L.A. startup raised $20 million at a $200 million valuation. Now out of beta, the app is still invite-only, and the Twitter hype / FOMO is everywhere. All signs point to hockey-stick growth 🏒 a.k.a. ⤴.

Why is it exploding? 🏒

Before we get into the implications of scaling this kind of app, let’s ask the obvious: Why is it blowing up? What sets this app apart?

There are a few key features that make Dispo special:

  • Photos are taken through a tiny viewfinder, can’t be edited, and have to “develop” overnight. This approach helps users capture authentically instead of obsessing over the perfect shot, and speaks to Dispo’s motto, “Live in the moment.”
  • Photos can be organized into themed collections called “Rolls,” and Rolls can be collaborative. Rolls are the building blocks of Dispo’s community. People are creating group rolls with their friends to share experiences, and themed rolls for photos of cats, poems, etc.
  • That classic “disposable camera look.” Dispo adds a retro filter to every photo.

With that addressed, let’s look at some of the challenges this type of app could run into as it scales to 10 million users and beyond.


How to scale? 🧗‍♂️

Let’s put together a theoretical architecture for Dispo. Their job postings say they run Python on AWS, so we’ll start there. What’s important as we scale?

  1. Abstraction. We’ll need to respond quickly and automatically to explosive increases in traffic. Managing a large cluster of EC2 instances, even with the newer features for managing EC2 on ECS, takes much more overhead than running on ECS Fargate. Similarly, ECS itself takes more overhead than using a further abstracted service like AWS Lambda. In general, we’ll use the most abstraction we can that doesn’t sacrifice functionality.
  2. More abstraction! We’ll also abstract away the work of infrastructure management. Infrastructure as Code tools are key for this: IaC lets us create reusable, adaptable, and documented infrastructure. Terraform, CloudFormation, and Pulumi are a few good options (there’s a Pulumi sketch after this list).
  3. Stateless services. We’re going to favor horizontal scaling (more instances) over vertical scaling (more powerful instances), since it’s easier to quickly change the level of scaling that way. But this means our services can’t be stateful, since they may be created or destroyed at any time. We’ll need to rely on our database for all persistent state, and maybe a cache server like Redis to handle caching, pub-sub, etc.
  4. Security. Security is always a priority, from properly salting password hashes to ensuring that no service in the architecture has more permissions than it needs. We’ll follow the principle of “least privilege” as we create Security Groups and IAM Users and Roles (a password-hashing sketch follows this list).
  5. Load testing. We need to test our infrastructure to make sure it can handle the load we’re expecting. We’ll want to establish benchmarks at low load and compare as we add load. There are a number of tools for this, from managed cloud platforms like LoadView to open-source solutions like JMeter and Locust (a Locust sketch follows this list).
  6. Optimization. Slow queries, inefficient algorithms, and other optimizable issues will become increasingly significant problems as we scale. It’s important to address these, but also important to remember that “premature optimization is the root of all evil.” We should only optimize the things that matter, and for Python we can use tools like cProfile to diagnose where the time actually goes (example after this list).
  7. Availability. We can use multi-AZ load balancers to ensure we have enough redundancy to tolerate issues that are isolated to a single Availability Zone. Theoretically we could implement a multi-cloud strategy, but the cost/benefit likely isn’t worth it for a young startup like Dispo. We can use sharding for increased database availability, although that brings its own challenges.
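To make the IaC point concrete, here’s a minimal sketch of what a couple of our resources might look like in Pulumi’s Python SDK. The resource names are hypothetical, and a real stack would define much more (task definitions, networking, etc.):

```python
# infra.py - a minimal Pulumi sketch (resource names are hypothetical)
import pulumi
import pulumi_aws as aws

# S3 bucket where developed photos will live
photo_bucket = aws.s3.Bucket("dispo-photos")

# ECS cluster that our Fargate services will run in
cluster = aws.ecs.Cluster("dispo-cluster")

# Export the generated identifiers so other stacks/services can reference them
pulumi.export("photo_bucket_name", photo_bucket.id)
pulumi.export("cluster_arn", cluster.arn)
```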
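On the password-salting point, a library like bcrypt handles per-password salts for us, so there’s no excuse for unsalted hashes. A minimal sketch:

```python
import bcrypt

def hash_password(password: str) -> bytes:
    # gensalt() generates a unique salt per password; the salt is embedded
    # in the resulting hash, so no separate storage is needed
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def check_password(password: str, hashed: bytes) -> bool:
    # checkpw re-derives the hash using the salt embedded in `hashed`
    return bcrypt.checkpw(password.encode("utf-8"), hashed)
```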
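Since we’re a Python shop, Locust is a natural fit for load testing: it lets us script user behavior as plain Python. Here’s a sketch with hypothetical endpoints:

```python
# locustfile.py - a load-test sketch (the endpoints are hypothetical)
from locust import HttpUser, task, between

class DispoUser(HttpUser):
    # Simulated users wait 1-5 seconds between actions
    wait_time = between(1, 5)

    @task(3)  # weight: browsing is 3x as common as checking Rolls
    def browse_feed(self):
        self.client.get("/api/feed")  # hypothetical endpoint

    @task(1)
    def view_popular_rolls(self):
        self.client.get("/api/rolls/popular")  # hypothetical endpoint
```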
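And here’s what a quick cProfile session might look like, with a hypothetical apply_filter standing in for a real hot path:

```python
import cProfile
import pstats

def apply_filter(path: str) -> None:
    ...  # stand-in for a hot code path, e.g. the photo filter

# Profile the call and dump the stats to a file
cProfile.run("apply_filter('photo.jpg')", "filter.prof")

# Print the 10 most expensive calls, sorted by cumulative time
pstats.Stats("filter.prof").sort_stats("cumulative").print_stats(10)
```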

What services will we need? 🤔

We need to find the balance between monolith and microservices-for-days™️. A monolith is too coupled for our needs, and too many microservices is an over-optimization (waste of ⏱ and 💰).

  • User data API. Auth, CRUD for photos / rolls / comments / messages / other user data, WebSockets for persistent connection.
  • Photo filter. Every photo taken needs to have the Dispo vintage filter applied. Interestingly, this doesn’t need to be immediate because of the nature of Dispo’s delayed photo “development.” Sounds like a job for… a queue! We can queue photos as they come in and scale our photo filter service based on queue depth. We’ll use S3 for storage (a worker sketch follows this list).
  • Recommendation Engine. Let’s help our users find Rolls and communities that they can connect with and contribute to! We can run a service periodically that converts our network of users, photos, and Rolls into a graph structure that can be used to generate recommendations with collaborative filtering or ML (a toy example follows this list).
  • Notifications. Some notifications will be triggered instantly by user actions (“Natalie added a photo to your Roll!”, “New message from @Dlisscious”) and others will be scheduled based on user behavior (“12 of your photos have developed!”, “Mint an NFT now!”). We could use a notifications solution like OneSignal or roll our own if needed, but either way we’ll want to run a job periodically for non-instant notifications (a scheduler; sketched below).
  • Metrics. We can use Redis to keep track of the number of active users on the platform at any time (and many other metrics), but we’ll also want a service to periodically record these metrics so we can see how they change over time (sketched below).
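Here’s a rough sketch of the photo filter worker using SQS and S3 via boto3, with Pillow standing in for Dispo’s actual filter. The queue URL, bucket names, and message format are all hypothetical:

```python
# filter_worker.py - a sketch of the queue-driven photo filter service
import boto3
from PIL import Image, ImageEnhance

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/photo-filter"  # hypothetical

def apply_vintage_filter(path: str) -> None:
    # Stand-in for Dispo's real filter: slightly washed-out, low-contrast look
    img = Image.open(path)
    img = ImageEnhance.Color(img).enhance(0.8)     # desaturate slightly
    img = ImageEnhance.Contrast(img).enhance(0.9)  # soften contrast
    img.save(path)

def poll_once() -> None:
    # Long-poll the queue for up to 10 messages at a time
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        key = msg["Body"]  # assume the message body is the S3 key of the raw photo
        s3.download_file("dispo-photos-raw", key, "/tmp/photo.jpg")
        apply_vintage_filter("/tmp/photo.jpg")
        s3.upload_file("/tmp/photo.jpg", "dispo-photos-developed", key)
        # Only delete the message once the developed photo is safely stored
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Because the queue absorbs bursts, we can scale the number of workers on queue depth rather than on raw traffic.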
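As a toy illustration of the collaborative filtering idea, here’s the simplest possible version: score Rolls by how many of your co-members also belong to them. The data is made up, and a real system would work on the full graph:

```python
# A toy sketch of Roll recommendations via shared membership (hypothetical data)
from collections import Counter

# user -> set of Rolls they contribute to
memberships = {
    "ana":  {"cats", "poems"},
    "ben":  {"cats", "skate"},
    "cara": {"poems", "skate"},
}

def recommend_rolls(user: str, top_n: int = 3) -> list[str]:
    mine = memberships[user]
    scores = Counter()
    for other, rolls in memberships.items():
        if other == user:
            continue
        overlap = len(mine & rolls)   # similarity = number of shared Rolls
        for roll in rolls - mine:     # candidate Rolls the user isn't in yet
            scores[roll] += overlap
    return [roll for roll, _ in scores.most_common(top_n)]

print(recommend_rolls("ana"))  # ['skate'] - both similar users are in it
```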
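For the scheduler piece, one option is APScheduler. A sketch, with a hypothetical job body:

```python
# A sketch of the periodic job for non-instant notifications
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

@scheduler.scheduled_job("interval", minutes=15)
def notify_developed_photos():
    # Hypothetical: find photos whose overnight "development" just completed
    # and push a notification to each owner (e.g. via OneSignal's REST API)
    print("checking for newly developed photos...")

scheduler.start()
```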
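And here’s a sketch of active-user tracking with Redis, using HyperLogLog to count unique users approximately in roughly constant memory. The host and key names are hypothetical:

```python
import time
import redis

r = redis.Redis(host="cache.internal", port=6379)  # hypothetical host

def record_activity(user_id: str) -> None:
    # HyperLogLog gives an approximate unique count in ~12 KB per key,
    # which scales comfortably to millions of daily users
    day_key = f"active:{time.strftime('%Y-%m-%d')}"
    r.pfadd(day_key, user_id)

def active_users_today() -> int:
    return r.pfcount(f"active:{time.strftime('%Y-%m-%d')}")
```

A periodic metrics service can snapshot these counters into the database so we can chart them over time.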

Let’s build it! 🔨

Based on the above considerations, here’s our potential architecture:

[Architecture diagram]

Not pictured:

  • Pulumi for managing all of these resources
  • The Security Groups and IAM Users, Roles, and Policies that determine each resource’s permissions
  • VPC Peering connection between the AWS VPC and MongoDB Atlas
  • Load testing tools
  • Staging environment
  • AWS CloudWatch Logs, CloudWatch Alarms, etc.

Conclusion 📸

Whew! That was a lot of detail, but any of the tools or resources I mentioned above could easily be swapped out for another; those specifics aren’t that important.

What’s crucial is the approach you take when scaling. Know your architecture. Improve what actually needs improving. Listen to your users and serve your community. Don’t reinvent the wheel; use the tools available to you. And above all: move quickly while still doing things the right way.

