“Web Scale”

Jacob Kaplan-Moss

October 28, 2010

Christophe Pettus:

What does [“web scale”] mean?

It clearly means something along the lines of, “Can handle lots of transactions per unit time,” but how many?

I mean, WordPress with WP-SuperCache is “web scale” if all that is meant is, “Can be used to implement a high volume site,” but I assume those who are touting something as “web scale” are aiming higher than that.

Anyone care to offer a quantitative definition of this term?

Since I tend to use “web scale” to describe the types of problems we try to tackle at Revsys, I figure I should take a stab at answering Christophe’s question.

Like nearly everything about our industry, there’s sadly a bit of hype in the term “web scale.” It seems that many like to use the term as a fancy synonym for “big,” and I think it’s that sloppy hyperbolic use that Christophe (and I) object to. It’s easy to say “ooh, we do a lot of traffic — we’re Web Scale!”

But I think this ignores a fundamental difference between traffic patterns on modern web sites and other sorts of traffic. Most successful web sites — and certainly the ones I’d call “web scale” — have a strong social aspect to them. These sites aren’t straightforward read/write operations, but instead exhibit strong network effects.

Why does that matter?

Two words: Reed’s law. Reed’s law states that “the utility of large networks, particularly social networks, can scale exponentially with the size of the network.” The proof’s pretty simple: if you have N people in a network, you have 2^N possible groups (i.e. subnetworks).

Now, Reed’s law talks about utility (value, roughly), but it isn’t hard to see that traffic across social networks follows similar laws. Users use networks more as they make more connections, and as the network size grows there are more possible connections to make. Each new user adds a lot more than a linear increase to traffic and resource use. Sure, it’s not literally 2^N growth, but the point is that sites exhibiting network effects see traffic grow at a rate far beyond linear.
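
To make those growth rates concrete, here is a rough Python sketch (an illustration, not part of Reed's argument) comparing the number of pairwise connections with the number of possible groups of two or more members. The function names are made up for this example, and the group count uses 2^N - N - 1, which drops the empty set and the N single-person "groups" from the 2^N total.

```python
from itertools import combinations


def pairwise_connections(n: int) -> int:
    """Number of distinct pairs among n users: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def possible_groups(n: int) -> int:
    """Number of subsets with at least two members: 2^n - n - 1."""
    return 2 ** n - n - 1


def count_groups_by_enumeration(n: int) -> int:
    """Brute-force check of possible_groups(); only practical for small n."""
    members = range(n)
    return sum(
        1
        for size in range(2, n + 1)
        for _ in combinations(members, size)
    )


if __name__ == "__main__":
    # Print users, pairs, and groups side by side for a few network sizes.
    for n in (2, 5, 10, 20):
        print(n, pairwise_connections(n), possible_groups(n))
```

For 20 users that works out to 190 possible pairs but more than a million possible groups, which is the gap between quadratic and exponential growth.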

To me, then, “web scale” describes the tendency of modern sites — especially social ones — to grow at (far-)greater-than-linear rates. Tools that claim to be “web scale” are (I hope) claiming to handle rapid growth efficiently and not have bottlenecks that require rearchitecting at critical moments.

The implication for “web scale” operations engineers is that we have to understand this non-linear network effect. Network effects have a profound effect on architecture, tool choice, system design, and especially capacity planning.

So I think “web scale,” despite the hype and hyperbole, is an important concept.

Comments:

Christophe Pettus:

That's a perfectly reasonable definition, but it raises a very interesting conclusion: There is no such thing as a "web-scale" tool; there are just "web-scale" architectures.

For one example, any data store (PostgreSQL, MongoDB, VSAM) runs as fast as it runs for any particular application and load set. What makes an application scale is the use it makes of that data store as its load increases.

Slapping a "web-scale" label on a tool is much more about marketing than about reality (and it's always easier to specify tools than build architectures, anyway).

Chris Smith:

Umm, so the number of possible connections in a set of n elements actually grows like n^2, not 2^n. That's a pretty massive difference. If performance demands really grew proportionally to 2^n, then the best strategy would be to take up corn farming instead of even trying to build "web scale" software. But hey, what's a little math between friends?

Geoff H:

"Exponential scaling" always worked for me.

JKM:

"There is no such thing as a "web-scale" tool; there are just "web-scale" architectures."

Bingo - well said.

I think though there *are* tools that encourage web scale architectures better than others. For example, I'm hugely skeptical of the NoSQL fad, but at the same time it's clear that the typical 3NF tends not to handle "web scale" as well as a denormalized layout. IOW, by making users operate at a lower level NoSQL tends to encourage more creative and optimal data layouts.

JKM:

Chris, we get 2^N because we're looking at the number of *sets of possible sub-networks*, not just the number of connections. I didn't make that clear myself, but the Wikipedia page does a good job. I'll edit the article to try to clarify a bit.

Jeff:

If a customer of Amazon's cloud is "web scale", what term do we have for Amazon itself? Or, perhaps, if Google is "web scale", what term do we use for "yep, I have a Google App Engine site or even a couple of them"?

People keep using the term "web scale" for companies that are existing on the scraps left over after genuinely large companies built their infrastructure.

I'm sticking with "web scale" means "we had to build our own datacenter". And I don't care if you use SQL Server, MongoDB or even MySQL in that datacenter.

Merely renting a dozen servers in a datacenter with a thousand servers, or renting space in Amazon's cloud, doesn't impress me as particularly vast resource requirements.

Thomas Sutton:

As far as interpretations of "web scale" yours is about as good as any I've seen, but I very much doubt that's what most people think when they write and read it. I'd expect (but, of course, it's just my relatively uninformed opinion) that most people interpret it in the hyperbolic "big" sense that you mentioned.

I think that we'd all be much better off using precise terms like Geoff H's "exponential scaling" than handwavey analogous terms that need to be analysed to try to understand them (but make much better marketing copy).

Jeremy Dunck:

"Exponential" is sadly meaningless to all non-technical people, as it's been misused in so many contexts that it's as clichéd as "up and to the right".

Meaning is hard, let's make stuff up.

Christophe Pettus:

Even though SQL databases are my life, I view the NoSQL trend pretty benignly. I mean, VSAM was pretty good in its day, too. :)

More seriously, there are applications for which an SQL database is just not appropriate. I'd never back up Twitter, for example, with an SQL database as the primary store; it just isn't the right fit for the problem.

My concern comes in putting the product selection cart before the requirements analysis horse. The danger in throwing around terms like "web-scale" for tools is that it promotes a sloppy, "Well, who *doesn't* want to be 'web-scale'? We sure do! We're going to be HUGE! Let's use a NoSQL database, because those are 'web-scale,' too" kind of thinking.

John Haugeland:

Connection utilities are actually governed by Beckstrom's law, not Reed's, and Beckstrom's has an upper bound on a fully connected network of the traffic model times the Metcalfe value of the network, which is (n*(n-1))/2, certainly not an exponential. Reed's law is ridiculous: it assumes every transaction made benefits every user. I haven't watched every YouTube video or read every Facebook comment; have you?

Beckstrom's law sets the network's value as the sum of the per-user values, themselves defined as the sums of per-user-pair transaction values divided by rates of interest amortized over time, minus similarly prepared costs.

Web-scale is just a bunch of NoSQL fappers pretending they understand database technology, which they don't. You grabbing random laws out of wikipedia and guessing that they might somehow justify the fappers isn't helping matters.
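
The description of Beckstrom's law in the comment above can be sketched roughly in Python. This is one reading of that description, not a reference implementation: each transaction's net value (benefit minus cost) is discounted to the present, summed per user, then summed across users. The `Transaction` class, the field names, and the sample figures are all hypothetical. `metcalfe_pairs` is the n*(n-1)/2 pair count mentioned in the comment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Transaction:
    benefit: float         # value the user gets from the transaction
    cost: float            # what the transaction costs the user
    years_from_now: float  # when it happens, for discounting


def present_value(amount: float, rate: float, years: float) -> float:
    """Discount a future amount back to today at the given interest rate."""
    return amount / (1 + rate) ** years


def user_value(transactions: List[Transaction], rate: float) -> float:
    """Sum of discounted net values for one user's transactions."""
    return sum(
        present_value(t.benefit - t.cost, rate, t.years_from_now)
        for t in transactions
    )


def network_value(by_user: Dict[str, List[Transaction]], rate: float) -> float:
    """Network value as the sum of the per-user values."""
    return sum(user_value(txs, rate) for txs in by_user.values())


def metcalfe_pairs(n: int) -> int:
    """Number of user pairs in a fully connected network of n users."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    sample = {
        "alice": [Transaction(benefit=10.0, cost=4.0, years_from_now=0.0)],
        "bob": [
            Transaction(benefit=5.0, cost=1.0, years_from_now=0.0),
            Transaction(benefit=110.0, cost=0.0, years_from_now=1.0),
        ],
    }
    print(network_value(sample, rate=0.10))
    print(metcalfe_pairs(len(sample)))
```

The practical difference from Reed's law is that this value depends on transactions that actually happen, so it has to be fed with measured data rather than derived from the user count alone.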

John Haugeland:

By the by, you can actually measure these things; your traffic model, for example, is almost certainly a zipf distribution, and zipf distributions - while they don't grow linearly - *scale* linearly (increasing the top end lowers the bottom end, and the sum volume works out to a linear increase.)

Back here in reality land, if you aren't working from measurements, you're just a blog masturbator.
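
As a rough illustration of "working from measurements", here is a small Python sketch that estimates a Zipf exponent from per-item request counts by fitting a straight line to log(count) against log(rank). It assumes you already have such counts from your own logs. The function name is hypothetical, and a log-log least-squares fit is a crude estimator rather than a rigorous statistical test of whether the traffic is Zipf-distributed at all.

```python
import math
from typing import Iterable


def estimate_zipf_exponent(counts: Iterable[float]) -> float:
    """Estimate s in count ~ C / rank**s via least squares on log-log data."""
    # Rank items from most to least requested, ignoring items with no hits.
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two non-zero counts")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope of ys against xs.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance

    # Counts fall as rank rises, so the slope is negative; the exponent is -slope.
    return -slope


if __name__ == "__main__":
    # Synthetic counts that follow 1000 / rank, so the estimate should be close to 1.
    synthetic = [1000.0 / rank for rank in range(1, 51)]
    print(estimate_zipf_exponent(synthetic))
```

An exponent near 1 on real data would be consistent with the classic Zipf shape; a poor fit is a hint that the traffic model needs a closer look.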

Rikard Kjellberg:

It seems "web scale" is a label you can put on a system that very quickly (and automatically) can ramp up new resources as the resource requirements increase (often exponentially). As such, Amazon EC2 can be called "web scale". A system running on EC2 may, or may not, be web scale. It would only be "web scale" if the app made intelligent use of the EC2 resources available. I would add that "web scale" implies: detecting need for additional capacity, adding that capacity, detecting a decline in capacity need, and removing unnecessary capacity. Properly implemented, any cloud-based application would be "web scale", IMO.
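
The detect/add/detect/remove cycle described above can be sketched as a single decision function. This is a minimal Python illustration under made-up assumptions: the function name, the utilization thresholds, and the server limits are all hypothetical, and it deliberately calls no real cloud API. It only computes how many servers you would want, given how busy the current ones are.

```python
import math


def desired_servers(
    current: int,
    utilization: float,
    target: float = 0.5,
    low: float = 0.25,
    high: float = 0.75,
    minimum: int = 1,
    maximum: int = 100,
) -> int:
    """Return the server count to aim for, given average utilization (0.0 to 1.0)."""
    if not low < target < high:
        raise ValueError("expected low < target < high")

    # Inside the comfortable band: no capacity change is needed.
    if low <= utilization <= high:
        return current

    # Outside the band: resize so the same load would sit at the target
    # utilization, then clamp to the allowed range.
    needed = math.ceil(current * utilization / target)
    return max(minimum, min(maximum, needed))


if __name__ == "__main__":
    print(desired_servers(10, 1.0))    # overloaded: scale up
    print(desired_servers(10, 0.125))  # mostly idle: scale down
    print(desired_servers(10, 0.5))    # in the band: leave alone
```

The band between `low` and `high` is there so the system does not add and remove servers on every small fluctuation; a real deployment would also need cooldown periods and actual measurements behind `utilization`.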
