Jacob Kaplan-Moss

Work Sample Tests:

What doesn’t work (and why)

Previously in this series, I wrote about some “rules of the road” for effective work sample tests, and covered a bunch of types of effective tests. One thing I haven’t talked about is counter-examples: types of work sample tests that don’t work. I tend not to do this sort of thing: I find it’s usually more useful to talk about what does work than to pick apart what doesn’t. But here, I think it’s illustrative: looking at why certain kinds of work sample tests fail can help illustrate the principles of effective tests.

Contract-to-hire

In the intro to this series, I wrote

The way to know with absolute certainty that someone can do the job is to see them do the job. […] Of course, it’s not that simple. […] We can’t ask candidates to spend weeks or months working with us for free before deciding to hire them […].

Thus, I argued, work sample tests need to exist because we can’t work together before deciding to hire someone.

But, what if we could? What if we hired someone on a short-term contract – 3-6 months, say – and if everything worked out offered them a full-time position?

Well, this is absolutely an option. In many ways, it would be an excellent one. We could fill a position very quickly, by taking on some risk of a bad contract, but that risk would be mitigated by a clear end date. The candidate would get paid for their work, rather than having to do a bunch of unpaid work to get an offer. And if it works out, at the end of this contract period we’d have very strong evidence of whether this job is a good mutual fit.

So there’s nothing inherently wrong with contract-to-hire; in some ways, it’s nearly ideal. The problems come from matters of scale:

  • From a hiring manager’s perspective: you want to consider several candidates, right? You’d like to have a good number of applicants so that you can compare them to some degree and make the best hire for the team and role. It’s typical for me to send 5-10 candidates to the work sample test phase for any given role. At most employers, though, it’d be nearly impossible to bring on a half-dozen engineers on 3-month contracts1.
  • From a candidate’s perspective: you’d like stability, right? If you have two offers, one that’s a standard full-time job with benefits (a really big deal here in the US), the other that’s a 3-month contract with no benefits and no promise of employment after that, which would you choose?

The majority of the time, scale is part of the challenge of hiring: employers want to consider many candidates; candidates want to consider many employers. This means that although there’s nothing inherently wrong with contract-to-hire, it’s ill-suited for most hiring situations.

However, contract-to-hire does work extremely well in situations without this scale dynamic. This comes up when you’re trying to hire a specific person (rather than trying to fill a specific role). If the situation is “I want Jessie on my team” (vs “I need a Python developer”), it can make a ton of sense to start by offering Jessie a short-term contract. Other situations where short-term contracts work well are when you’re trying to hire someone who’s already a consultant or working part-time (and thus able to take on the contract easily), or when you’re dealing with a potential internal transfer.

“Starter projects”

When I worked at Heroku, we used a form of work sample test that we called a “starter project”. This is a long test, typically taking 2-3 days, during which candidates work alongside the team on a very real-world project.

At the time, I liked this quite a bit: you develop such a good feel for working together over a few days that it makes hire/no-hire decisions very easy. Similarly, it gives the candidate a very real look into how the team functions and whether they’d enjoy working there.

Today, though, I no longer think this is an acceptable practice. The time investment is simply too great (rule #2) to be fair to candidates. At the time, I justified this by noting that candidates had never pushed back on the time investment. Now, though, I realize that they may not have pushed back because of the power difference between candidate and employer. They may have wanted something less intense, but felt powerless to ask for it.

I think this process worked well for Heroku because we were a “hot” startup; a very desirable employer at the time. Candidates accepted this practice because they wanted to work for us. As Heroku’s marquee status has slipped, I wonder if they’ve been able to keep this practice.

I’d only use a “starter project” today with very substantial changes. In particular, I wouldn’t even consider this kind of project without compensation. I’d want to compensate candidates for their time at a generous rate: somewhere on the order of $1,000 - $3,000 per day. I’d likely need to make other changes, too, such as spreading the work out over a longer period to be more sensitive to people’s existing schedules, eliminating any on-site requirements, and more.

It isn’t that starter projects are somehow inherently “bad”. They’re extraordinarily good at predicting job performance – no wonder I liked the practice as a hiring manager! But the challenge of work sample tests is to balance inclusivity and predictive value, and starter projects fail deeply at being inclusive.

Algorithms/Data Structures Challenges

For at least 15-20 years, the most popular form of work sample test for programmers has been to ask the candidate to implement some sort of foundational algorithm or data structure. I call these “CS101 questions” because at most universities, the “101” class is the first serious class in a course of study, so “Computer Science 101” (aka “CS101”) is likely to be the class where you’d first be exposed to these kinds of problems. Some examples of these questions are:

  • Reverse a linked list.
  • Balance a binary tree.
  • Implement binary search.

These types of challenges have been incredibly popular, which is a real shame because they suck. I’ve worked in this field for over twenty years, and have never encountered a CS101-style problem. If I need a binary search, I’ll use bisect; writing my own would be laughable. The vast majority of programmers, like me, will never encounter one of these problems in their day jobs. This means these tests fail at the first principle of good work sample tests: they don’t simulate real work at all.
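To make that point concrete, here’s what the entire “binary search” problem looks like in real-world Python: a couple of calls to the standard library’s bisect module (the list and values here are just illustrative):

```python
import bisect

data = [3, 9, 14, 27, 52]  # bisect assumes the list is already sorted

# Find the leftmost index where 14 appears (or where it would be inserted):
idx = bisect.bisect_left(data, 14)

# A membership test built on the same primitive:
found = idx < len(data) and data[idx] == 14

print(idx, found)  # 2 True
```

That’s it: the “hard” interview problem reduces to knowing the standard library exists, which is exactly the skill the day job actually rewards.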

Instead, these types of tests become a measure of how much time the candidate spent preparing. Have they read Cracking the Coding Interview? Spent a few weeks grinding Leetcode? Then they’ll be more likely to pass one of these interviews. If they haven’t spent that time, they’ll do more poorly. These tests don’t predict future job performance; they measure privilege.

Dan Luu comes to the same conclusion from a different point of view. He’s worked in those rare roles where CS101 stuff is important, and even for those roles, he finds they make poor interview questions. Even in places where algorithmic improvement translates to serious organizational improvement, getting the algorithms right turns out to be the easy part. Getting those changes into production means lots of teamwork, collaboration, and office politics; far harder than any 101-level algorithm. Dan concludes:

If other companies want people to solve interview-level algorithms problems on the job perhaps they could try incentivizing people to solve algorithms problems (when relevant).

In many ways, everything I’ve written about hiring practices has been motivated by trying to present an alternative to these types of questions. They are incredibly popular, but they are ineffective and unfair. I don’t have any illusions about this changing soon: this practice is deeply ingrained in our industry. I hope I’ve shown a few people a better way.

Whiteboarding

Luckily, the other half of the algorithm interview anti-practice is on its way out. It used to be common to ask candidates to reverse that linked list by writing on a whiteboard. Today, though, we’ve mostly recognized how silly it is to ask candidates to code in dry-erase. Reams have been written about why this practice doesn’t work, and I won’t rehash it here.

Instead, I want to do something else. This is an article about practices that don’t work, and I hope you’ve picked up on a theme: these practices aren’t inherently evil; they simply fail to meet the goals of a work sample test. They don’t correlate with job duties, or they fail to be fair to candidates. I started this series with a framework for effective work sample tests because I wanted to avoid the trap of saying “X is good / Y is bad”; I want you to be able to think critically about the role and test you have in mind, and figure out from these principles whether the test will work.

Here’s what I mean: let’s take a test that is perhaps the worst possible case for most programming jobs: implementing an algorithm (e.g. “balance a binary tree”) on a whiteboard. Is there a situation where this would be an appropriate work sample test?

Sure! If you’re hiring someone to teach algorithms in a classroom setting, this would be a fantastic work sample test! Let’s examine it against that framework for effective tests:

  1. Does it simulate real work? Yes: the job involves writing (and explaining) algorithms on a whiteboard.
  2. Will it take under three hours? Easily.
  3. Can you offer flexible scheduling? Sure.
  4. Does it provide choice? It can: there are many ways to provide options (choice of algorithms, virtual learning platforms vs in person, whiteboards or blackboards2, etc.)
  5. Is the test the start of a discussion? As long as you debrief after, yes.
  6. Can this be offered without surprises? Yup: tell candidates ahead of time about the test. In fact, for a teaching position, giving them time to prepare will also help you measure their ability to prepare for a lesson, another skill that correlates with the real work.
  7. Can you test your tests? Of course.
  8. Can this be offered late in the hiring process? Yup.

See? Whiteboarding is good actually – as long as the test matches the role.

Questions?

That’s the final structured article in this series on work sample tests. I’ll wrap up next time with a grab-bag of details I couldn’t fit in elsewhere, and answer some questions.

If you have questions about anything in the series you’d like to see addressed in that wrap-up, send me an email or tweet at me.


  1. To see why it’s impossible, let’s game it out.

    First, you probably don’t have the budget to make 5-10 simultaneous contracts (at most companies, contracting budgets come out of a different place than salaries). Next, the paperwork alone would get you kicked out of most procurement offices; it’s often a ton of work bringing on new contractors. And I won’t even get started on the structure you’d need to build to make sure these people are really contractors and not considered employees.

    Even if you could somehow make it work, can you imagine how terrible the interpersonal dynamics would be? You’d have a handful of engineers, all on limited contracts, each knowing that only one of them would get the job in the end. It’d be like a season of Survivor, except with cubicles and desk chairs instead of sandy beaches.

    I guess you could try contract-to-hire in series, rather than in parallel: bring on one person at a time, and at the end of the contract either offer them a job or bring in another. But you’d like this to be efficient, so you’d want to contract with the most promising candidate first. How do you know who’s the most promising? We’re back at interviews and work sample tests. ↩︎

  2. ok that last one’s a joke. I think? Maybe teachers have strong preferences like programmers do about text editors? I dunno. ↩︎