Jacob Kaplan-Moss

Work Sample Tests:

Labs & Simulation Environments

Welcome back to my series on work sample tests. This post describes one of the kinds of work sample test I use (and suggest others use as well). For some theory and background, see the first few entries in the series: introduction, the tradeoff between inclusivity and predictive value, and my “rules of the road” for good work sample tests.

Background

As I’ve been saying repeatedly in this series, a work sample test should simulate real work as closely as possible. For roles that primarily involve writing software, that means writing software, and previously in this series I’ve explored several options involving writing (and reading) code.

But I’ve also hired for roles that don’t involve day-to-day coding: what about roles like security analysis, penetration testing, technical support, bug bounty triage, project or program management, systems administration, technical operations, and so on[1]? For those roles, I turn to simulated, “lab”-style environments. These are tests where you create an environment that simulates real work and turn the candidate loose in the simulation. Something like a ticketing system with support requests for the candidate to triage, or a web app that crashes under load for the candidate to debug, and so on.

Unlike the previous exercises I’ve covered, where I’ve tried to give enough detail for you to use one of those exercises more or less as-is, this is a post for more advanced readers. I’ll cover a couple of examples in some detail, and a few more as broad sketches. I won’t go into much depth on any particular test. The goal is some inspiration for you to develop your own lab environments.

Who this is for: lab-style work sample tests work best for roles that involve interacting with complex systems. Unlike software development, where the role involves building the system, these roles treat the system as a sort of black box. They’re trying to break it, or debug it, or otherwise interact with it. Or, the “system” could be a human system, as is the case with technical support, project management, bug bounty triage, and so on.

What it measures: designed correctly, these kinds of tests are terrifically accurate, since they can simulate the real job very closely. If you’re hiring a penetration tester and want to know that they can find vulnerabilities in a running system, giving them a running system and seeing which vulnerabilities they find is about as direct a measurement as you can get.

Examples

Here are some example labs I’ve used successfully. First, a couple of more fleshed out ones:

Interact with a ticket queue

Who this is for: roles like front-line customer support, bug bounty triage, IT help desk, some SOC roles, etc.

Set up: you create a ticket queue that simulates the kind of support tickets this role would need to process. I don’t recommend using your real ticket queue software; if the candidate is unfamiliar with it (likely), you risk mistaking difficulty with the queue software for an inability to perform the underlying work. Instead, I recommend something simple like a written document or spreadsheet.

The tickets you use can be real tickets (or bug bounty reports, help desk requests, etc.), or fake ones that you write. If they’re real, you’ll want to curate carefully – don’t just use a random selection from your queue. You want to make sure there’s a mix of easy and hard tickets, covering a representative range of your real work.

The test: have the candidate work through the tickets and take whatever “next steps” they’d need to take in real life. For example: validating a bug bounty report, or replying to a help desk request, and so forth. Once again, I recommend having the candidate do this in something simple like a document or spreadsheet, rather than having them use a real system.

You may want candidates to perform this test synchronously: it’s frequently the case for roles like this that speed matters; you may want to know how many tickets a candidate can process in an hour. I don’t recommend doing this by having the candidate work while on a video call with you, or in a room with you standing over them; that’s weird and too much pressure. Instead, give the candidate the outline of what you’ll be requesting, and a few examples so they can practice, and then have them schedule the test time. At the beginning of the scheduled time, send them the tickets. Have them send you their results at the end of the time block.

Speaking of that time block: this is an exception to my general three-hour rule (rule #2). Three hours is way too long for this kind of test; I recommend an hour, maybe ninety minutes, max. You don’t need more time than that to gather good data on how a candidate performs.

Post-test interview: like most work sample tests, you’ll want to have a conversation about the test; it’s not a simple pass-fail (rule #5). This can be a relatively simple debrief: ask about their approach, what went well, what they wish they’d done differently, etc. 30 minutes should be plenty for this debrief.

What behaviors to measure: the primary measurement you’ll make here is accuracy – how correct are the candidate’s answers? You’ll probably also be able to look at communication skills: how clear and professional are the candidate’s mock responses to requests? You may also want to measure speed – how many tickets did they get to in the hour? – but be careful with this one. You should only use it if speed is important for the role, and your standards should be far below what you’d expect of a fully trained and onboarded employee. Anyone you hire will get faster as they learn the ropes; you can’t expect candidates in a simulated environment to be that fast. If they can hit even half the speed you’d expect from a full employee, you’ve got someone who’s plenty fast.
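If you’re tracking results in a spreadsheet anyway, the arithmetic here is simple enough to sketch. This is only an illustration, not a scoring system I’m prescribing: the `TicketResult` and `summarize` names, the per-ticket correct/incorrect grading, and the `trained_rate_per_hour` baseline are all assumptions made for the example. The one number taken from the text above is the “half the speed of a full employee” threshold.

```python
from dataclasses import dataclass


@dataclass
class TicketResult:
    """One graded ticket (hypothetical structure for this sketch)."""
    ticket_id: str
    correct: bool


def summarize(results, minutes, trained_rate_per_hour, speed_fraction=0.5):
    """Summarize accuracy and speed for one candidate.

    speed_fraction=0.5 encodes the "half the speed of a fully trained
    employee" threshold; trained_rate_per_hour is a baseline you would
    have to supply from your own team's numbers.
    """
    attempted = len(results)
    # Accuracy: share of attempted tickets that were handled correctly.
    accuracy = sum(r.correct for r in results) / attempted if attempted else 0.0
    # Normalize throughput to tickets per hour so 60- and 90-minute
    # sessions are comparable.
    rate = attempted * 60 / minutes
    return {
        "accuracy": accuracy,
        "tickets_per_hour": rate,
        "fast_enough": rate >= speed_fraction * trained_rate_per_hour,
    }
```

For example, a candidate who works through six tickets in an hour, against a trained baseline of ten per hour, clears the speed bar comfortably; accuracy is then what you’d actually discuss in the debrief.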

Diagnose a buggy system

Who this is for: systems administrators, DBAs, performance engineers, and other kinds of roles with heavy operational workloads.

Set up: build a system (a web app, database) that’s deliberately slow or buggy for some reason. For example, you could set up a database with a ton of data and then delete some key indexes. Or build a web app that exhibits N+1 query behavior on some routes. Or write some very buggy code that crashes given certain inputs.

This can be tricky to get right: you want something that displays the problem in a way that a qualified candidate will spot, but not so quickly that it’s a five-minute fix. You may need or want to create a system with multiple problems, to give candidates various ways of finding issues.

You’ll probably also need to build some tooling to create and destroy environments for candidates: you don’t want to share environments between candidates, so you need to create new environments on-demand (and destroy them when you’re done).
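The shape of that tooling is roughly “create something isolated, hand over the access details, and guarantee teardown.” Here’s a deliberately tiny sketch of that lifecycle. It uses a temporary directory as a stand-in for a real environment, which in practice would more likely be cloud infrastructure; `candidate_environment`, the `ACCESS.txt` file, and the token format are hypothetical names made up for this example.

```python
import secrets
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def candidate_environment(candidate_id: str):
    """Provision an isolated environment for one candidate, then destroy it.

    A temp directory stands in for the real thing here; the point is the
    create / yield / always-clean-up structure, not the directory itself.
    """
    root = Path(tempfile.mkdtemp(prefix=f"lab-{candidate_id}-"))
    try:
        # Each candidate gets their own credentials; nothing is shared.
        token = secrets.token_hex(8)
        (root / "ACCESS.txt").write_text(
            f"candidate={candidate_id}\ntoken={token}\n"
        )
        yield root
    finally:
        # Tear down even if something goes wrong mid-test.
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    with candidate_environment("candidate-001") as env:
        print((env / "ACCESS.txt").read_text())
    print("still exists after teardown?", env.exists())
```

The `try`/`finally` is the important design choice: environments that linger after a test cost money and risk leaking one candidate’s work to the next.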

You’ll also need to create a short briefing document for the candidate which explains the test, what you’re looking for them to do, and all the details about how to access the environment.

If you’re thinking “this sounds difficult” – it is. See the discussion about this below.

The test: there are a few different ways this can go. This can be an exercise in diagnosis: ask the candidate to look at the environment and tell you what’s wrong and maybe how to fix it. This could be in writing – good if the job requires writing ability as well – or less formally, in a meeting after they’ve had a chance to look at the environment. Or, for some tests, it makes sense to ask the candidate to fix the problem (this would work well for the database-with-missing-indexes example).

Post-test interview: if you have the candidate submit their findings in writing, the interview can be a fairly simple and short debrief. Or, it could be a longer session, where they tell you what they’ve found and you ask them about how they got there, what fixes they’d recommend, and so forth. See the other exercises in this series for some ideas of the kinds of questions to ask.

What behaviors to measure: in this test, you’re looking at two things: accuracy and communication (either written or spoken). Candidates need to correctly find, diagnose and possibly fix the issues, and you want to make sure they’re able to explain what they found/did and why.

Lightning round: three more ideas

To wrap things up, here are three more ideas, much less well-developed; these are just quick sketches that you might use as inspiration for your own tests:

  1. Find vulnerabilities in an insecure system (good for: red teams, security analysts, penetration testers, etc.). This is a variant of the “buggy system” concept above, except that in this case, the bugs are security vulnerabilities[2]. We use a version of this for some roles at Latacora: we give candidates access to an app with several vulnerabilities and ask them to find and exploit them. There are a number of different vulnerabilities they can find, and even a few that can be chained to discover a very high severity issue.
  2. Set up a server or laptop (good for: sysadmins, IT): have a candidate set up (or provision and set up) a new machine (e.g. an EC2 instance, Mac desktop) to some set of specifications you provide.
  3. Run a scan and interpret results (good for: some analyst roles; SOC staff): have the candidate run a security scan (e.g. Nessus, ScoutSuite) against an environment/app you’ve set up, and interpret the results for you.

Discussion

What these all have in common is the “black box” nature of the test: in each case, you set up a somewhat complex environment, and drop the candidate into it.

This makes these tests on the more difficult side to set up: there’s usually a fair bit of work needed to create these environments. The test we use at Latacora (“find vulnerabilities”), for example, includes an app, some infrastructure code (Terraform), and management tooling (a Slack bot) to provision and de-provision isolated environments for each candidate. It’s worth the time investment because we run work sample tests frequently (we help our clients hire, so we have a much more regular hiring pipeline than most companies our size).

Our case is typical; creating lab environments requires a fairly hefty up-front time investment. You’ll need to do quite a bit of internal testing (rule #7); it can be difficult to get these right on the first try. You should test until you get consistent results that fit within the three-hour time limit (rule #2). This could take three or four iterations, or more.

The advantage, though, is scale: once created, lab environments simulate real work very well, particularly for operational roles, and don’t require as much work for each candidate. So they make the most sense for situations where you need to hire at scale – which is often the case for the kind of operational roles we’re talking about here.

What’s next?

I’ve now covered every kind of work sample test I’ve used and can recommend! If you missed any of ’em, see the series index page.

I’ve got two more posts planned to close out this series: next, I’ll write about some styles of work sample tests that don’t work – and why they don’t work. Then, I’ll wrap up answering questions and covering any loose ends that didn’t make it in anywhere else. If you have questions about anything in the series you’d like to see addressed in that wrap-up, send me an email or tweet at me.


  1. You’ll note that I haven’t mentioned other disciplines involved in software development, like UI/UX, Design, Product Management, Marketing, Sales, and so on. This is because I don’t have much experience hiring for these areas, so it’d be dishonest of me to present myself as any sort of authority. I’d love to have some coverage of work sample tests in these areas too, though, so if you’re a hiring manager with experience using work sample tests in other areas, and would like a guest spot on my blog to talk about it, please get in touch!

  2. It’s also very similar to Capture-the-Flag (CTF) challenges, like the ones Stripe has run. Unfortunately, off-the-shelf CTFs don’t work well as work sample tests. The answers are usually public, so it’s too easy to cheat. They also usually have only a single path to the end, and I find work sample tests work better when they have multiple successful paths.