In the years of (relative) innocence
In the days before bots and click farms - or distributed survey farms - when only a handful of online surveys were going around, it was fairly straightforward to keep survey fraud at bay. Of course, there were always those respondents who were in it for the prize, randomly clicking through, eyes focused on the money at the end of it. But with a few consistency and speeding rules, most of them were unceremoniously thrown out.
Alas, things changed quite fast. They’re about to change even faster - and in far more complicated ways - with all the GenAI tools at the disposal of those hoping to make a living out of survey prize money. But let’s take things from the beginning. For over 12 years, I too have been fighting to safeguard survey data quality. This is the story of my battles and of how the mainstream quality controls failed me one by one. Innovation is, in many cases, born of necessity.
Who is your target?
Surveys typically offer incentives to encourage participation. On the downside, these incentives - especially the monetary ones - may also attract fraudulent activity. There are numerous steps you can take to cleanse your data once it comes through, and I'll share my insights on this later on. However, prioritising prevention over remedy is crucial. One of your first lines of defence lies in selecting prizes that hold little appeal for cheaters.
That’s easier said than done, of course, particularly if you’re strategically diversifying your incentives, as you should, to avoid selection bias - which is another multi-headed monster to slay.
The more niche the audience you target (i.e. the narrower its definition), the easier it is to offer audience-specific prizes that will deter anyone outside that audience from taking your survey. We were fortunate that way - at the beginning.
Barriers to entry were lowered before GenAI
The surveys we initially ran with VisionMobile (now SlashData) were targeted only at mobile developers. That was before no-code tools peaked - before the “you-don’t-need-to-be-a-developer-to-build-this” era. All we needed to do to discourage non-developers from posing as such in order to get to the prize was to offer developer-specific prizes - such as discounts on developer tooling, certifications for their career growth, and access to developer events.
That was short-lived happiness. Visual development tools, APIs, integrations and automations of all sorts flourished, substantially lowering the barriers to entry in the software development realm. The lines between who is a developer and who isn’t were blurred, particularly as we extended our reach beyond mobile and into more creative areas such as Augmented and Virtual Reality (AR/VR), where app builders were not necessarily coding-literate and were therefore referred to as ‘creators’ or ‘practitioners’ rather than developers. They too were now part of our target audience, and as a result we had to offer more widely-appealing prizes that were relevant to them.
Meanwhile, bots had started attacking our survey. The tech-savvy nature of our developer audience made it relatively easy for them to build bots and exploit them for multiple survey entries, boosting their odds of winning prizes. To fend off both bots and non-developers, we had introduced developer-specific CAPTCHAs. But targeting a broader audience meant that the CAPTCHAs had to be dropped as well.
Note that this was well before GenAI came into our lives. If you thought that we suddenly switched from ninja-programmers-only to anyone-with-ChatGPT building apps, think again. The ‘democratisation’ of software development has been a process in the making for several years now - generative AI tools only accelerated it further.
About consistency and colourful fish
I had, naturally, implemented additional defences. Even if someone wasn't a bot and fell within our target group, it didn't necessarily mean they were completely honest or paying close attention to their answers. It could simply be a case of a developer being overly enthusiastic about grabbing that tooling discount.
Should the fish be employed? Using red herrings - that is, introducing irrelevant questions or irrelevant answer options - is a classic approach. However, we were dealing with developers. Yes, red herrings might catch a few frauds, but at a high cost: the rest of our (legitimate) respondents who did notice the herring would probably call it out, most likely in developer forums or social platform groups, and ‘troll’ us about it till kingdom come. No, herrings came with a non-trivial risk, I decided, and put them back in their fish tank.
It’s also textbook questionnaire design to pose the same question twice in a different wording at two separate points in the survey to check for response consistency. Sadly, our questionnaire was quite lengthy already and repeating questions was a luxury we could not afford.
That said, I found we could check for combinations of answers to different questions that were improbable, if not impossible. For example, a respondent who identified as an inexperienced student in one question while asserting an annual income exceeding $100,000 from software development projects in another raised some eyebrows. It could, theoretically, happen, but what were the odds?
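To make this concrete, a rule of that kind boils down to a simple predicate over a response record. The sketch below is purely illustrative - the field names, values and threshold are hypothetical, not the actual rules we used.

```python
# A minimal sketch of a cross-question consistency rule.
# Field names and thresholds are hypothetical, for illustration only.

def is_improbable(response: dict) -> bool:
    """Flag combinations of answers that are unlikely to co-occur."""
    claims_student = response.get("experience") == "inexperienced student"
    claims_high_income = response.get("annual_dev_income_usd", 0) > 100_000
    # Unlikely, though not impossible - so flag for review rather than auto-reject.
    return claims_student and claims_high_income

responses = [
    {"experience": "inexperienced student", "annual_dev_income_usd": 120_000},
    {"experience": "senior engineer", "annual_dev_income_usd": 120_000},
]
flagged = [r for r in responses if is_improbable(r)]  # only the first record is flagged
```

In practice, responses that trip one or more such rules are better flagged for review than discarded outright - which is exactly the trade-off the next section is about.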
How much would you pay for sparkling clean data?
Whether I employed red herrings, consistency rules or any other cleansing rules, the question was, and always is: Where should I draw the line?
Should I throw out someone the minute they clicked on the wrong answer? There’s always the chance of honest mistakes - plus the probability of that student actually making $100,000 because they came up with a killer app. If we were to cleanse out everyone who clicked on an improbable answer, we would definitely be removing a good number of legitimate responses alongside the illegitimate ones. We would have killed the virus, but probably also the human host along with it.
Kicking out even the mildly suspicious may sound like a good approach. Better safe than sorry, right? In an ideal world, perhaps. In the reality of the business world, it all depends on your appetite for risk and how much you’re prepared to pay to minimise it.
There’s a cost attached to bringing in responses, including operational, marketing and incentive costs. Each response removed has cost you money to acquire, plus extra to replace it with another if you are to collect a sample sufficiently large to yield results of satisfactory accuracy.
In our case the cost was non-negligible. Alongside students and junior developers, we also wanted to hear from specialists, such as seasoned enterprise tech leads, architects, DevOps engineers, machine learning developers and many more. Targeting a specialised audience has its advantages - such as audience-specific prizes - but also a key disadvantage: You have to try harder, and therefore pay more, to reach your audience. Especially if you’re trying to obtain a global sample, rather than one from a handful of tech hubs. And we were determined to reach all developer personas and geographies, ensuring not just a large-enough sample, but also one representative of the hugely diverse ecosystem of software development.
Staying alive
You may be thinking that the solution was glaringly obvious: Simply transfer the cost to the client. Not quite - we needed to maintain competitive pricing.
There was also the problem of timelines. Extremely rigorous cleansing whilst still achieving a large and representative sample implied extended survey fielding times and, consequently, long project delivery timelines. That would have been unavoidable even if we had spent ridiculous amounts on outreach, not least because third-party developer panels of acceptable quality (ones that could bring in responses quickly) were in short supply. Long delivery timelines could lose us projects to the competition.
Not much appetite for risk, thank you
On the other side of the scales weighed the risk of allowing fraudulent responses through. That came with a very real business cost as well. The cost of failing to deliver.
In statistics we typically determine the error we are willing to tolerate based on the impact it will have when the conclusions of the study are acted upon. For example, in medical and pharmaceutical studies, the impact of wrong conclusions (e.g. that a drug is safe to use) can be fatal, so you want to bring error down to its absolute minimum. If research conclusions are instead used in business decision-making (e.g. which marketing activities to invest in), then actions misled by ‘dirty’ data may cost you your budget, but most likely won’t kill you. Unless you’re running a data insights business, as we did, in which case delivering the wrong insights may kill your business.
I had striven for strong data hygiene and built a research methodology that gave SlashData a competitive advantage. I would not settle for lowering our standards and compromising one of our main business differentiators.
This was clearly an optimisation problem, then. But what was the optimal solution?
Growing a tree
At the same time, I had another optimisation problem to solve. The development world was evolving, with new technologies and areas of application emerging at a dizzying speed. Our questionnaire had to keep up, ensuring that it captured developer activity across all technology stacks. That meant asking more questions across a broad range of topics. However, a longer questionnaire could lead to higher drop-out rates, which in turn implied longer survey fielding times to meet our sample target - and therefore higher outreach costs and longer project delivery times.
The solution was to optimise the survey logic, so that even though we asked enough questions to author a dozen PhDs, no respondent got to see an overly long survey or irrelevant questions. As a result, the survey logic I had designed was steadily getting more and more complex. What was initially the core and a few branches of a lean tree soon became a highly sophisticated, multi-layered structure. Goal achieved, but I was presented with a fresh challenge. Or rather, two.
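For a flavour of what such logic looks like in principle, here is a toy sketch of conditional branching. The questions, conditions and structure are invented for illustration and are far simpler than the real questionnaire.

```python
# A toy sketch of conditional survey branching (questions and conditions
# are invented for illustration - not the actual questionnaire).

QUESTIONS = {
    "dev_areas": {
        "text": "Which areas do you work in?",
        "next": lambda a: "ar_vr_tools" if "AR/VR" in a else "languages",
    },
    "ar_vr_tools": {
        "text": "Which AR/VR engines or creation tools do you use?",
        "next": lambda a: "languages",
    },
    "languages": {
        "text": "Which programming languages have you used recently?",
        "next": lambda a: None,  # end of this toy path
    },
}

def run_path(answers: dict) -> list[str]:
    """Return the sequence of question IDs a respondent would see."""
    path, current = [], "dev_areas"
    while current is not None:
        path.append(current)
        current = QUESTIONS[current]["next"](answers.get(current, ""))
    return path

# A respondent working in AR/VR sees one extra question; everyone else skips it.
print(run_path({"dev_areas": "AR/VR, games"}))  # ['dev_areas', 'ar_vr_tools', 'languages']
print(run_path({"dev_areas": "web backend"}))   # ['dev_areas', 'languages']
```

Multiply the number of branch points many times over and you get a sense of how quickly the number of possible paths explodes - which is where the next challenge comes in.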
Speedy Gonzales, where are you?
Right from the beginning I had implemented speeding checks based on typical survey completion times. With only a few survey branches and fairly simple logic, completion times varied but stayed within a reasonably small range. It was straightforward to determine the thresholds for unusually fast responses.
However, as the logic grew more complex, the number of survey paths a respondent could take increased exponentially. Some paths were much shorter than others, which meant that the range of survey completion times widened considerably. “Acceptable” survey completion time was getting hard to define, and consequently identifying speeders was becoming more and more challenging.
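One way to cope with this - sketched below purely as an illustration, not as the methodology I actually built (more on that in Part II) - is to compute speeding thresholds per survey path rather than using a single global cut-off.

```python
# A rough sketch of per-path speeding thresholds (illustrative only -
# not the actual methodology). Instead of one global cut-off, group
# completion times by the path a respondent took through the survey.

from collections import defaultdict
from statistics import median

def speeding_thresholds(records, fraction=0.4):
    """Flag anything faster than `fraction` of the median time for its path."""
    times_by_path = defaultdict(list)
    for path_id, seconds in records:
        times_by_path[path_id].append(seconds)
    return {path: fraction * median(times) for path, times in times_by_path.items()}

records = [  # (path identifier, completion time in seconds)
    ("short_path", 240), ("short_path", 300), ("short_path", 90),
    ("long_path", 900), ("long_path", 1100), ("long_path", 200),
]
thresholds = speeding_thresholds(records)
speeders = [(p, t) for p, t in records if t < thresholds[p]]
print(speeders)  # [('short_path', 90), ('long_path', 200)]
```

An approach along these lines, of course, needs per-response metadata - which path each respondent took and how long it took them - and that is exactly where the next problem surfaced.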
I obviously needed to update the cleansing methodology. But did we have the data needed for that?
Breaking and making a survey platform
Long before the survey logic became the multi-layered structure that it is today, the off-the-shelf survey platform we were using failed us. It simply could not handle all the complicated logical rules I had come up with - or the question features I required.
I also needed metadata to update the speeding-check methodology, and the platform we used didn’t provide it. Clearly, questionnaire design and cleansing requirements had surpassed the capabilities of the tool.
Moreover, running surveys was our core business, and therefore it made sense to invest in in-house infrastructure to support it. I provided our tech team with specifications, and we embarked on the fun journey of building our own survey tool.
Finally, the sky was the limit
Having all the metadata I could think of at my disposal enabled me to make the cleansing methodology as innovative as I wanted it to be - or rather, as innovative as it needed to be. It also meant that I could quickly adjust it, asking for more metadata as needed to keep up with new types of cheaters - and all the new methods and technologies they used to cheat with.
The resulting cleansing methodology I gradually built over the years is quite sophisticated, so I won’t go into too much detail. But, in Part II of this post, I will give you the high-level idea and tell you how the above challenges were addressed. Stay tuned!