How to prevent fraud in online data collection
Scammers, beware! Cornell researchers created a multi-step protocol to detect and remove fake data submitted by bots and by humans attempting to enroll in online research studies, thus preventing biased results and unwarranted monetary compensation to bad actors. The protocol is the first designed specifically for data collected in rural communities, which existing filtering protocols are not built to handle.
Researchers developed the protocol when a health improvement research study was forced to move online during the pandemic. The study, conducted in collaboration with Texas A&M AgriLife Research, was designed to encourage adults to change their health habits and make their community environments more supportive of healthful behaviors; it relied on volunteers to share their body weight, waist circumference, and other health information annually for four years.
“When the study moved online, we became much more reliant on online recruitment and data collection techniques,” said Dr. Karla Hanson, professor of practice in the Department of Public & Ecosystem Health, and first author of the study, published Nov 9 in Methods and Protocols.
Things went awry when the researchers noticed a sudden, large uptick in enrollment attempts. “It went from just a few individuals per day, to hundreds overnight,” Hanson said. “It’s implausible in a small rural town that several hundred people would enroll in our study all in one night.”
To combat the issue, the researchers first removed any enrollment attempts that came from IP addresses outside the geographic study area, which filtered out 25% of the enrollment attempts. However, this and other traditional automated techniques to remove fraudulent entries were insufficient for this study setting. “We knew basic techniques, but none of them focus on rural areas specifically,” Hanson said. “Some needed to be adapted to our population.”
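The paper withholds implementation details, but this first automated step amounts to a geographic allow-list on IP addresses. The Python sketch below is an illustration rather than the team's code: it assumes an offline GeoIP database (here the maxminddb package with a GeoLite2 city file) and a made-up study-area definition.

import maxminddb

# Hypothetical study-area definition; the protocol's real geographic
# boundaries are not published in the article.
STUDY_AREA_REGIONS = {"Texas"}

def in_study_area(ip, reader):
    """Return True if the IP address geolocates inside the study area."""
    record = reader.get(ip)
    if record is None:
        return False  # unknown location: treat as outside the area
    subdivisions = record.get("subdivisions") or [{}]
    region = subdivisions[0].get("names", {}).get("en")
    return region in STUDY_AREA_REGIONS

# Keep only enrollment attempts whose IP falls inside the study area.
# The IPs here are documentation examples, purely illustrative.
with maxminddb.open_database("GeoLite2-City.mmdb") as reader:
    attempts = [{"id": 1, "ip": "203.0.113.7"}, {"id": 2, "ip": "198.51.100.4"}]
    kept = [a for a in attempts if in_study_area(a["ip"], reader)]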
For example, another classic filtering tool limits enrollment to one person per IP address. But in rural settings where internet access is limited, many people in a household may share the same computer or use a public computer at a library, Hanson said. “To have a representative sample that was economically diverse, we needed to adapt that limitation.”
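One way to express that adaptation, again illustrative rather than the study's actual rule: repeat IP addresses are flagged for human review instead of being rejected outright, so a shared household or library computer does not automatically disqualify real participants.

from collections import Counter

def flag_shared_ips(attempts):
    # Count how many enrollment attempts arrived from each IP address.
    counts = Counter(a["ip"] for a in attempts)
    for a in attempts:
        # Shared IPs become review candidates, not automatic rejections.
        a["needs_review"] = counts[a["ip"]] > 1
    return attempts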
After using automated tools, Hanson and colleagues turned to manual techniques, checking all submitted addresses against a postal database. “It was very time consuming and expensive to do all these active validation tests,” Hanson said. “And at each step, we found more fraudulent enrollment.”
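The article does not name the team's postal data source, so the sketch below stands in a hard-coded set for whatever address-validation database is available; the address shown is invented. Enrollments that fail the check are queued for follow-up rather than silently deleted.

# Stand-in for a postal database; a real protocol would query an
# address-validation service instead of this hard-coded set.
KNOWN_ADDRESSES = {"12 main st, smalltown, tx 75000"}

def postal_lookup(address):
    """Placeholder check: is the address deliverable per the postal data?"""
    return address.strip().lower() in KNOWN_ADDRESSES

def screen_addresses(enrollments):
    # Collect enrollments whose address fails validation for manual follow-up.
    return [e for e in enrollments if not postal_lookup(e["address"])]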
The monetary compensation offered to study participants attracted bots and led real people to try to enroll multiple times using fake identities. “When we called, sometimes people had no knowledge of the study, so they were considered fraudulent attempts and they were excluded from the study,” Hanson said. “In some cases, the phone number did not even exist.”
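One inexpensive pre-screen before staff pick up the phone, sketched here with the open-source phonenumbers package (an assumption; the article does not say what the team used): malformed or impossible numbers can be rejected automatically, though only an actual call confirms that a number is in service and belongs to a real participant.

import phonenumbers

def plausible_us_number(raw):
    """Return True if the string parses as a valid US phone number."""
    try:
        num = phonenumbers.parse(raw, "US")
    except phonenumbers.NumberParseException:
        return False
    # Validity here means the number is possible for the region,
    # not that anyone actually answers it.
    return phonenumbers.is_valid_number(num)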
Ultimately, they found that 74% of the attempts were fraudulent. They also discovered that some screening criteria could be overzealous and exclude real participants. For example, some people who seemed to be legitimate participants reported weights that differed by a hundred pounds between the first and second years of the study. In those cases, the team verified the data over the phone. “There is some caution to have when labelling a participant as fraudulent; some people do really lose a lot of weight,” Hanson said. “There are also people who typed their weight wrong, and we wanted to have a conversation with those participants and understand what was going on.”
Similarly, some real participants entered a different date of birth in consecutive years. The team found that over 40 of these cases were real participants, some of whom provided a fake date of birth out of concern about identity theft. “We didn’t trust people, but forgot that they, too, were suspicious of us,” Hanson said.
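Taken together, these last two checks are consistency tests across annual submissions that flag records for a phone conversation rather than exclude them outright. The sketch below uses an assumed hundred-pound threshold drawn from the article's example; the field names and threshold are illustrative, not the study's published criteria.

WEIGHT_FLAG_LBS = 100  # assumed threshold, based on the article's example

def review_flags(year1, year2):
    """Flag year-over-year inconsistencies for human follow-up."""
    flags = []
    if abs(year1["weight_lbs"] - year2["weight_lbs"]) >= WEIGHT_FLAG_LBS:
        flags.append("large weight change: verify by phone")
    if year1["date_of_birth"] != year2["date_of_birth"]:
        flags.append("date of birth mismatch: verify by phone")
    return flags

# Example: a big weight change is flagged for a call, not auto-excluded.
print(review_flags(
    {"weight_lbs": 310, "date_of_birth": "1980-05-02"},
    {"weight_lbs": 205, "date_of_birth": "1980-05-02"},
))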
While the published paper makes the multi-step protocol accessible to other researchers, it also risks teaching AI-driven bots those same screening techniques, helping them trick future fraud-detection systems. For this reason, the paper’s authors describe categories of filtering techniques, but not the exact details of each approach. “There will always be this ongoing race to keep ahead of the bots,” Hanson said.
Nevertheless, Hanson believes the benefit of sharing these tools with other researchers outweighs the cost of releasing their findings publicly. Ultimately, Hanson said, while automated techniques are useful in reducing the time spent actively reviewing enrollment data, they are insufficient. “We need the human-to-human interaction with participants to ever be sure.”
Written by Elodie Smith