Last week, Amazon filed its first ever lawsuit over review fraud. It alleges an ‘unhealthy ecosystem’ has developed to inflate the ratings of certain products on the Amazon sales platform.
In this blog post, we’ll look at review fraud and how sellers and platforms can use graph visualization to clamp down on it.
What is review fraud?
Countless reviews are posted to the web every day. Sites like TripAdvisor, eBay, Yelp, Foursquare and Amazon own huge volumes of user-generated review data that sits at the heart of their sales platforms. When used properly, this content is a useful tool – reassuring customers of the product or service’s quality (or, if the reviews are bad, warning them of the opposite).
Review fraud is when individuals or organizations manipulate that user-generated content to their own advantage – creating false reviews to misrepresent their business or competitors.
It’s illegal – lying to customers to drive sales – and a huge headache for users, for businesses and for the websites being used for the attacks.
For the websites, review data is their future profit, driving organic web traffic and sales conversions. Review fraud erodes customer trust and damages the integrity of the data. Websites cannot monetize their content if the consumers don’t trust its accuracy or validity.
For the companies being reviewed, there’s a risk of reputation damage and lost revenue. Review fraud paints an inaccurate picture, turning customers away from potentially good businesses and into the hands of less scrupulous suppliers.
As for the users, they are simply left not knowing who or what to believe.
Who commits review fraud?
There are three groups of people who commit review fraud:
- Business owners
- Disgruntled customers
- Black hat ‘reputation managers’
The third group use a mixture of brute force review fraud methods – systematically submitting reviews knowing that a few may slip through the anti-fraud processes – and more subtle approaches, like paying existing members to submit reviews from their own accounts.
Understanding review fraud data
Detecting fraud is a matter of understanding patterns in connections – in this case, connections between people, devices, locations and reviews.
A key difference between review fraud and financial fraud is that review websites don’t always ask for verifiable information, like an address or credit card number. This increases the number of reviews submitted, but makes it impossible to crosscheck reviews against a credible watch list.
Instead, fraud investigators rely on device data, location data and behavioral patterns, including:
- Review text
- Review submission velocity
- Device fingerprints
- Profile data
- Geo-location data
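These signals can be collected into a per-review feature record and turned into simple measures like submission velocity. A minimal sketch in Python, assuming a simplified schema – the field names and the `submission_velocity` helper are invented for illustration, not a real platform API:

```python
# Hypothetical feature record for one review. The fields mirror the signal
# list above, but this is an invented schema, not a real platform's data model.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ReviewSignals:
    text: str                # review text, for linguistic analysis
    submitted_at: datetime   # timestamp, feeds the velocity measure below
    device_fingerprint: str  # hash identifying the submitting device
    account_age_days: int    # profile data: how old is the account?
    geo: tuple               # (latitude, longitude) of submission

def submission_velocity(timestamps):
    """Reviews per hour across a batch of submission timestamps."""
    if len(timestamps) < 2:
        return 0.0
    span_hours = (max(timestamps) - min(timestamps)).total_seconds() / 3600
    return len(timestamps) / span_hours if span_hours else float("inf")
```

An account that posts dozens of reviews within a single hour would score an unusually high velocity – exactly the kind of outlier behavior discussed next.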
Identifying fraudulent behavior
To find any kind of fraud, including review fraud, we need to do a few things:
- Identify different patterns of user behavior
- Categorize ‘normal’ behavior and ‘outlier’ behavior
- Define which outlier behaviors indicate higher probability of fraud
Investigators use an algorithmic approach to assign each piece of user-generated content a fraud likelihood score. Indicators could include:
- Creating a new account with a device that has already been used to access other accounts.
- Creating an account, leaving a single (very high or low) review, never returning.
- Reviewing a collection of businesses in one small area (e.g. all Italian restaurants in Cambridge), leaving a single excellent review for one and a series of 1* reviews for the rest.
High-scoring content is automatically blocked, low-scoring content is allowed, and borderline content is manually reviewed using graph visualization tools built into the content management platform.
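As a rough sketch of that triage flow, a scorer might combine weighted indicators and route each review to block, allow or manual review. The indicator names, weights and thresholds below are assumptions for illustration, not a real scoring model:

```python
# Hypothetical fraud-likelihood scorer. A real system would learn weights
# from labelled data rather than hard-code them like this.
INDICATOR_WEIGHTS = {
    "shared_device": 0.4,         # device already tied to other accounts
    "single_hit_account": 0.3,    # one extreme review, then silence
    "targets_local_rivals": 0.3,  # 5* for one business, 1* for its neighbors
}

def fraud_score(review):
    """Sum the weights of whichever indicators fired for this review."""
    return sum(w for name, w in INDICATOR_WEIGHTS.items() if review.get(name))

def triage(review, block_at=0.7, allow_below=0.3):
    """Route content: block, allow, or send to manual graph review."""
    score = fraud_score(review)
    if score >= block_at:
        return "block"
    if score < allow_below:
        return "allow"
    return "manual_review"  # borderline: inspect in a graph visualization tool
```

The interesting cases are the borderline ones – scores in the middle band are exactly the reviews an analyst would open in a graph visualization tool.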
There are plenty of different behavior patterns that could indicate review fraud. These will evolve over time as new techniques are developed.
Visualizing review fraud
In our example, our review data has three entities: the business reviewed (building icon), the IP address used (computer icon), and the device used (@ symbol icon). Reviews flagged as suspicious use a heavy red link instead of the default blue. Reviews previously removed as fraudulent show as ghosted red ‘X’ nodes.
One IP address has been used to submit seven reviews about a single business, using four different devices. Three reviews have already been removed as fake.
The timing and shared IP address of the remaining four means they are also likely to be false. If we expand outwards on one of the deleted reviews, we see more clues of a possible attempt to manipulate ratings:
This time, one device has been used to submit eight zero-star reviews about a single business, using five different IP addresses (or, more likely, a set of proxy IP addresses).
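The shared-source pattern in these examples can be surfaced with a simple grouping pass before anything is drawn: count how many reviews of each business arrived from each source. A sketch, assuming reviews are plain dicts with `business` and `ip` keys – an invented shape for illustration:

```python
# Group reviews by (business, source IP) and flag heavy reuse, like the
# seven-reviews-from-one-IP cluster described above. Threshold is arbitrary.
from collections import defaultdict

def flag_shared_sources(reviews, min_reviews=3):
    """Return {(business, ip): count} for pairs with suspicious reuse."""
    counts = defaultdict(int)
    for review in reviews:
        counts[(review["business"], review["ip"])] += 1
    return {pair: n for pair, n in counts.items() if n >= min_reviews}
```

The same grouping keyed on device fingerprints instead of IP addresses would catch the one-device, five-IP variant.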
This visualization approach provides a fast and intuitive way to digest large amounts of data, improving the quality and speed of decision-making.
There are many different ways to model review data, depending on the insight you need to uncover. Below we have simply shown three elements of the data:
- The reviewers’ accounts (person nodes)
- The businesses being reviewed (building nodes)
- The review rating (green-to-red links)
Again, patterns instantly stand out – including the incredibly positive reviewer in the bottom left who has left dozens of 5-star reviews for many different establishments. Could they be part of an ‘astroturfing’ network? Looking at the timing of the reviews and the locations of the businesses being reviewed would give some good insight.
Also of interest is a cluster in the middle:
We need to question why one business has received multiple 1-star reviews from accounts that do not seem to have any other activity – a behavior we have identified as potentially indicating fraud.
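That cluster corresponds to a checkable pattern: several accounts whose only activity is a single 1-star review of the same business. A sketch, again assuming a simple dict shape for reviews (invented for illustration):

```python
# Find businesses hit by several 'one-shot' accounts: accounts with exactly
# one review on record, and that review is a 1-star rating.
from collections import Counter, defaultdict

def one_shot_attackers(reviews, target_rating=1, min_accounts=3):
    """Return businesses hit by >= min_accounts single-review accounts."""
    per_account = defaultdict(list)
    for review in reviews:
        per_account[review["account"]].append(review)
    hits = Counter(
        rs[0]["business"]
        for rs in per_account.values()
        if len(rs) == 1 and rs[0]["rating"] == target_rating
    )
    return {business for business, n in hits.items() if n >= min_accounts}
```

A business flagged this way isn’t proof of an attack on its own, but it is exactly the kind of cluster worth expanding in the visualization.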
These are just two possible ways of modeling and visualizing the review fraud data. Each approach will highlight different aspects and behaviors.
More about our graph visualization toolkits
At Cambridge Intelligence, we help organizations visualize and understand the connections in their data. From fraud detection to cybersecurity and law enforcement, every day thousands of data analysts use tools built with our toolkits to uncover threats and risks that would otherwise go undetected.