Thursday, April 28, 2016

Detecting Markov Bots on Reddit, my lazy way

I've been getting annoyed at the number of comment bots on Reddit that are clearly just a program running a Markov chain. If you're not familiar with Markov chains, the basic explanation is that the bot looks at all the comments, makes a list of which words came directly after which other words, and then walks that list to build a string of words where each adjacent pair appeared in that order somewhere in the real comments. There are variants that use longer runs of text, and ones that could draw on more than just the comments on one post, but the Reddit bots don't seem to be those (Wikipedia has an article with a far better explanation of Markov chains). Sometimes this makes a decent-looking comment, but other times it just makes gibberish. There's a whole subreddit, /r/subredditsimulator, where the only posters allowed are Markov chain bots trained on other subreddits. That's the only place I'd like to see them.
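To make that explanation concrete, here's a rough sketch of a word-pair Markov chain in Python. This is only my guess at the simplest version of what these bots do, not code taken from any actual bot. The names `build_chain` and `generate` and the sample comments are made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(comments):
    """Map each word to the list of words seen directly after it."""
    chain = defaultdict(list)
    for comment in comments:
        words = comment.split()
        for first, second in zip(words, words[1:]):
            chain[first].append(second)
    return chain


def generate(chain, start, max_words=20, rng=random):
    """Walk the chain from `start`, picking a random follower at each step."""
    words = [start]
    # Stop at max_words, or when the last word never had a follower.
    while len(words) < max_words and chain.get(words[-1]):
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


if __name__ == "__main__":
    # Made-up comments, just to show the mechanics.
    comments = [
        "the cat sat on the mat",
        "the dog sat on the couch",
    ]
    print(generate(build_chain(comments), "the"))
```

The output is random, but every pair of neighboring words in it comes straight out of one of the input comments, which is why it can look on topic and still be nonsense.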

I'm going to preface this post by saying that I didn't look around to see if this had been done already. It was a quick spur-of-the-moment project I did between homework assignments, and I just wanted to share it. I'm also definitely not a Python expert, so I probably did some things that look silly to seasoned pros.

I noticed a few patterns with these bots. They frequent larger subs (so they have more comments to base their own on), they only use comments from the post they're replying in (to look on topic), and they typically share an account with a second bot that rips comments off of Imgur (there used to be a Reddit account with a bot running to call out stolen Imgur comments, but I haven't seen it in a while).

[Image: example bot comment made by mashing up two existing comments]

So, I decided to write up a quick script to scan for Markov chain bots.

My original plan was to generate my own Markov chain and see if each comment was possible to make with it (while leaving the comment itself out of the chain, so it couldn't simply match its own words). Then I realized there was a lazier way: every pair of adjacent words in a bot's comment also appears in someone else's comment. So, loop through all the comments, compare every pair of adjacent words to every pair of adjacent words in every other comment, and flag the comments where every pair has a match.
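Boiled down, the pair check looks something like the sketch below. It's a stripped-down illustration that works on plain `(id, text)` tuples instead of talking to Reddit, so it's not the full script. The names `word_pairs` and `flag_suspects` are made up for this example, and lowercasing the words and skipping very short comments (`min_pairs`) are my own assumptions to keep one-liners from being flagged.

```python
def word_pairs(text):
    """Return the set of adjacent (word, next_word) pairs in a comment."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def flag_suspects(comments, min_pairs=3):
    """comments is a list of (comment_id, text) tuples.

    Flags a comment when every one of its word pairs also shows up
    in some other comment. min_pairs is an assumed cutoff so that
    very short comments aren't flagged by accident.
    """
    pairs = [(cid, word_pairs(text)) for cid, text in comments]
    flagged = []
    for i, (cid, own) in enumerate(pairs):
        if len(own) < min_pairs:
            continue
        # Pool the pairs from every comment except this one.
        others = set()
        for j, (_, other) in enumerate(pairs):
            if j != i:
                others |= other
        if own <= others:  # every pair has a match somewhere else
            flagged.append(cid)
    return flagged


if __name__ == "__main__":
    # Made-up comments: "c" is stitched together from "a" and "b".
    comments = [
        ("a", "I think the new policy is great for students"),
        ("b", "honestly the whole thing is a mess for everyone involved"),
        ("c", "I think the whole thing is great for everyone involved"),
    ]
    print(flag_suspects(comments))
```

With those three comments, only "c" gets flagged, since it's the only one where every pair of neighboring words can be found in the other two.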

It's not perfect and it's not optimized, but it works. Run it, input a Reddit post ID, and it'll look through the comments for you. The only issue I've found is false positives when a lot of people post very similar (or identical) comments, although sometimes what I take for a false positive turns out to be a bot that got lucky. There could be false negatives too, but those are going to be difficult to find. There's also a minor issue that Reddit's API is rate-limited to one request every 2 seconds (so long comment sections take a while to load), but that's not something you can really work around.

Here's the code (you'll need Python 2.7 and praw installed):

Here's an example of it running:

Some of these look a lot like false positives, but if you go to that thread (/r/SandersForPresident was chosen because it has a lot of these bots in it, not due to politics), you can look up parts of those comments and see how they came together.

An improvement I've thought of that might help with false positives is to only compare a comment with comments that were posted before it. The bots never edit their comments, so their original comment is going to be based on what was already there at the time they commented.
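I haven't built this into the script, but here's a rough sketch of how that idea could work. It assumes each comment comes as an `(id, created, text)` tuple where `created` is a posting timestamp. That layout and the names `word_pairs` and `flag_with_history` are made up for the example, and the `min_pairs` cutoff is an assumption to skip very short comments.

```python
def word_pairs(text):
    """Return the set of adjacent (word, next_word) pairs in a comment."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def flag_with_history(comments, min_pairs=3):
    """comments is a list of (comment_id, created, text) tuples.

    Walks the comments oldest first and flags one only if every word
    pair in it was already present in an earlier comment.
    """
    seen = set()  # all pairs from comments posted so far
    flagged = []
    for cid, created, text in sorted(comments, key=lambda c: c[1]):
        own = word_pairs(text)
        if len(own) >= min_pairs and own <= seen:
            flagged.append(cid)
        seen |= own
    return flagged


if __name__ == "__main__":
    # Made-up comments and timestamps; "c" was posted last.
    comments = [
        ("a", 100, "I think the new policy is great for students"),
        ("b", 200, "honestly the whole thing is a mess for everyone involved"),
        ("c", 300, "I think the whole thing is great for everyone involved"),
    ]
    print(flag_with_history(comments))
```

A side effect is that it only needs one pass over the comments, since each comment is checked against a running pool of earlier pairs instead of against every other comment.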
