William Lyon Software, technology, startups, etc...

Twizzard, A Tweet Recommender System Using Neo4j

Twizzard mobile screenshot

I spent this past weekend hunkered down in the basement of the local Elks club, working on a project for a hackathon. The project was a tweet ranking web application. The idea was to build a web app that lets users log in with their Twitter account and view a modified version of their Twitter timeline, with tweets ranked by importance. Spending hours every day scrolling through your timeline to keep up with what’s happening in your Twitter network? No more, with Twizzard!

System structure

Here’s a diagram of the system:

Twizzard system diagram

  • Node.js web application (using Express framework)
  • MongoDB database for storing basic user data
  • Integration with Twitter API, allowing for Twitter authentication
  • Python script for fetching Twitter data from Twitter API
  • Neo4j graph database for storing Twitter network data
  • Neo4j unmanaged server extension, providing additional REST endpoint for querying / retrieving ranked timelines per user

Hosted platforms

Being able to spin up hosted instances of our tech stack simplified the process of putting this project together.

  • GitHub - It goes without saying that we used GitHub for version control, making collaboration with my teammate who handled all the design work quite pleasant.
  • Heroku - Heroku’s PaaS for web hosting is great. Free tier is perfect for getting small projects going.
  • MongoLab - This was my first time using MongoLab’s hosted MongoDB service. No complaints, worked great! Also free tier was perfect for getting started.
  • GrapheneDB - GrapheneDB provides hosted Neo4j instances. These guys are awesome! Their service is rock solid. I can’t overstate how impressed I am with what they provide (they even allow for running custom server extensions!)

Getting started

I stumbled across this Node.js hackathon starter template a few months ago and decided to give it a try for the first time this weekend. It puts together a stack that I’m familiar with: Node.js, MongoDB, Express, Passport for handling OAuth, Bootstrap and Jade. It’s a great starter template for, as the name implies, starting hackathon projects.

Graph data model

Since this project deals with Twitter data, modeling that data as a graph seems intuitive. We’re concerned with users, their tweets, and the interactions between users. The data model is pretty simple:

Twizzard graph data model

Inserting Twitter Data With py2neo

Once a user authenticates to our web application and grants us permission to access their Twitter data, we need to access the Twitter API and store the data in Neo4j. We accomplish this with the help of the Python py2neo package.
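
As a rough sketch of the shape of this step (not the script we actually ran), here is how one tweet from the Twitter API could be flattened into parameters for a Cypher MERGE statement. The `User` and `Tweet` labels, the `POSTED` relationship type and the helper name `tweet_to_params` are assumptions made for illustration. The call that sends the query to Neo4j is left out because it depends on the driver and version in use.

```python
# Hypothetical labels and relationship type, chosen for this sketch only.
MERGE_TWEET = (
    "MERGE (u:User {id: $user_id}) "
    "MERGE (t:Tweet {id: $tweet_id}) "
    "SET t.text = $text, t.created_at = $created_at "
    "MERGE (u)-[:POSTED]->(t)"
)


def tweet_to_params(tweet):
    """Flatten one tweet dict from the Twitter API into Cypher parameters."""
    return {
        "user_id": tweet["user"]["id"],
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "created_at": tweet["created_at"],
    }


example = {
    "id": 1,
    "text": "hello graph",
    "created_at": "Mon Sep 29 12:00:00 +0000 2014",
    "user": {"id": 7},
}
params = tweet_to_params(example)
```

Using MERGE rather than CREATE means running the script twice for the same user should not duplicate nodes.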

Ranking tweets

How can we score Tweets to show users their most important Tweets? Users are more likely to be interested in tweets from users they are more similar to and from users they interact with the most. We can calculate metrics to represent these relationships between users, adding an inverse time decay function to ensure that the content at the top of their timeline stays fresh.

Jaccard similarity index

The Jaccard index allows us to measure similarity between a pair of users. For our purposes this is defined as the size of the intersection of their sets of followers divided by the size of the union of their sets of followers. This results in a score between 0 and 1 representing how “similar” the two users are to each other.

$ J(A,B) = \frac{|A \cap B|}{|A \cup B|} $
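
To make the definition concrete, here is a minimal Python sketch of the same calculation on two follower sets. The function name `jaccard` and the choice to return 0 when both sets are empty are assumptions for this example, not part of the Twizzard code.

```python
def jaccard(followers_a, followers_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(followers_a), set(followers_b)
    union = a | b
    if not union:
        # Assumption: two users with no followers get a similarity of 0.
        return 0.0
    return len(a & b) / len(union)


# Two users sharing 2 of 4 distinct followers.
score = jaccard({"ann", "bob", "cat"}, {"bob", "cat", "dan"})
```

Identical follower sets score 1 and disjoint sets score 0, which matches the range described above.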

We calculate this for all users in our database with a Neo4j Cypher query.

Interaction metric

Another important factor to take into account is how often Twitter users are interacting with each other. Users are more likely to be interested in tweets from users they interact with often. To quantify the strength of this relationship for a user pair A,B, we simply divide the number of A,B Twitter interactions by the total number of interactions for user A.
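
A small Python sketch of that ratio, assuming we already have a list of the users that A interacted with, one entry per interaction. The function name `interaction_scores` and the input shape are assumptions for illustration.

```python
from collections import Counter


def interaction_scores(targets):
    """Map each user B to (A,B interactions) / (all interactions by A)."""
    counts = Counter(targets)
    total = sum(counts.values())
    # An empty input yields an empty dict, so there is no division by zero.
    return {user: n / total for user, n in counts.items()}


# User A interacted with "bob" three times and "cat" once.
scores = interaction_scores(["bob", "cat", "bob", "bob"])
```

Because every score is a share of A's total, the scores for one user sum to 1.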

Weighted average

The similarity score and interaction score are combined using a weighted average. We weight similarity slightly higher than interaction.
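
In Python the combination could look like the sketch below. The weights 0.6 and 0.4 are placeholders picked only to show similarity weighted slightly higher. They are not the values used in Twizzard.

```python
SIMILARITY_WEIGHT = 0.6   # hypothetical value
INTERACTION_WEIGHT = 0.4  # hypothetical value


def weighted_user_score(similarity, interaction,
                        w_sim=SIMILARITY_WEIGHT, w_int=INTERACTION_WEIGHT):
    """Weighted average of the two per-user-pair scores."""
    return (w_sim * similarity + w_int * interaction) / (w_sim + w_int)


combined = weighted_user_score(similarity=0.5, interaction=0.75)
```

Dividing by the sum of the weights keeps the result between 0 and 1 even if the weights are later tuned to values that do not add up to 1.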

Time decay

To ensure temporal relevance, an inverse time decay function is used to discount tweets according to the amount of time elapsed since the tweet was sent:

$ TweetScore = \frac{WeightedUserScore}{elapsedTime^2} $
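
Here is a sketch of the decay and the resulting ordering in Python. The formula does not fix a time unit, so hours are an assumption here, as is the one-hour floor that keeps a brand-new tweet from dividing by zero. The dictionary keys are also made up for the example.

```python
def tweet_score(user_score, elapsed_hours, min_elapsed=1.0):
    """WeightedUserScore / elapsedTime^2, with a floor on elapsed time."""
    elapsed = max(elapsed_hours, min_elapsed)  # assumption: clamp to 1 hour
    return user_score / elapsed ** 2


def rank(tweets):
    """Return tweets sorted from highest to lowest score."""
    return sorted(
        tweets,
        key=lambda t: tweet_score(t["user_score"], t["elapsed_hours"]),
        reverse=True,
    )


timeline = rank([
    {"id": "old", "user_score": 0.9, "elapsed_hours": 10},
    {"id": "new", "user_score": 0.3, "elapsed_hours": 1},
])
```

The quadratic denominator is steep: in this example a ten-hour-old tweet from a highly ranked user falls below a fresh tweet from a much less similar one.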

By storing these data and relationships in our Neo4j instance, we simply need to query the database for the highest ranked tweets from our web application and display these to the user.

Extending the Neo4j REST interface

Neo4j Server provides a REST interface for querying the database. Neo4j also supports unmanaged server extensions, which let us extend the built-in REST API with our own endpoints. We can do this using JAX-RS in Java. In this case, we write a simple server extension that adds the endpoint /v1/timeline/{user_id}, which executes a Cypher query to return the tweets for the specified user’s timeline, ordered by the ranking metric we’ve defined above.
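
From the client side, calling that endpoint might look like the Python sketch below. The base URL (which would include wherever the extension is mounted), the helper names, and the assumption that the response body is JSON are all illustrative. Only the /v1/timeline/{user_id} path comes from our extension.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def timeline_url(base_url, user_id):
    """Build the URL of the ranked-timeline endpoint for one user."""
    return "{}/v1/timeline/{}".format(
        base_url.rstrip("/"), quote(str(user_id), safe="")
    )


def fetch_timeline(base_url, user_id):
    """GET the ranked timeline; assumes the endpoint returns JSON."""
    with urlopen(timeline_url(base_url, user_id)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical local server address, used only to show the URL shape.
url = timeline_url("http://localhost:7474/", 42)
```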

Twizzard

We had the site up and running by the time final presentations were scheduled. In fact, we even had time for a few iterations based on user feedback. We called our web app Twizzard, as in Your Twitter Wizard (but with two z’s, like a Tweet Blizzard). It’s running now at twizzardapp.com. Sign in with your Twitter account and check it out.

Twizzard screenshot

Shout out to my Twizzard teammates @kevshoe and @VisionaryG. It was great fun working with you guys at Startup Weekend Missoula 2014!