Archive for December, 2011

Using user tracking to figure out your users’ interests based on the content they like

Tuesday, December 13th, 2011

This post will probably be long, and it describes everything I did in my final project together with Berlingske Media. But don’t worry, I will try to give you something nice to look at while you are reading.


Introduction

Berlingske Media is a big media company, so it has a lot of different products, and different parts of the company have different ideas and projects. Offline they sell newspapers, and online they supply people with “news stories”. What their offline and online products have in common is that they sell advertising, either for their own webshops or for third-party advertisers.

However, like all newspaper companies they are feeling the changing market. They don’t sell as many newspapers as they did 10 years ago, so they want to maximize profits. People still want to read news; they just don’t want to buy a newspaper, and they want the news for free on their electronic gadgets.

As a result, Berlingske Media needs to keep advertising on their pages to bump up profits. They also need to keep the number of page views up in order to show more advertisements.

This explains things rather well :) :

Formula for making money (no pun intended ;)

But how on earth are you going to increase page views?

This was our big challenge, and we couldn’t touch any existing platforms. Berlingske Media runs most of its high-traffic websites on a single multisite Drupal installation named BOND (Berlingske ON Drupal). This system is business-critical, and not something we wanted to break or slow down.

Berlingske Media has A LOT of data across its sites: everything from articles about celebrities, bicycling team pages and interviews with bands nobody has heard of, to a lot of localized articles that only locals care about.

Our idea was to make a system that could learn from the content and group it into categories. The system should also be able to track the user across all the websites and learn about his or her habits and interests. We would also like to know the geographic location of our user.

By combining our knowledge of a specific user’s interests with all the content in the library and the user’s current location, we could match him or her with very specific content to his or her liking.

We were thinking of illustrating the user-related tags like this:

(Bigger font-size, more relevant)

Automatic tagging of content... you asked for it!

We were unable to tamper with the big backend system, so we had to figure out a way to take a URL and output related tags.

Oh.. http://www.sporten.dk/fodbold/kvist-og-eriksen-i-duel-om-ny-pris I know him, this relates to football, eriksen, sport

The reason we couldn’t just dive into the backend and fetch the tags for an article is that some articles are very sparsely tagged, while others aren’t tagged at all. So we needed a method to extract keywords from all types of articles on all BOND-enabled sites.

Our first approach was primitive and naive. You notice how written text can contain a lot of fill-words? Like: this, is, then, into, the, a etc.
What if you strip out all the fill-words? Then you would have keywords about the context, right? Absofuckinglutely not! This is one of our results:

Number of occurrences Word
443393 i
319530 og
315928 at
275963 det
264927 er
226317
223646 til
204339 en
167678 for
167059 har
166912 med
142895 den
138687 der
133971 ikke
127229 han
126021 af
125333 jeg
124466 de
18558 vm

This clearly shows that you can get very wrong keywords when journalists hit the wrong keys or spell the same word in different ways. Also, things like names are very difficult to take into account.
Morten Eriksen and Morten Hansen are two different people, and your parser should be able to see them as two distinct persons.
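
The naive approach described above can be sketched roughly like this. It is a Python illustration, not the PHP code from the project; the function name `keyword_counts` and the tiny stop-word list (borrowed from the top of the table above) are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete list of Danish fill-words,
# taken from the most frequent words in the table above.
STOP_WORDS = {
    "i", "og", "at", "det", "er", "til", "en", "for", "har",
    "med", "den", "der", "ikke", "han", "af", "jeg", "de",
}


def keyword_counts(text):
    """Count every word in the text that is not a known fill-word."""
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


if __name__ == "__main__":
    sample = "Kvist og Eriksen er i duel om ny pris. Eriksen har scoret."
    for word, count in keyword_counts(sample).most_common(3):
        print(count, word)
```

A counter like this treats every misspelling as a new word and splits a name such as "Morten Eriksen" into two unrelated tokens, which is exactly where the approach fell apart for us.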

Our end-solution

We decided to write the application in PHP, initially using MySQL as the database. I say initially because we weren’t exactly sure that it could handle the load of all the users.
It could have been interesting to write it in Node.js or Python, but because the sysadmins at Berlingske Media are used to handling PHP, we chose that for our solution.

We split the project into two parts:
Tracking-part: Responsible for tracking users across websites and tagging them appropriately
Content-part: Responsible for telling the website about its visitor, and suggesting content based on our knowledge

I will now describe the two parts in detail, and try to explain what they do.

Tracking-part

The tracking system is responsible for tracking the user by identifying a cookie, or setting a new one. To be able to track the user across websites we use a “web beacon”: a PHP script that returns a 1×1 pixel image.
Let’s call our script image.php and place it in the root of trackcloud.dk. On every site that we need to track, we place this HTML code:

<img src="http://trackcloud.dk/image.php" />

Every time a user visits one of our logged sites, the browser renders the page, requests our “image” and sends us nice headers that we can use, and we send a delicious cookie back to the browser.

We also take the user’s IP address and try to look it up in an IP2Geolocation database; it is possible to get semi-accurate readings in Denmark. If we get a geolocation result, we attach that to the information about the user as well.

After sending the cookie, we are able to identify the user again when that browser requests the same image again.
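
The beacon logic described above can be sketched as a small function that takes the request headers and decides what to send back. This is a Python illustration of the idea, not the actual image.php; the cookie name `visitor_id`, the function name `handle_beacon` and the one-year cookie lifetime are assumptions.

```python
import base64
import uuid
from http.cookies import SimpleCookie

# A commonly used 1x1 transparent GIF, base64-encoded.
PIXEL_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)
COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def handle_beacon(request_headers):
    """Identify the visitor from the cookie, or hand out a new one.

    Returns (visitor_id, page_url, response_headers, body).
    """
    cookies = SimpleCookie(request_headers.get("Cookie", ""))
    response_headers = [
        ("Content-Type", "image/gif"),
        ("Cache-Control", "no-store"),  # make the browser ask again next time
    ]
    if COOKIE_NAME in cookies:
        visitor_id = cookies[COOKIE_NAME].value
    else:
        visitor_id = uuid.uuid4().hex
        fresh = SimpleCookie()
        fresh[COOKIE_NAME] = visitor_id
        fresh[COOKIE_NAME]["path"] = "/"
        fresh[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365
        response_headers.append(("Set-Cookie", fresh[COOKIE_NAME].OutputString()))
    # The Referer header tells us which page embedded the image.
    page_url = request_headers.get("Referer")
    return visitor_id, page_url, response_headers, PIXEL_GIF
```

The first request from a browser gets a `Set-Cookie` header; later requests carry the cookie, so the same `visitor_id` comes back and no new cookie is issued.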

Content-part

The content part is where it gets fun: now that we have information about the user, we just want to have fun with it. Our API is designed to be used with jQuery, and this is where the lovely JSONP trick comes in!

The JSONP trick is a way to get around the same-origin policy that all browsers enforce: the page loads the API URL as a script, and the response is a call to a JavaScript callback function with the JSON data as its argument. This enables us to communicate JSON from one host to another. The nice thing about JSON is that it parses directly into JavaScript, so you don’t have to do anything complicated to work with the returned dataset.
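
On the server side, the trick boils down to wrapping the JSON in a call to whatever callback name the client asked for. A minimal Python sketch of that wrapping follows (the project itself was PHP; the function name `jsonp_response` and the validation rule are assumptions). Checking the callback name matters, because it is echoed straight back into a script.

```python
import json
import re

# Allow only plain (optionally dotted) JavaScript identifiers as callback names.
CALLBACK_RE = re.compile(r"^[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*$", re.ASCII)


def jsonp_response(callback, payload):
    """Wrap a JSON-serializable payload in a call to the given callback."""
    if not CALLBACK_RE.match(callback):
        raise ValueError("invalid JSONP callback name")
    body = json.dumps(payload, separators=(",", ":"))
    return "%s(%s);" % (callback, body)


if __name__ == "__main__":
    print(jsonp_response("handleSites", {"visitedsites": ["bt.dk"]}))
```

With jQuery, the client typically puts `callback=?` in the URL, and jQuery swaps the `?` for a generated function name before sending the request.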

Our API enables the following calls:

Call | Description | Example of data
location | Returns the geolocation of the user | ?({"location":{"countryname":"Denmark","regionname":"Hovedstaden","cityname":"Hellerup","latitude":"55.7333","longitude":"12.5833"}});
visitedsites | Returns the sites this user has visited | ?({"visitedsites":["bt.dk"]});
articletags | Returns the tags from the articles the user has viewed | ?({"articletags":["bt","hunde","katte"]});
populararticles | Returns articles that relate to the tags of the user | ?({"populararticles":["http://bt.dk/artikel/287","http://bt.dk/artikel/370","http://bt.dk/artikel/543","http://bt.dk/artikel/718","http://bt.dk/artikel/68"]});

The reason we have the visitedsites call is that Berlingske Media has several sites hooked up to this system, and we want to check whether a user has visited a specific site. For example: if a user has visited our site about dogs, we want to know it, so we can show him or her dog-related content.
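
To give a feel for how article tags can turn into suggestions, here is a small Python sketch. It is not the scoring we actually used: the function names, the "sum of tag weights" score and the tags assigned to the example URLs are all assumptions for illustration.

```python
from collections import Counter


def user_tag_weights(viewed_article_tags):
    """Count how often each tag appears across the articles a user viewed."""
    weights = Counter()
    for tags in viewed_article_tags:
        weights.update(tags)
    return weights


def rank_articles(weights, articles, limit=5):
    """Return article URLs ordered by how well their tags match the user."""
    scored = []
    for url, tags in articles.items():
        score = sum(weights[tag] for tag in set(tags))
        if score > 0:  # skip articles with no tag in common with the user
            scored.append((score, url))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _score, url in scored[:limit]]


if __name__ == "__main__":
    weights = user_tag_weights([["hunde", "bt"], ["hunde", "katte"]])
    candidates = {
        "http://bt.dk/artikel/287": ["hunde", "katte"],  # made-up tags
        "http://bt.dk/artikel/370": ["fodbold"],
        "http://bt.dk/artikel/543": ["bt"],
    }
    print(rank_articles(weights, candidates))
```

The same weights could drive the tag cloud idea from earlier: the higher the count, the bigger the font.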

Performance, performance, performance…

Okay, it wasn’t all fun and games: our system needed to perform well. Because of secrecy I can’t tell you how many requests per second we needed to serve to meet our goal. Let’s just say that we should be able to serve more than 2000 requests per second.

We switched out Apache in favor of NGINX. We first tried Lighttpd but ran into some serious issues when we stress-tested our system, so we switched to NGINX and never looked back. We used PHP-FPM to serve PHP behind NGINX, and saw the number of requests we could serve go up.

We also needed caching, so we installed APC (Alternative PHP Cache) to make things speedy. APC also allows you to cache variables in RAM, but we needed to be able to cache things in Memcache as well if our system was going to run across multiple webservers. For that we used the PHP library Stash, which allows you to specify more than one cache engine and also does things like stampede protection for you.
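
The idea of stacking several cache engines can be sketched like this in Python. It is not Stash’s API, just an illustration of the lookup order; the class name `TieredCache` is an assumption, plain dicts stand in for APC and Memcache, and there is no expiry or stampede protection here.

```python
class TieredCache:
    """Check the fastest cache first and fall back to slower ones."""

    def __init__(self, *tiers):
        # Each tier is any dict-like object, ordered fastest first.
        self.tiers = tiers

    def get_or_compute(self, key, compute):
        for index, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Copy the hit into the faster tiers for next time.
                for faster in self.tiers[:index]:
                    faster[key] = value
                return value
        # Missed everywhere: do the expensive work once and store it.
        value = compute()
        for tier in self.tiers:
            tier[key] = value
        return value


if __name__ == "__main__":
    local_ram, shared = {}, {}  # stand-ins for APC and Memcache
    cache = TieredCache(local_ram, shared)
    print(cache.get_or_compute("tags:42", lambda: ["bt", "hunde"]))
    print(cache.get_or_compute("tags:42", lambda: ["never", "called"]))
```

The local tier saves a network round trip on every hit, while the shared tier keeps the webservers from each recomputing the same value.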

During development we set up a box with XDebug and KCachegrind so we could see where our code was spending its time and figure out where our bottlenecks were.

After revising our initial MySQL queries with the help of EXPLAIN and adding a few indexes, we also managed to make our database access faster, and we ended up being able to serve more than 2500 requests per second!
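
To show what "ask the database for its plan, then add an index" looks like in practice, here is a self-contained Python sketch. Note the swap: it uses SQLite’s `EXPLAIN QUERY PLAN` from the standard library instead of MySQL’s `EXPLAIN`, and the `visits` table, its columns and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor_id TEXT, url TEXT)")
lookup = "SELECT url FROM visits WHERE visitor_id = ?"

# Without an index the database has to scan the whole table.
plan_before = query_plan(conn, lookup, ("abc",))

conn.execute("CREATE INDEX idx_visits_visitor ON visits (visitor_id)")

# With the index it can jump straight to the matching rows.
plan_after = query_plan(conn, lookup, ("abc",))

print(plan_before)
print(plan_after)
```

The first plan reports a table scan and the second one mentions the new index; the exact wording differs between SQLite versions, and MySQL’s output looks different again, but the workflow is the same.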

So this system of yours, what can it do for the business?

In general it can learn about user habits and suggest better reading material to the user, so it becomes much easier to build blocks that relate to the user’s interests. Bt.dk has a lot of these boxes:

Instead of filling them with the latest news, or something else irrelevant, it could be nice to fill them with articles that relate to the user’s interests. Our system also has built-in support for geotargeting, so it could also be relevant to show articles about stories that happen near the user.
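
Geotargeting of this kind can be sketched as sorting items by their distance to the user’s position. The Python below is an illustration only: the function names, the item format and the approximate city coordinates are assumptions, and the user position is the Hellerup example from the location call.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(user_lat, user_lon, items, limit=3):
    """Return the items closest to the user, nearest first."""
    def distance(item):
        return haversine_km(user_lat, user_lon, item["lat"], item["lon"])
    return sorted(items, key=distance)[:limit]


if __name__ == "__main__":
    # Approximate coordinates, for illustration only.
    stories = [
        {"name": "Aarhus", "lat": 56.1629, "lon": 10.2039},
        {"name": "Copenhagen", "lat": 55.6761, "lon": 12.5683},
        {"name": "Odense", "lat": 55.4038, "lon": 10.4024},
    ]
    for story in nearest(55.7333, 12.5833, stories):
        print(story["name"])
```

Since the IP-based location is only semi-accurate, a ranking like this is better suited to "stories from your area" than to anything that needs street-level precision.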

Berlingske Media also owns a site called lobnu.dk (Løb nu = Run now). When you use the site, it shows you the most recently updated running routes from all over Denmark. Instead, we propose showing the user running routes near him or her.

What did we learn?

We learned that it is important to set up XDebug early on in the project, if you know that your system should be able to perform very well under stress.

We also learned that it is very important to communicate during a project like this, because I spent 2 weeks in Ukraine together with the developers from Berlingske Media in Kiev.

Building a chat with nice tabs in jQuery

Saturday, December 10th, 2011

I am currently trying to find an excuse to work with NowJS, a very cool Node.js library that makes it possible for the server to push to connected clients. You can group your users into “channels” and send messages only to certain groups.

So far I have only built the client and the JavaScript that goes with it. It is still very hacky (both CSS and JavaScript), but the core concepts are there.

You are able to see a demo here: Demo chat preview

If you were active on the Danish chat ‘Jubii Chat’ back in the day, you might recognize one of the users ;)