Google Flu Trends

This is a fascinating demonstration of how search data can be mined to extract all sorts of information. Google Flu Trends uses search data to build up a picture of flu epidemics in the US. It makes me wonder what other sorts of trends can be inferred from search data. Presumably lots of economic indicators could be extracted from search data in a similar manner – things like consumer confidence measures, measures of economic activity on a sector-by-sector basis, or other market indicators. The Google database is incredibly large, essentially cataloging the entire internet and people’s search histories. There must be a lot that can be extracted from this vast amount of information.

It’s been a good week

Two good things have happened in the last week. First, Obama won. But more importantly, this week I got my iPhone. I’ve just ordered a book on iPhone programming so I can delve into writing my own apps. I’m also keeping an eye out for the first Android phones (Android is Google’s new platform for mobile phones), which looks to be a very exciting (and more open) platform too.

Home sweet home

I made it back to Australia in one piece and have just spent a relaxing week on the Gold Coast (picture below). Now I’m in Brisbane to catch up with my friends and attend my PhD graduation (although it’s been over a year since I submitted my PhD). I seem to have perfected the art of overcoming jet-lag, as I had virtually no down-time this time around, which I was very pleased with. I’m also particularly pleased that I saved $400 on my new MacBook by buying it duty free at Changi Airport in Singapore. After my graduation I’m heading back to my home town of Armidale, in northern New South Wales, to contemplate the next step.

Farewell physics

Starting next week I’m making a departure from physics, and academia as a whole. There are several reasons for my decision to depart, which I want to share, as I think the problems I have with the academic system don’t just pertain to me, but are very generic problems that affect many people in the academic community.

The academic system has some serious problems. Most notably, in my opinion, there is very limited scope for promotion. For every permanent position there are countless postdocs competing for that position. It simply isn’t possible for all of us postdocs to progress right up the ladder. Many of us will be stuck as postdocs for the indefinite future. Realistically, I could expect to spend the next 5 or even 10 years as a postdoc before a permanent position would come along, and even then I would have very little control over where I would end up. I’ve seen many outstanding colleagues in exactly this position. This is unlike the private sector, where in virtually any industry there is a well-defined roadmap for promotion, which can be achieved if you’re good enough.

There is a huge salary discrepancy between academia and the private sector. With the same qualifications one can earn twice as much in the private sector as one can as a post-doc.

So from a financial and career progression point of view, academia is not especially competitive. Of course, academia has lots of fantastic advantages, which is why so many people choose it, despite its shortcomings. Most notably, in the private sector you will almost never have the degree of intellectual independence that academia offers.

In recognition of these problems, I realize that academia probably doesn’t represent a sustainable long-term career path for me. So the question is when rather than if I will change career. If I am going to change career then it’s far better to attempt to do this while I’m young, rather than wait another 5 or 10 years, after which it would be much more difficult to switch into a new career.

What are people’s thoughts? How can the academic system be improved? How can these problems be remedied? Can academia be made more attractive to people in the early stages of their careers?

PageRank in academic publishing

The standard measure scientists use to judge the importance of scientific papers is a simple citation count. That is, how many other papers cite the paper in question? While this measure has its merits, it has one fundamental flaw – not all citations are equal. For example, if a paper I wrote receives a citation from a highly influential Nature paper, that should carry more weight than a citation from the New England Journal of Who Gives A Crap. So what the scientific community needs to do is embrace a better measure that takes into account the importance of a citation. Numerous authors/bloggers have advocated using a PageRank-like index for quantifying the importance of papers or journals (here, here, here and here). In this article I’d like to throw my support behind this suggestion. I’ll begin by explaining the PageRank index and how it is calculated, and discuss why this approach is superior to a simple citation count.

The PageRank index was developed by Larry Page and Sergey Brin as a way of indexing the relevance of web-pages for their Google search engine (incidentally, PageRank is named after Larry Page, not web-page). Here I’ll describe the PageRank algorithm in simple terms, ignoring many of the details. For a more detailed explanation I recommend the Wikipedia entry. The calculation begins by representing the web as a graph. A graph is a mathematical object consisting of a bunch of vertices (dots), arbitrarily connected by a bunch of edges (lines). So a graph can be thought of as a completed dot-to-dot game. Here’s an example of a graph that I pulled from Wikipedia.

To represent the web we use a directed graph, where the edges carry a direction. In the graph shown above the circles (i.e. vertices) represent web-sites, and the edges represent links between them. So, for example, ‘A’ and ‘D’ might each represent a particular web-site. In this instance, the directed edge from D to A means that there is a link on site D pointing to site A.

The goal of the PageRank algorithm is two-fold. We wish to construct a measure of relevance that, first, reflects how many incoming links a site has and, second, reflects the importance of the sources of those links. So, edges can be regarded as a ‘vote’ for a site, but the impact of the vote should be proportional to the importance of the source of the vote.

The way PageRank calculates this is by iteratively applying ‘vote casting’ between sites. We begin by initializing every vertex with the same score, say 1. We then iterate through the vertices and pass PageRank points to the sites linked to by the respective site. The number of points cast is given by the originating site’s PageRank, divided by the total number of outgoing links. So the weight of the votes cast is proportional to the PageRank of the site casting the vote. For example, in the graph shown above, F would cast 3.9/2 votes to B and 3.9/2 votes to E, while C would cast its entire 34.3 votes to B. By repeating many such iterations, and renormalizing the scores so as to prevent blow-out, the PageRanks of the sites converge to constant values. What we are left with is a set of scores that reflects not just how many sites link to a given site, but also how important those linking sites are. [An alternate way of thinking about the PageRank index is that it can be represented as a coupled system of flow equations, where the ‘flow’ represents the flow of votes. The PageRanks are given by the steady state solution to this system of equations.]
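For the programmers out there, the vote-casting loop can be sketched in a few lines of Python. Treat this as a rough illustration rather than Google's actual implementation. The function name `pagerank`, the tiny four-site graph and the iteration count are all made up for the example. It also includes the usual damping factor (0.85 is the commonly quoted value), which is one of the details glossed over above, and it starts every score at 1/n rather than 1 so that the scores always sum to one.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iterative vote casting on a directed graph.

    `links` maps each vertex to the list of vertices it links to.
    The damping factor is an assumption not covered in the text above.
    """
    vertices = set(links)
    for targets in links.values():
        vertices.update(targets)
    n = len(vertices)

    # Every vertex starts on the same score (normalized to sum to 1).
    scores = {v: 1.0 / n for v in vertices}

    for _ in range(iterations):
        # Small baseline score that every vertex receives each round.
        new_scores = {v: (1.0 - damping) / n for v in vertices}
        for source in vertices:
            targets = links.get(source, [])
            if targets:
                # Votes cast = source's score divided by its outgoing links.
                share = damping * scores[source] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A vertex with no outgoing links spreads its vote evenly.
                share = damping * scores[source] / n
                for v in vertices:
                    new_scores[v] += share
        # Renormalize so the scores cannot blow up.
        total = sum(new_scores.values())
        scores = {v: s / total for v, s in new_scores.items()}
    return scores


# Hypothetical four-site web: C is linked to by A, B and D.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for site, score in sorted(pagerank(example_links).items()):
    print(site, round(score, 3))
```

In this toy graph C comes out on top because it collects votes from three sites, while D, which nobody links to, is left with only the baseline score.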

So this is how PageRank evaluates the relative importance of web-sites. What about scientific papers? Well, scientific papers can be mapped to a graph in a similar way to web-sites. Specifically, vertices in the graph would represent papers, and edges would represent citations. The PageRank algorithm can then be applied out-of-the-box.

There are a few tweaks and variations that one could apply to a paper-based PageRank index. First of all, one could discount self-citations from the index (i.e. an anti-‘bombing’ mechanism). This isn’t possible with web-pages because the authorship of web-sites is typically not public. But with scientific papers there is always an authorship list attached to every paper. I think this tweak is an important one since self-citations inevitably bias citation counts. For example, when I look back at the citation counts for my own papers I see that my very first paper has by far the most citations. 90% of them are self-citations 🙁 Should these citations really influence a credible measure of the importance of that paper? Probably not, since they are clearly biased. A second variation that one might try is to add a time bias when calculating the index, such that links from more recent papers carry more weight than links from older papers. This might make a valuable secondary index as it would reflect how important a paper is presently rather than historically. Again, this is something that is not possible with web-sites, since they are not time-stamped, but it is possible with papers.
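Here is one way the two tweaks might look in code. This is only a sketch under some assumptions of mine: a citation counts as a self-citation if the citing and cited papers share at least one author, and the time bias is an exponential decay with a hypothetical `half_life` in years, applied to the votes a citing paper casts. The function name `paper_rank`, the record layout and the sample papers are all invented for illustration.

```python
def paper_rank(papers, half_life=None, now=None, damping=0.85, iterations=100):
    """PageRank-style score for papers, ignoring self-citations.

    `papers` maps a paper id to a dict with keys "authors" (a set),
    "year" (an int) and "cites" (a list of paper ids).
    If `half_life` is given, votes from older citing papers are scaled
    down by 0.5 ** (age / half_life); this decay form is an assumption.
    """
    ids = list(papers)
    n = len(ids)
    if now is None:
        now = max(p["year"] for p in papers.values())

    # Keep only citations to known papers that share no author.
    outgoing = {}
    for pid in ids:
        authors = papers[pid]["authors"]
        outgoing[pid] = [
            cited for cited in papers[pid]["cites"]
            if cited in papers and not (authors & papers[cited]["authors"])
        ]

    def recency(pid):
        # Weight of the votes cast by a citing paper, based on its age.
        if half_life is None:
            return 1.0
        return 0.5 ** ((now - papers[pid]["year"]) / half_life)

    scores = {pid: 1.0 / n for pid in ids}
    for _ in range(iterations):
        new_scores = {pid: (1.0 - damping) / n for pid in ids}
        for source in ids:
            targets = outgoing[source]
            for target in targets:
                vote = damping * recency(source) * scores[source] / len(targets)
                new_scores[target] += vote
        # Votes lost to decay or to papers citing nothing are restored here.
        total = sum(new_scores.values())
        scores = {pid: s / total for pid, s in new_scores.items()}
    return scores


# Hypothetical records: "x" has one old outside citation plus a self-citation,
# "y" has one recent outside citation.
sample = {
    "old_citer": {"authors": {"Bob"}, "year": 1990, "cites": ["x"]},
    "new_citer": {"authors": {"Carol"}, "year": 2008, "cites": ["y"]},
    "self_citer": {"authors": {"Dave"}, "year": 2008, "cites": ["x"]},
    "x": {"authors": {"Dave"}, "year": 1985, "cites": []},
    "y": {"authors": {"Erin"}, "year": 1985, "cites": []},
}
plain = paper_rank(sample)
recent = paper_rank(sample, half_life=5, now=2008)
```

Note that the time weight scales the total vote a citing paper casts rather than how it splits that vote among its references. All of a paper's references are cited in the same year, so weighting the split alone would cancel out when the votes are divided up.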

In summary, a PageRank-based measure for the impact of scientific papers would address the problem that some citations are more valuable than others. It would more closely reflect the impact that a given paper has had on a field than a simple citation count. What’s left is for someone to set up a site calculating the PageRanks of papers and making them publicly available. Google?

Nürnberg Airport

I just returned from a one-week trip to Erlangen in northern Bavaria. At the airport I encountered the most bizarre security loophole I have seen for a while. Immediately after passing through the security clearance where knives are confiscated, you are greeted with a shop selling a broad selection of Swiss Army Knives. I was curious and approached the cashier. She assured me that these knives were okay because the length of the blade was within legal requirements. That may be so, but it doesn’t change the fact that these were regular, full-size Swiss Army Knives, more than ample to stab someone or slit someone’s throat with. It’s an interesting contrast given that at most other airports even toenail clippers are confiscated.