This project used data about software developers from Github, StackOverflow, and Twitter.
This data lets us understand the relative importance of different software developers to their communities, by looking at their relationships, activity levels, and other characteristics which are publicly shown on those sites. We collected this data by making API calls over a long period to get a full set of user and repository profiles, which we could then filter down to those of interest to us (UK-based and popular). This is something of a needle in a haystack problem: we started with 3 million profiles, and eventually filtered this down to 3,000 key users, only 0.1% of the total.
We also drew on OpenCorporates’ data about corporate entities, to learn more about the companies where innovators worked.
Our final analysis was framed around the Github user profiles. We took the set of around 3,000 UK-based users with more than 10 followers each (i.e. an in-degree greater than 10) as our starting point, our assumption being that having a number of peers following one’s work is an indicator of doing something interesting or innovative. We used these to create the counts of ‘innovators’ by city and company, and we analysed the number of repositories in each language for each of these users to create the heat map of popular languages by city.
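As an illustration only (this is not the code used in the project), the follower threshold, the count of innovators by city and the language-by-city table behind the heat map could be sketched as follows. The record layout and the field names `login`, `city`, `followers` and `repo_languages` are assumptions, and the three sample profiles are invented.

```python
from collections import Counter, defaultdict

# Hypothetical, already-cleaned profile records; the field names are assumptions.
profiles = [
    {"login": "alice", "city": "London", "followers": 42,
     "repo_languages": ["Python", "Python", "Go"]},
    {"login": "bob", "city": "London", "followers": 10,
     "repo_languages": ["Ruby"]},
    {"login": "carol", "city": "Bristol", "followers": 15,
     "repo_languages": ["JavaScript", "Python"]},
]

FOLLOWER_THRESHOLD = 10  # keep users with strictly more than 10 followers


def select_innovators(profiles, threshold=FOLLOWER_THRESHOLD):
    """Keep profiles whose follower count (in-degree) exceeds the threshold."""
    return [p for p in profiles if p["followers"] > threshold]


def innovators_by_city(innovators):
    """Count the selected users in each city."""
    return Counter(p["city"] for p in innovators)


def languages_by_city(innovators):
    """Count repositories per language within each city (one entry per repository)."""
    table = defaultdict(Counter)
    for p in innovators:
        table[p["city"]].update(p["repo_languages"])
    return table


innovators = select_innovators(profiles)
city_counts = innovators_by_city(innovators)
heatmap = languages_by_city(innovators)
```

In this sketch `bob` is excluded because exactly 10 followers does not exceed the threshold. The nested counts in `heatmap` are the kind of table that a heat map of languages by city could be drawn from.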
In order to create the network visualisations, we took the top 20 Github users by in-degree, and then found all of their reciprocal connections with in-degree >7, including those which sit outside the UK. The intention here is to discover how the social networks of innovators cross borders of company and city, and also to ensure that we picked up on any ‘hidden innovators’ who had more ordinary follower counts but who were followed and respected by the more visible members of the group, or who had not publicly stated their locations so would be missed in the initial sweep.
We applied the Force Atlas algorithm to these sub-networks to create a clustered layout. Most programmers’ networks were highly interconnected, so there was limited separation into clusters, but there were still enough interesting patterns to make the network visualisations worthwhile.
Data collection and scrubbing challenges
To restrict our ‘popular people’ list to UK-based developers, we first needed to filter the Github profiles by location. Location in Github is a free text field, so it was necessary to parse many different location descriptions, and to identify and remove many false positives, for example Cambridge, Massachusetts, where MIT is based.
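A minimal sketch of this kind of free-text location filter is shown below. It is not the project’s actual parser: the marker lists are short, illustrative assumptions, and a real filter would need far longer lists and manual review of edge cases. The key design choice is to check for non-UK markers first, so that an ambiguous city name such as Cambridge is rejected when it appears alongside a US state.

```python
import re

# Illustrative, deliberately incomplete marker lists (assumptions, not the project's data).
UK_MARKERS = {"uk", "united kingdom", "england", "scotland", "wales",
              "northern ireland", "great britain", "gb"}
UK_CITIES = {"london", "cambridge", "manchester", "edinburgh", "bristol"}
NON_UK_MARKERS = {"ma", "massachusetts", "usa", "us", "united states",
                  "ontario", "canada"}


def location_tokens(raw):
    """Split a free-text location into lower-case phrases and single words."""
    tokens = set()
    for part in re.split(r"[,/|;]", raw.lower()):
        cleaned = re.sub(r"[^a-z ]", "", part).strip()
        if cleaned:
            tokens.add(cleaned)             # whole phrase, e.g. "united kingdom"
            tokens.update(cleaned.split())  # individual words, e.g. "massachusetts"
    return tokens


def is_uk_location(raw):
    """Return True if the free-text location looks UK-based under the lists above."""
    if not raw:
        return False
    tokens = location_tokens(raw)
    if tokens & NON_UK_MARKERS:  # rule out false positives first
        return False
    if tokens & UK_MARKERS:
        return True
    return bool(tokens & UK_CITIES)
```

With these lists, `is_uk_location("Cambridge, UK")` is accepted while `is_uk_location("Cambridge Massachusetts")` and `is_uk_location("London, Ontario")` are rejected.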
We also found that the advertised API call limits were often incorrect, so we had to adjust our approach a number of times to make sure that we could collect all of the relevant data.
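One common way to cope with rate limits that differ from the documented ones is to react to the throttling that is actually observed, rather than pacing requests by the advertised figure. The sketch below shows a generic retry with exponential backoff; it is an assumption about how such an adjustment might look, not a description of what was done in this project. `RateLimitError` and the `fetch` callable are hypothetical stand-ins for whatever the API client raises and calls.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised by the caller's fetch function when throttled."""


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), waiting twice as long after each rate-limit failure.

    `sleep` is injectable so the behaviour can be checked without real waiting.
    """
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the final retry
            sleep(delay)
            delay *= 2
```

Because the waiting time is driven by failures rather than by a fixed quota, the same loop works whether the real limit is higher or lower than the advertised one.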
On the other hand, we found it much easier than expected to join the Github and Twitter data; almost all of the software developers on Github had also been Twitter early adopters and had used the same handle, so we could easily match these together by that unique identifier.
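The handle-based join can be sketched as a simple dictionary lookup. The field names `login` and `screen_name` are assumptions about the record layout, and the case-insensitive comparison is a choice made for this illustration. An identical handle does not by itself prove that two accounts belong to the same person, so a real pipeline might spot-check a sample of matches.

```python
def match_by_handle(github_users, twitter_users):
    """Pair Github and Twitter records that share a handle (case-insensitive)."""
    twitter_by_handle = {t["screen_name"].lower(): t for t in twitter_users}
    matched, unmatched = [], []
    for g in github_users:
        t = twitter_by_handle.get(g["login"].lower())
        if t is None:
            unmatched.append(g["login"])  # no Twitter account with this handle
        else:
            matched.append((g, t))
    return matched, unmatched


# Invented example records.
github_users = [{"login": "OctoDev"}, {"login": "solo"}]
twitter_users = [{"screen_name": "octodev"}, {"screen_name": "someone_else"}]
matched, unmatched = match_by_handle(github_users, twitter_users)
```

Keeping the `unmatched` list makes it easy to see how complete the join is before relying on it.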
The same matching also worked for StackOverflow users. However, we eventually found that the StackOverflow information added very little to our analysis, and made it more complex to understand, so we removed it from our source data.
We had two main analytical challenges: finding a reasonable definition of ‘innovator’; and figuring out a way to calculate and visualise social networks at a scale that was manageable.
For the ‘innovator’ definition, having tried a number of more complex approaches including network centrality calculations, StackOverflow scores, and other weighted measures, we eventually settled on a simple in-degree measure. This has the merit of being easy to explain, as well as conforming to an intuitive understanding of innovation, allowing the ‘collective intelligence’ of the network to determine who it thinks is innovative.
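To make the measure concrete: if each ‘follows’ relationship is stored as a `(follower, followed)` pair, a user’s in-degree is simply the number of pairs in which they appear as the followed account. The sketch below assumes that edge-list representation, which is an illustration rather than the project’s actual data model.

```python
from collections import Counter


def in_degree(edges):
    """Count followers per user from (follower, followed) pairs."""
    return Counter(followed for _follower, followed in edges)


def top_by_in_degree(edges, n=20):
    """Return the n most-followed users as (user, in_degree) pairs."""
    return in_degree(edges).most_common(n)


# Invented example: x has three followers, y has two, z has one.
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y"), ("b", "y"), ("a", "z")]
ranking = top_by_in_degree(edges, n=2)
```

Part of the appeal of this measure is that it needs nothing beyond a count, unlike the centrality calculations and weighted scores that were tried first.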
Visualising the social networks was much more difficult. We wanted to create visualisations that let the user explore a network by themselves, but which also provided some immediate insight into an innovator community by using clustering and colouring to indicate relationships and characteristics of the individuals. This proved extremely difficult: both clustering calculations and visualisation become impractical once the network reaches thousands or tens of thousands of nodes, while a single network of only the ‘top’ innovators gave very little insight.
Eventually we settled on creating a separate network for each of the ‘top’ innovators, taking all of their reciprocal connections with in-degree >7 and then the links between those connections. This gave us a set of manageably-sized networks with interestingly different characteristics and themes depending on the ‘seed’ person. It also ensured we picked up on other probable innovative individuals who didn’t have a huge in-degree or who hadn’t flagged themselves as in the UK, but who nonetheless were followed and respected by someone else with those characteristics.
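The per-seed construction can be sketched as follows, again assuming the follow graph is held as `(follower, followed)` pairs. This is one reading of the description above, not the project’s code: a ‘reciprocal connection’ is taken to be someone the seed follows who also follows the seed back, and the seed itself is kept in the resulting network.

```python
from collections import defaultdict


def build_seed_network(edges, seed, min_in_degree=7):
    """Return (nodes, edges) for the network around one seed user.

    Members are the seed's reciprocal connections whose in-degree exceeds
    min_in_degree; the returned edges are all follow links among the seed
    and those members.
    """
    following = defaultdict(set)  # user -> accounts they follow
    followers = defaultdict(set)  # user -> accounts that follow them
    for src, dst in edges:
        following[src].add(dst)
        followers[dst].add(src)

    reciprocal = following[seed] & followers[seed]
    members = {u for u in reciprocal if len(followers[u]) > min_in_degree}
    nodes = members | {seed}
    sub_edges = {(s, d) for s, d in edges if s in nodes and d in nodes}
    return nodes, sub_edges


# Invented toy graph; a low threshold is used so the example stays small.
toy_edges = [("s", "a"), ("a", "s"), ("s", "b"), ("b", "s"),
             ("b", "a"), ("c", "s"), ("c", "a")]
toy_nodes, toy_sub_edges = build_seed_network(toy_edges, "s", min_in_degree=1)
```

In the toy graph, `a` is kept (three followers), `b` is dropped (one follower, which does not exceed the threshold) and `c` is dropped because the seed does not follow `c` back. Note that membership does not depend on a declared location, which is how users outside the initial UK sweep can enter a network.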
Assumptions and possible biases
The obvious bias of this analysis is that it relies on public profiles, and on users of Github only. The Github element should not be a significant problem, as it is a commonplace tool for software developers. However, the public data aspect, and the use of in-degree as a measure, means that it will inevitably miss people doing innovative work in isolation, or who are using Github but are not sharing their work with anyone else. These ‘hidden innovators’ will remain off the radar in any such public data tracking.
In addition, we have relied only on people’s publicly declared locations and companies. This means that there are probably a number of individuals whose personal profiles are included here but whom we haven’t been able to explicitly link to their employer.
Finally, we will have some bias towards users of open source software, since Github is the de facto standard for open source code sharing. We don’t believe that this will be a major issue, since most developers even in corporate environments will make use of open source.
There are a number of surprises in this analysis which merit further investigation.
The first is how London-centric the innovator count is. This may be inevitable, given that strong communities may only grow stronger. However, we’d expected other towns to also show up in higher numbers.
The second is the limited number of UK commercial companies in the data. There is a strong public sector presence: the BBC, GDS, and University of Cambridge are all in the top employers for innovative developers. There are also a number of US software companies.
Finally, the different network types that we see are also interesting: some innovators are immensely connected and tracking lots of other people, to the extent that their social graph is a hairball that’s very hard to parse, while some follow very little of others’ work.