Analysis of free software communities (I): a quantitative study on GRASS, gvSIG and QGIS

2011.09.24

This post is part of a series: introduction (I), adoption (II), activity (III), work hours (IV), generations (V), and coda (VI).

When selecting an application, it’s very common to consider technological factors (what does the application enable us to do?) and economic ones (how much money do we need?). Yet there is a third factor to take into account, the social aspects of the project: the community of users and developers who support it and keep it alive.

Over a series of posts, I’m going to show a quantitative analysis of the communities of three free software projects in the Geographical Information Systems sector: GRASS, gvSIG and QGIS. They are the most mature projects in the space, they are under the OSGeo Foundation umbrella, and they show some differences in culture and organization.

The authors

The results that I’ll show come from a paper I led jointly with Francisco Puga, Alberto Varela and Adrián Eirís from CartoLab, a research laboratory at the University of A Coruña. The results were published at the V Jornadas de SIG Libre, Girona 2010. If you are fluent in Spanish (reading or listening), you can benefit from these resources:

(in Spanish) The complete paper [PDF].
(in Spanish) The slides [PDF].
(in Spanish) Video explaining the highlights

For those who can’t, I’ll summarize the main points in short posts on this blog. The original authors have not reviewed the text as published, so consider any opinion expressed here as my own.

The idea

In its more than 25 years, the free software movement has delighted us with the high capacity a community-based model has for fostering creation and innovation. In recent years, that model has proved its viability in other areas too: content creation (Wikipedia), cartographic data creation (OpenStreetMap), book translation, etc. Yet little is known about “how to bootstrap and grow a community”. The only thing we can do is observe what others have done and learn from their experience.

To contribute to the understanding of how a community-based project works, I’ve worked with Francisco Puga and other people from CartoLab to put together some of the public information the projects generate and make some sense of it. The actors in a community interact with each other and, when that happens through the internet, a trail is left (messages to mailing lists carry author information and date, version control systems log information about the authors too, …). Building on this public information, and standing on the shoulders of giants (i.e., reviewing a lot of research similar to what we wanted to build), we have developed a quantitative analysis of the communities supporting GRASS, gvSIG and QGIS.

Public datasets

The first step was to evaluate and gather all the public information a project generates, which we wanted to do in an automated way. But, as we had to compare the three projects, the data had to be homogeneous: it had to exist for all three and be in a comparable format. Taking these constraints into account (and the limited time we had for this!), we collected information from two different systems:

Version control systems: for every project, we cloned all the information available in its repository to a local git repo, in order to parse the log of changes. This allowed us to study the whole history of the projects, from the very beginning to December 2010.
- Data was taken from trunk: grass – gvsig – qgis.
Mailing lists: by means of the mailingliststats tool (built mainly by our friend Israel Herráiz, thanks bro!) we gathered data from March 2008 to December 2010.
- Development mailing lists: grass – gvsig – qgis
- Users mailing lists: grass – gvsig en / gvsig es – qgis
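As a rough illustration of the "parse the log of changes" step, here is a minimal Python sketch. It is not the script used in the paper. It assumes the log was exported with `git log --date=short --pretty=format:%H%x09%an%x09%ad` (hash, author name and author date, separated by tabs); the sample data and the function names `parse_log` and `commits_per_author` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical sample, shaped like the output of:
#   git log --date=short --pretty=format:%H%x09%an%x09%ad
SAMPLE_LOG = (
    "a1\talice\t2010-11-30\n"
    "b2\tbob\t2010-12-01\n"
    "c3\talice\t2011-01-15\n"
)


def parse_log(text):
    """Turn tab-separated log lines into (sha, author, date) tuples."""
    commits = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        sha, author, day = line.split("\t")
        commits.append((sha, author, date.fromisoformat(day)))
    return commits


def commits_per_author(commits, until):
    """Count commits per author, keeping only those dated on or before `until`."""
    return Counter(author for _, author, day in commits if day <= until)


if __name__ == "__main__":
    commits = parse_log(SAMPLE_LOG)
    # The study stopped at December 2010, so later commits are ignored.
    print(commits_per_author(commits, until=date(2010, 12, 31)))
```

Counting by author name is the simplest choice, but it treats two aliases of the same person as two different committers.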

Some disclaimers:

Projects have a number of branches, plugins and so on. We focused the study on the main product, what a user gets when she downloads it. Further study of the plugin ecosystem is needed, and it would give us more fine-grained information.
Projects have more mailing lists than the ones we studied (translators, steering committee, other local/regional lists, etc.), varying in each case. The analysis focused on the developer and user lists because we think they are representative enough to mark the trend. We are not interested in giving an exact number (which may be impossible to measure!) but in drawing the long-term fluctuation of participation. Our intuition and past experience say that participation in those mailing lists correlates with participation in the larger community surrounding the projects.
In the particular case of the gvSIG users mailing lists, we studied the Spanish and English lists jointly. It makes sense to do so, as the Spanish list still holds the core of contributions from Hispanic American countries, while non-Spanish speakers interact through the international list. It is as if the project had two hearts.
Unfortunately, data quality limited the period under study: the range is from March 2008 to December 2010. Prior to that, not all projects have information, due to mailing list migrations.
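To make the idea of "drawing the trend" within a fixed window concrete, here is a small Python sketch. It is an assumption of mine, not the exact metric from the paper: it counts distinct senders per month, restricted to March 2008 to December 2010. The `(sender, date)` input format and the name `monthly_senders` are hypothetical; studying two lists jointly (as with the gvSIG users lists) would just mean concatenating their messages before calling it.

```python
from collections import defaultdict
from datetime import date

# Study window stated in the post: March 2008 to December 2010.
STUDY_START = date(2008, 3, 1)
STUDY_END = date(2010, 12, 31)


def monthly_senders(messages):
    """Map (year, month) -> number of distinct senders inside the study window.

    `messages` is an iterable of (sender, date) pairs (hypothetical format).
    """
    senders = defaultdict(set)
    for sender, day in messages:
        if STUDY_START <= day <= STUDY_END:
            senders[(day.year, day.month)].add(sender)
    # Sort by month so the result reads as a time series.
    return {month: len(people) for month, people in sorted(senders.items())}


if __name__ == "__main__":
    list_es = [("ana", date(2008, 3, 5)), ("ana", date(2008, 3, 9))]
    list_en = [("bo", date(2008, 3, 9)), ("bo", date(2011, 1, 1))]
    # Joint analysis of two lists: concatenate, then count.
    print(monthly_senders(list_es + list_en))
```

Using a set per month means a person who posts many times, or on both lists, is counted once for that month.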

What is it useful for?

It’s possible to analyze a community from a variety of points of view. Our approach is quantitative, by means of a common model which aggregates users depending on their level of participation:

Leaders: those who build the product and make the decisions.
Power users: those who adapt it to their needs and use it intensively.
Casual users: those who use it for a concrete task.

This approach allows us to better understand the size of the community and how its members interact, since the value provided by someone who sent only 1 mail to a mailing list in 6 months is not the same as that of someone who spent that time sending more than 100 patches to the code.
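A toy Python sketch of grouping people by participation level follows. The thresholds (`power_min`, `leader_min`) and the idea of deciding the tier from a raw contribution count alone are my own simplifications for illustration; the paper does not define the tiers this way, and the role of leaders (making decisions) cannot be derived from counts.

```python
from collections import Counter


def tier(contributions, power_min=10, leader_min=100):
    """Assign a participation tier from a contribution count.

    Both thresholds are hypothetical values chosen for this example.
    """
    if contributions >= leader_min:
        return "leader"
    if contributions >= power_min:
        return "power user"
    return "casual user"


def community_shape(counts):
    """Summarize how many people fall into each tier.

    `counts` maps a person to their number of contributions.
    """
    return Counter(tier(n) for n in counts.values())


if __name__ == "__main__":
    # One mail in six months versus more than 100 patches.
    print(community_shape({"ana": 1, "bo": 120, "cy": 15}))
```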

With these constraints, we managed to build the following indicators:

Adoption trend within users and developers: based on mailinglists data.
- Status: post published.
Activity and manpower: based on code contributions (commits).
- Status: post published.
Composition of the community: based on code contributions (commits).
- Status: post published.
Generational analysis: based on code contributions (commits).
- Status: post published.

Over the next weeks, I will be publishing the results of the study, in order to help us understand how different free software communities work, and what we can learn from that.

Comments

12 responses to “Analysis of free software communities (I): a quantitative study on GRASS, gvSIG and QGIS”

Markus Neteler

25 September, 2011

Hi,
a quick feedback: in table “Tabla 3: Top 10 desarrolladores – GRASS” the committers “markus” and “neteler” are the same person… that’s me. In a future version of the document, maybe put them together into one line as “markus|neteler”.
cheers
Markus Neteler

1. Andrés
  
  25 September, 2011
  
  Yep, we supposed so. Yours is not the only case, though, but we couldn’t find the time to research this in more depth (for example: asking the users themselves, matching the mails, …).
  
Markus Neteler

25 September, 2011

A comment concerning the GRASS GIS repository. Of course it is a fact that the first version was published in 1984. But since no civil internet existed nor any distributed versioning system, it is only traceable back till 1999. We decided to put GRASS into CVS the day before the famous “year 2000” bug… So slide 4 of your presentation should be corrected (likewise the document). See also http://wiki.osgeo.org/wiki/Open_Source_GIS_History

Markus Neteler

25 September, 2011

The “user trends” of just 2.x years (2008-2011) are too short for multi-year projects. Find the mailing list statistics since 1999 (note that the GRASS lists were started in 1992!) here:
http://markmail.org/search/?q=qgis
http://markmail.org/search/?q=grass%20gis
http://markmail.org/search/?q=gvsig

1. Andrés
  
  25 September, 2011
  
  Oh, what an amount of data for a research junkie like me 🙂 I’ll compare that to our findings. Thanks!
  
Cameron Shorter

2 October, 2011

Hi,
I’m fascinated by studies such as you have described, as users are regularly asking us at LISAsoft about recommendations on which Open Source project they should use, and I’d love to be able to base my response upon some solid metrics.
In particular, I’d love to be able to point people at metric results for all the 50 odd projects which have been included on the OSGeoLive DVD. http://live.osgeo.org
On a related note, I’ve written a more subjective description about the keys to success building the OSGeoLive community here: http://cameronshorter.blogspot.com/2011/06/memoirs-of-cat-herder-coordinating.html
Cameron Shorter

Barend Köbben

6 October, 2011

Putting the Spanish and English lists of gvSIG together is basically cheating… You should have included all non-english lists for all the softwares.

1. Andrés
  
  6 October, 2011
  
  Barend, I don’t think so. The indicator tries to measure the trend, not the exact number. It’s very wrong to see it as if it were the user base, which I suppose was your point. If we tried to do the latter, summing both lists would be very inappropriate, as you suggest. But if you try the former I think it makes sense, as the community is split between Spanish-speaking and English-speaking members (which does not happen in the other projects). Basically, the project has 2 hearts with activity, and the tendency in one place can affect the other.
  Although measuring all mailing lists would be the ideal situation, we couldn’t afford that.
  Nevertheless, the tendency is the same whether we aggregate both lists or take them separately, so it supports our initial guesses.
  
