(Geo) Database evolution while developing

Nov 11, 2012

—

Over the last year, I followed with interest the different approaches to evolving a database design that were being discussed within the PostgreSQL community. What follows is my take on the matter: how, this year, I developed a project with a rapidly evolving DB design using an agile approach.

The context

My requirements for this project were twofold:

An evolving DB design: at the beginning of the project I didn’t know what the DB design was going to look like. I had set out to use some advanced data modeling techniques that I had never used in production (dynamic segmentation and linear referencing with PostgreSQL/PostGIS) and needed an approach that supported my evolving understanding of the domain.

Intense collaboration with analysts: the project needed intensive data-processing work to polish and create the data for the application. I knew this would be an iterative process in which developers and analysts would collaborate to define and clarify the model we needed.

My approach

So, in the process of improving and automating my delivery pipeline, I set some rules for the project:

DB management through SQL and version control: the database was created from DDL scripts, and data was stored as CSV (if alphanumeric) or SQL (generated from Shapefiles to store geographical information).

Application and database evolve together: so should their code, which in practice meant putting the app and DB directories/projects in the same git repo.

Test-driven development: I needed to break the problem into small chunks I could deal with while my understanding of the domain improved. Besides, when refactoring the DB (schemas, triggers, functions, etc.), which happened frequently, I needed to know that all the pieces were still working. I decided to use pgTap for that.
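The first rule, a database rebuilt entirely from versioned text files, can be sketched as a short build script. This is a minimal illustration, not the project's actual tooling: the db/ddl and db/data layout, the one-CSV-per-table naming and the build_plan function are all assumptions made for the example. It only plans the psql invocations (a dry run) instead of executing them.

```python
from pathlib import Path
import shlex


def build_plan(repo_root, dbname):
    """Return the psql commands that would rebuild the DB, in order.

    Assumed (hypothetical) layout: <repo_root>/db/ddl/*.sql holds the DDL
    scripts and <repo_root>/db/data/*.csv holds one CSV per table, each
    file named after its table. Files are applied in sorted name order.
    """
    root = Path(repo_root) / "db"
    # ON_ERROR_STOP makes psql abort on the first failing statement.
    base = ["psql", "-d", dbname, "-v", "ON_ERROR_STOP=1"]
    plan = []
    # 1. Structure first: schemas, tables, functions, triggers.
    for ddl in sorted((root / "ddl").glob("*.sql")):
        plan.append(base + ["-f", str(ddl)])
    # 2. Then data: load each CSV into the table named like the file.
    for csv_file in sorted((root / "data").glob("*.csv")):
        copy = (
            f"\\copy {csv_file.stem} FROM '{csv_file}' "
            "WITH (FORMAT csv, HEADER true)"
        )
        plan.append(base + ["-c", copy])
    return plan


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_plan(".", "project_db"):
        print(shlex.join(cmd))
```

Running the plan for real would mean passing each command to subprocess.run(cmd, check=True). Keeping planning separate from execution makes the ordering easy to test without a database at hand.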

And how did it turn out?

The pipeline worked smoothly: both the analysts and the developers were working in their comfort zone with the proper tools; desktop GIS applications for the former, command line and SQL for the latter.

git provides an excellent mechanism for versioning text, so I had powerful tools at hand for versioning SQL structure and data (diff, cherry-pick, interactive rebases, etc.). Besides, seeing where the data was varying (names and types of fields, their values, etc.) allowed us to discover some bugs and problems early.

Database and application evolved at the same pace. By tagging the versions, we can build in seconds the binaries needed for any version of the application, together with the matching DB.

Tests at the DB level are a life-saver. pgTap allowed me to refactor the database with little risk and a lot of confidence in what I was doing. I had all kinds of tests: checking that a trigger fires when an UPDATE happens, that a function works, that data integrity and model validation hold after the initial restore, etc.

The same process was used for deploying to the development, staging and production environments, which resulted in fewer errors and no moments of panic.

Having the data in the repo and regenerating the DB from scratch was comfortable and quick: less than a minute on my laptop for the whole DB (100 MB of raw SQL), with similar numbers when deploying to staging over the wire. On a daily basis I only had to regenerate specific schemas of the DB, so the wait was on the order of seconds.
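The benefit of keeping data as plain text, spotting when the name or type of a field changes between versions, can also be checked mechanically. The sketch below is only an illustration of that idea (the post describes doing this with git's diff tools); the function names and the simple integer/text inference rule are assumptions made for the example.

```python
import csv
import io


def column_types(csv_text):
    """Map each column name to 'integer' or 'text' based on its values.

    A column counts as 'integer' only if every value parses with int();
    empty or missing values therefore make it 'text'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    types = {name: "integer" for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            if name is None:
                continue  # surplus fields beyond the header are ignored
            try:
                int(value)
            except (TypeError, ValueError):
                types[name] = "text"
    return types


def schema_drift(old_csv, new_csv):
    """List differences in column names and inferred types."""
    old, new = column_types(old_csv), column_types(new_csv)
    changes = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            changes.append(f"removed column: {name}")
        elif name not in old:
            changes.append(f"added column: {name}")
        elif old[name] != new[name]:
            changes.append(f"{name}: {old[name]} -> {new[name]}")
    return changes


if __name__ == "__main__":
    before = "id,lanes\n1,2\n2,4\n"
    after = "id,lanes\n1,2\n2,four\n"
    print(schema_drift(before, after))  # ['lanes: integer -> text']
```

A check like this could run over two tagged versions of the same CSV file, complementing what a plain text diff already shows line by line.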

Coda

We should consider the database as another deliverable to our clients and apply the same quality standards and methodology to developing it. In that sense, agile philosophy and practices match very well with DB evolution.

At the tools level, I was reluctant to introduce new tools and steps I didn’t know very well on such a tight schedule, so I decided to stick to the basic and spartan (git repo, shell scripts, pgTap and SQL), then iterate and grow a solution for our specific case. Although I missed some refactoring tools, it turned out to be a good approach, and now I’m in a good position to know the tradeoffs of the process, which in future projects will help me choose a specialized tool, if necessary.

Comments

2 responses to “(Geo) Database evolution while developing”

fpuga

25 January, 2013

Is there any reason to use SQL and CSV rather than other formats, like DBF and SHP, that are more natural to GIS technicians? As I see it, the transformation from SHP to SQL is being done by hand and “outside version control”.

1. Andrés
  
  25 January, 2013
  
  Yep, there is a reason: both SQL and CSV are “plain text”. That means that you can track changes and see them with git (diff, cherry-pick, etc.) and other console-oriented tools! A silly example: if a field of a shapefile is changed from integer to string, you see that immediately, or you can even create a test with pgTap to ensure it is an integer. That was one of the things I loved the most.
  Yep, I created the data from the original sources (Shapefiles and so on) with a script (also in the repo). So my pipeline was “Data in repo (CSV and SQL) > build process > Proper DB for the project”. Integrating the creation of SQL from SHP into the pipeline would be immediate and fairly trivial, but I guess I did not do it at the time because I felt more comfortable using just “plain text”. But it could be done, sure.
  