As you start working through the ‘What Works Cities’ program, you quickly come upon the requirement to build a ‘data dictionary’. (TLDR?: Check out the sample app and the install instructions)
Drilling down into the documentation on the GovExLabs site, the stated goal of said inventory is:
The first step to treating your city’s data as an asset is to create a comprehensive data inventory with consistent metadata. Knowing what data your city collects leads to efficiency, and increases accountability. It also eases citywide reporting, decision making, and performance optimization.
As we started opening up the examples provided at GovExLabs, almost all resort to a spreadsheet/table format to catalog the data. This is a great start; at least you now know what data you have. If you’re sorting through your ‘thousands of datasets’ mentioned in the documents, it seemed to me that such an overwhelming amount of data would be hard to work with in table form. Ideally, I’d like to be able to visually browse the inventory so I could more easily uncover relations. Even better, if I have a report in mind, I’d like to see both what data is available and how those pieces of data relate to each other.
In the story in my head, I’m a data analyst whose been asked a question like: ‘Can you show the correlation between water usage and work orders on water lines as a heat map by neighborhood and if it’s changed over the last decade’. I can see myself thinking:
Well, great, well I know the Utility Billing system has that data, and it has addresses… do I need to relate that to parcels? That’s GIS… okay, and work orders… are all the work orders in with the utility accounts? Wait, didn’t they go to a new billing system a few years back?
That thought brought me to put together this first whack at a visual way to manage and explore the inventory.
Terminology
Let’s define some terms used throughout the application:
- Data Source: This is the server or application that houses some sort of data. Common Data Sources might be ‘GIS’, ‘Finance System’, or ‘The File Cabinets in Room 203’. A Data Source can have multiple Data Sets.
- Data Set: This is a discrete collection of data. This might refer to a single table in a database, an subset of an application, or a data cube.
- Data Archetype: This is a category under which the dataset falls. A dataset may belong to multiple archetypes.
Features
Visual Cues
I want to incorporate several visual cues into the system. Currently, Active Data Sets are bordered in Green while inactive ones are bordered in gray. I also plan to visually indicate the refresh frequency and sensitivity.
Search
The system allows you to quickly highlight datasets matching a keyword, ‘graying out’ those that don’t match:
Data Relationship Exploration
I thought a data dictionary would come in most handy if one could visually see how the data relates. In our business analyst scenario in the introduction, I wanted to know where the Utility Billing, the Parcel, and the work order information resided:
By clicking on the items needed for my report, I can now get an overview of which data sources and data sets I need to take a look at, and I can also see which systems have multiple data sets already included (and therefore hopefully related).
Now and The Future
The project filled three side goals for me:
- I wanted to contribute something that was open source
- I wanted to play with Ruby on Rails 5 and webpack
- I wanted to try out React
I was thrilled to be able to get something productive out of React after only a few rage-quits. I’m also going to be extremely grateful to all those who help me clean up the the code and get it tested. This is a functional application but it’s unpolished and just a first draft to get feedback, especially on how I chose to do the relationship visualization.
As for the future, there are a lot of things I’d like to add, like:
- Splitting up internal and public views
- Adding authentication to manage that
- Allowing some auto-discovery of tables in a database
etc. Feel free to tweet me suggestions or submit them to issues and we’ll tag the requests accordingly.
Try It Out
I’ve put up a copy with sample data on Heroku at:
http://data-app.cityilluminated.com.
You can play with the various data sets and archetypes and get a feel for whether it makes sense for your organization. If it does, the repository is on Github:.
https://github.com/mccollek/data-dictionary
Download it and try it out! Please feel free to reach out to me with questions. Issues can be posted in Github and pull requests are most welcome!