PAKISTAN OPEN DATA PLAYBOOK FOR THE GOVERNMENT AND CITIZENS
An outline for making government processes more transparent and introducing a culture of ‘open by default’
- Introduction
- Definitions
- Implementation
- Tools and Technologies
- Case Studies/Use Cases
- Additional Resources
Introduction:
Do you know exactly how much of your tax money is spent on street lights or on cancer research? What is the shortest, safest and most scenic bicycle route from your home to your work? And what is in the air that you breathe along the way? Where in your region will you find the best job opportunities and the highest number of fruit trees per capita? When can you influence decisions about topics you deeply care about, and whom should you talk to?
New technologies now make it possible to build the services to answer these questions automatically. Much of the data you would need to answer these questions is generated by public bodies. However, often the data required is not yet available in a form which is easy to use. This book is about how to unlock the potential of official and other information to enable new services, to improve the lives of citizens and to make government and society work better.
The notion of open data and specifically open government data - information, public or otherwise, which anyone is free to access and re-use for any purpose - has been around for some years. In 2009 open data started to become visible in the mainstream, with various governments (such as the USA, UK, Canada and New Zealand) announcing new initiatives towards opening up their public information.
What Is Open Data?
Open data is defined by the Open Definition as follows:
Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and share-alike.
The full Open Definition gives precise details as to what this means. To summarize the most important:
- Availability and Access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
- Re-use and Redistribution: the data must be provided under terms that permit re-use and redistribution including the intermixing with other datasets.
- Universal Participation: everyone must be able to use, re-use and redistribute - there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.
If you’re wondering why it is so important to be clear about what open means and why this definition is used, there’s a simple answer: interoperability.
Interoperability denotes the ability of diverse systems and organizations to work together (inter-operate). In this case, it is the ability to interoperate - or intermix - different datasets.
Interoperability is important because it allows for different components to work together. This ability to componentize and to ‘plug together’ components is essential to building large, complex systems. Without interoperability this becomes near impossible — as evidenced in the most famous myth of the Tower of Babel where the (in)ability to communicate (to interoperate) resulted in the complete breakdown of the tower-building effort.
We face a similar situation with regard to data. The core of a “commons” of data (or code) is that one piece of “open” material contained therein can be freely intermixed with other “open” material. This interoperability is absolutely key to realizing the main practical benefits of “openness”: the dramatically enhanced ability to combine different datasets together and thereby to develop more and better products and services (these benefits are discussed in more detail in the section on ‘why’ open data).
Providing a clear definition of openness ensures that when you get two open datasets from two different sources, you will be able to combine them together, and it ensures that we avoid our own ‘tower of babel’: lots of datasets but little or no ability to combine them together into the larger systems where the real value lies.
Why Is It Important?
Open data, especially open government data, is a tremendous resource that is as yet largely untapped. Many individuals and organisations collect a broad range of different types of data in order to perform their tasks. Government is particularly significant in this respect, both because of the quantity and centrality of the data it collects, but also because most of that government data is public data by law, and therefore could be made open and made available for others to use. Why is that of interest?
There are many areas where we can expect open data to be of value, and where examples of how it has been used already exist. There are also many different groups of people and organisations who can benefit from the availability of open data, including the government itself. At the same time it is impossible to predict precisely how and where value will be created in the future. The nature of innovation is that developments often come from unlikely places.
It is already possible to point to a large number of areas where open government data is creating value. Some of these areas include:
- Transparency and democratic control
- Participation
- Self-empowerment
- Improved or new private products and services
- Innovation
- Improved efficiency of government services
- Improved effectiveness of government services
- Impact measurement of policies
- New knowledge from combined data sources and patterns in large data volumes
Examples exist for most of these areas.
In terms of transparency, projects such as the Finnish ‘tax tree’ and British ‘where does my money go’ show how your tax money is being spent by the government. And there’s the example of how open data saved Canada $3.2 billion in charity tax fraud. Also various websites such as the Danish folketsting.dk track activity in parliament and the law making processes, so you can see what exactly is happening, and which parliamentarians are involved.
Open government data can also help you to make better decisions in your own life, or enable you to be more active in society. A woman in Denmark built findtoilet.dk, which showed all the Danish public toilets, so that people she knew with bladder problems can now trust themselves to go out more again. In the Netherlands a service, vervuilingsalarm.nl, is available which warns you with a message if the air-quality in your vicinity is going to reach a self-defined threshold tomorrow. In New York you can easily find out where you can walk your dog, as well as find other people who use the same parks. Services like ‘mapumental’ in the UK and ‘mapnificent’ in Germany allow you to find places to live, taking into account the duration of your commute to work, housing prices, and how beautiful an area is. All these examples use open government data.
Economically, open data is of great importance as well. Several studies have estimated the economic value of open data at several tens of billions of Euros annually in the EU alone. New products and companies are re-using open data. The Danish husetsweb.dk helps you to find ways of improving the energy efficiency of your home, including financial planning and finding builders who can do the work. It is based on re-using cadastral information and information about government subsidies, as well as the local trade register. Google Translate uses the enormous volume of EU documents that appear in all European languages to train the translation algorithms, thus improving its quality of service.
Open data is also of value for the government itself. For example, it can increase government efficiency. The Dutch Ministry of Education has published all of their education-related data online for re-use. Since then, the number of questions they receive has dropped, reducing work-load and costs, and the remaining questions are now also easier for civil servants to answer, because it is clear where the relevant data can be found. Open data is also making government more effective, which ultimately also reduces costs. The Dutch department for cultural heritage is actively releasing their data and collaborating with amateur historical societies and groups such as the Wikimedia Foundation in order to execute their own tasks more effectively. This not only results in improvements to the quality of their data, but will also ultimately make the department smaller.
While there are numerous instances of the ways in which open data is already creating both social and economic value, we don’t yet know what new things will become possible. New combinations of data can create new knowledge and insights, which can lead to whole new fields of application. We have seen this in the past, for example when Dr. Snow discovered the relationship between drinking water pollution and cholera in London in the 19th century, by combining data about cholera deaths with the location of water wells. This led to the building of London’s sewage systems, and hugely improved the general health of the population. We are likely to see such developments happening again as unexpected insights flow from the combination of different open data sets.
This untapped potential can be unleashed if we turn public government data into open data. This will only happen, however, if it is really open, i.e. if there are no restrictions (legal, financial or technological) to its re-use by others. Every restriction will exclude people from re-using the public data, and make it harder to find valuable ways of doing that. For the potential to be realized, public data needs to be open data.
Purpose Of This Playbook
This playbook explains the basic concepts of ‘open data’, especially in relation to government. It covers how open data creates value and can have a positive impact in many different areas. In addition to exploring the background, it also provides concrete information on how to produce open data.
This playbook targets a broad audience, including:
- those who have never heard of open data before, as well as those who consider themselves seasoned ‘data professionals’
- civil servants and activists
- journalists and researchers
- politicians and developers
- data geeks
- those who have never heard of an API.
The playbook assumes little or no prior knowledge of the topic. If you do find a piece of jargon or terminology with which you aren’t familiar, please feel free to reach out to us.
Definitions:
Basic Terminologies
Data: A value or set of values representing a specific concept or concepts. Data becomes “information” when analyzed, and possibly combined with other data, in order to extract meaning and provide context. The meaning of data can vary depending on its context. The term covers all types of data, including but not limited to geospatial data, unstructured data and structured data.
Open Data: Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose - subject only, at most, to requirements to provide attribution and/or share-alike. Specifically, open data is defined by the Open Definition and requires that the data be:
- Legally open: available under an open (data) license that permits anyone to freely access, reuse and redistribute it
- Technically open: available for no more than the cost of reproduction, and in machine-readable and bulk form.
Dataset: A dataset is an organized collection of data. The most basic representation of a dataset is data elements presented in tabular form. Each column represents a particular variable. Each row corresponds to a given value of that column’s variable. A dataset may also present information in a variety of non-tabular formats, such as an extensible mark-up language (XML) file, a geospatial data file, or an image file, etc.
CSV: A comma separated values (CSV) file is a computer data file used for implementing the tried and true organizational tool, the Comma Separated List. The CSV file is used for the digital storage of data structured in a table of lists form. Each line in the CSV file corresponds to a row in the table. Within a line, fields are separated by commas, and each field belongs to one table column. CSV files are often used for moving tabular data between two different computer programs (like moving between a database program and a spreadsheet program).
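The row-and-column structure described above can be seen in a few lines of Python, using only the standard library; the budget figures below are invented for illustration:

```python
import csv
import io

# A small table of hypothetical budget rows (illustrative values only).
rows = [
    {"department": "Health", "year": 2023, "budget_pkr_m": 1250},
    {"department": "Education", "year": 2023, "budget_pkr_m": 980},
]

# Write the table as CSV text: one header line, then one line per row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["department", "year", "budget_pkr_m"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read it back: each line becomes a dict keyed by the header fields.
# Note that CSV carries no types - every value comes back as a string.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Because any spreadsheet or database program can read this format, CSV is one of the simplest ways to move tabular data between tools.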
JSON: JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language, Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.
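A minimal sketch of the same kind of record expressed as JSON, again using only Python’s standard library:

```python
import json

# A single record as a Python dict; JSON preserves types such as numbers.
record = {"department": "Health", "year": 2023, "budget_pkr_m": 1250}

text = json.dumps(record, indent=2)   # serialize to human-readable JSON text
roundtrip = json.loads(text)          # parse it back into a Python dict
```

Unlike CSV, JSON round-trips nested structures and basic types without loss, which is why it is so common as a data-interchange format.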
Metadata: To facilitate common understanding, a number of characteristics, or attributes, of data are defined. These characteristics of data are known as “metadata”, that is, “data that describes data.” For any particular datum, the metadata may describe how the datum is represented, ranges of acceptable values, its relationship to other data, and how it should be labeled. Metadata also may provide other relevant information, such as the responsible steward, associated laws and regulations, and access management policy. Each of the types of data described above has a corresponding set of metadata.
Open Source Software: Computer software that is available in source code form: the source code and certain other rights normally reserved for copyright holders are provided under an open-source license that permits users to study, change, improve and at times also to distribute the software.
Open source software is very often developed in a public, collaborative manner, and is the most prominent example of open source development; it is often compared to (technically defined) user-generated content or (legally defined) open content movements.
API: An application programming interface: a set of definitions of the ways one piece of computer software communicates with another. It is a method of achieving abstraction, usually (but not necessarily) between higher-level and lower-level software.
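As a sketch of what consuming a data API looks like, the snippet below parses a canned response body; the payload structure is an assumption for illustration, not any real portal’s schema, and a real client would first fetch the body over HTTP:

```python
import json

# A real client would fetch this over HTTP, e.g. with
# urllib.request.urlopen() against the portal's documented endpoint
# (any such URL would be hypothetical here). Below is a canned body
# shaped like what a "list datasets" call might return.
response_body = """
{
  "datasets": [
    {"id": "budget-2023", "title": "Federal Budget 2023", "format": "CSV"},
    {"id": "air-quality", "title": "Air Quality Readings", "format": "JSON"}
  ]
}
"""

payload = json.loads(response_body)
titles = [d["title"] for d in payload["datasets"]]
```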
Characteristics of Open Data
In general, open data will be consistent with the following principles:
- Public. Data must be open by default and published to the extent permitted by law, subject to privacy, confidentiality, security, or other valid restrictions.
- Accessible. Open data are made available in convenient, modifiable, and open formats that can be retrieved, downloaded, indexed, and searched. Formats should be machine-readable (i.e., data are reasonably structured to allow automated processing). Open data structures do not discriminate against any person or group of persons and should be made available to the widest range of users for the widest range of purposes, often by providing the data in multiple formats for consumption. To the extent permitted by law, these formats should be non-proprietary, publicly available, and no restrictions should be placed upon their use.
- Described. Open data are described fully so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations, security requirements, as well as how to process them. This involves the use of robust, granular metadata (i.e., fields or elements that describe data), thorough documentation of data elements, data dictionaries, and, if applicable, additional descriptions of the purpose of the collection, the population of interest, the characteristics of the sample, and the method of data collection.
- Reusable. Open data are made available under an open license that places no restrictions on their use.
- Complete. Open data are published in primary forms (i.e., as collected at the source), with the finest possible level of granularity that is practicable and permitted by law and other requirements. Derived or aggregate open data should also be published but must reference the primary data.
- Timely. Open data are made available as quickly as necessary to preserve the value of the data. Frequency of release should account for key audiences and downstream needs.
- Managed Post-Release. A point of contact must be designated to assist with data use and to respond to complaints about adherence to these open data requirements.
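The principles above can be turned into a simple pre-publication checklist. The sketch below is a minimal illustration in Python; the required field names and rules are assumptions for this example, not a formal standard:

```python
# Fields a publishing checklist might require before release; the exact
# field names here are an assumption for illustration.
REQUIRED_METADATA = {"title", "description", "license", "format",
                     "contact", "last_updated"}

def release_problems(metadata: dict) -> list:
    """Return a list of checklist problems; an empty list means ready."""
    # "Described" and "Managed Post-Release": every required field present.
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_METADATA - metadata.keys())]
    # "Accessible": flag formats that are hard to process automatically.
    if metadata.get("format", "").lower() == "pdf":
        problems.append("format is not machine-readable")
    return problems

# A deliberately incomplete draft record to illustrate the output.
draft = {"title": "School Locations", "format": "CSV", "license": "CC-BY-4.0"}
issues = release_problems(draft)
```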
Standards and Specifications
System owners and data owners should, wherever possible, consider relevant international and national standards for data elements. Standards bodies dealing with data include:
- ISO
- United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT)
- W3C The World Wide Web Consortium
- IETF The Internet Engineering Task Force
- American National Standards Institute (ANSI)
- International Committee for Information Technology Standards (INCITS)
- The Federal Geographic Data Committee (FGDC)
- The National Information Exchange Model (NIEM)
Open Data Licenses
For the purposes of Open Data, the term “Open License” is used to refer to any legally binding instrument that grants permission to access, re-use, and redistribute a work with few or no restrictions. While technically not a “license,” worldwide public domain dedications such as Creative Commons Zero also satisfy this definition. An “Open License” must meet the following conditions:
- Reuse. The license must allow for reproductions, modifications and derivative works and permit their distribution under the terms of the original work. The rights attached to the work must not depend on the work being part of a particular package. If the work is extracted from that package and used or distributed within the terms of the work’s license, all parties to whom the work is redistributed should have the same rights as those that are granted in conjunction with the original package.
- Redistribution. The license shall not restrict any party from selling or giving away the work either on its own or as part of a package made from works from many different sources. The license shall not require a royalty or other fee for such sale or distribution. The license may require as a condition for the work being distributed in modified form that the resulting work carry a different name or version number from the original work. The rights attached to the work must apply to all to whom it is redistributed without the need for execution of an additional license by those parties. The license must not place restrictions on other works that are distributed along with the licensed work. For example, the license must not insist that all other works distributed on the same medium are open. If adaptations of the work are made publicly available, these must be under the same license terms as the original work.
- No Discrimination against Persons, Groups, or Fields of Endeavor. The license must not discriminate against any person or group of persons. The license must not restrict anyone from making use of the work in a specific field of endeavor. For example, it may not restrict the work from being used in a business, or from being used for research.
Examples of Open Licenses & Dedications
When agencies purchase data or content from third-party vendors, care must be taken to ensure the information is not hindered by a restrictive, non-open license. In general, such licenses should comply with the Open Knowledge Definition of an open license. Several examples of open licenses and dedications for use by agencies are listed below:
Worldwide Public Domain Dedications
- Creative Commons Zero Public Domain Dedication (CC0), e.g. "license":"https://creativecommons.org/publicdomain/zero/1.0/"
- Open Data Commons Public Domain Dedication and Licence (PDDL), e.g. "license":"http://opendatacommons.org/licenses/pddl/1.0/"
Open Licenses
- Open Data Commons Attribution License (ODC-By), e.g. "license":"http://opendatacommons.org/licenses/by/1.0/"
- Open Data Commons Open Database License (ODbL), e.g. "license":"http://opendatacommons.org/licenses/odbl/1.0/"
- Creative Commons Attribution (CC BY), e.g. "license":"https://creativecommons.org/licenses/by/4.0/"
- Creative Commons Attribution-ShareAlike (CC BY-SA), e.g. "license":"https://creativecommons.org/licenses/by-sa/4.0/"
- GNU Free Documentation License, e.g. "license":"http://www.gnu.org/licenses/fdl-1.3.en.html"
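Publishing the license in machine-readable metadata, as in the `"license"` fields above, makes it possible to check programmatically whether a record is openly licensed. A minimal sketch using the license URLs listed above (the record itself is invented):

```python
# URLs of the open licenses and dedications listed above, keyed by short name.
OPEN_LICENSES = {
    "CC0": "https://creativecommons.org/publicdomain/zero/1.0/",
    "PDDL": "http://opendatacommons.org/licenses/pddl/1.0/",
    "ODC-By": "http://opendatacommons.org/licenses/by/1.0/",
    "ODbL": "http://opendatacommons.org/licenses/odbl/1.0/",
    "CC-BY": "https://creativecommons.org/licenses/by/4.0/",
    "CC-BY-SA": "https://creativecommons.org/licenses/by-sa/4.0/",
}

def is_open_license(metadata: dict) -> bool:
    """True if the record's license field is one of the known open licenses."""
    return metadata.get("license") in OPEN_LICENSES.values()

record = {"title": "Public Toilets",
          "license": "https://creativecommons.org/licenses/by/4.0/"}
```

In practice the authoritative list of conformant licenses is maintained by the Open Definition, so a real checker would be driven by that list rather than a hand-written dict.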
Implementation
How To Open Up Data
This section forms the core of this handbook. It gives concrete, detailed advice on how data holders can open up data. We’ll go through the basics, but also cover the pitfalls. Lastly, we will discuss the more subtle issues that can arise.
There are three key rules we recommend following when opening up data:
- Keep it simple. Start out small, simple and fast. There is no requirement that every dataset must be made open right now. Starting out by opening up just one dataset, or even one part of a large dataset, is fine – of course, the more datasets you can open up the better.
Remember this is about innovation. Moving as rapidly as possible is good because it means you can build momentum and learn from experience – innovation is as much about failure as success and not every dataset will be useful.
- Engage early and engage often. Engage with actual and potential users and re-users of the data as early and as often as you can, be they citizens, businesses or developers. This will ensure that the next iteration of your service is as relevant as it can be.
It is essential to bear in mind that much of the data will not reach ultimate users directly, but rather via ‘infomediaries’. These are the people who take the data and transform or remix it for presentation. For example, most of us don’t want or need a large database of GPS coordinates; we would much prefer a map. Thus, engage with infomediaries first – they will reuse and repurpose the material.
- Address common fears and misunderstandings. This is especially important if you are working with or within large institutions such as the government. When opening up data you will encounter plenty of questions and fears. It is important to (a) identify the most important ones and (b) address them at as early a stage as possible.
The Process To Open Data
There are four main steps in making data open, each of which will be covered in detail below. These are in very approximate order - many of the steps can be done simultaneously.
- Choose your dataset(s). Choose the dataset(s) you plan to make open. Keep in mind that you can (and may need to) return to this step if you encounter problems at a later stage.
- Apply an open license.
  - Determine what intellectual property rights exist in the data.
  - Apply a suitable ‘open’ license that licenses all of these rights and supports the definition of openness discussed in the section above on ‘What Is Open Data’.
  - NB: if you can’t do this, go back to step 1 and try a different dataset.
- Make the data available - in bulk and in a useful format. You may also wish to consider alternative ways of making it available, such as via an API.
- Make it discoverable - post on the web and perhaps organize a central catalog to list your open datasets.
Choosing A Dataset
Choosing the dataset(s) you plan to make open is the first step – though remember that the whole process of opening up data is iterative and you can return to this step if you encounter problems later on.
If you already know exactly what dataset(s) you plan to open up you can move straight on to the next section. However, in many cases, especially for large institutions, choosing which datasets to focus on is a challenge. How should one proceed in this case?
Start by drafting a list of candidate datasets. Creating this list should be a quick process that identifies which datasets could be made open to start with; there will be time at later stages to check in detail whether each dataset is suitable.
There is no requirement to create a comprehensive list of your datasets. The main point to bear in mind is whether it is feasible to publish the data at all (whether openly or otherwise). Below are a few guiding questions that might help:
- What data does your department use for internal performance and trend analysis?
- What data populates your monthly or quarterly reports?
- What information is published as a performance metric (e.g. on performance.seattle.gov)?
- What information do you report to local, provincial, or federal agencies?
- What information do you share with other government departments?
- What information do you share with external partners?
- What information is repeatedly requested by the public, via the public disclosure process and/or open data requests?
- What kinds of open data are your peer agencies across the country publishing?
Involve The Community
We recommend that you ask the community in the first instance: that is, the people who will be accessing and using the data, as they are likely to have a good understanding of which data could be valuable.
- Prepare a short list of potential datasets that you would like feedback on. It is not essential that this list matches your expectations; the main intention is to get a feel for demand. It could be based on other countries’ open data catalogs.
- Create a request for comment.
- Publicise your request with a web page. Make sure that it is possible to access the request through its own URL. That way, when shared via social media, the request can be easily found.
- Provide easy ways to submit responses. Avoid requiring registration, as it reduces the number of responses.
- Circulate the request to relevant mailing lists, forums and individuals, pointing back to the main webpage.
- Run a consultation event. Make sure you run it at a convenient time when the average business person, data wrangler and official can attend.
- Ask a politician to speak on your agency’s behalf. Open data is very likely to be part of a wider policy of increasing access to government information.
Cost Basis
How much money do agencies spend on the collection and maintenance of data that they hold? If they spend a great deal on a particular set of data, then it is highly likely that others would like to access it.
This argument may be fairly susceptible to concerns of free-riding. The question you will need to respond to is: “Why should other people get for free information that was so expensive to collect?” The answer is that the expense is absorbed by the public sector in order to perform a particular function; the cost of sending that data, once collected, to a third party is approximately nothing. Therefore, third parties should be charged nothing.
Ease Of Release
Sometimes, rather than deciding which data would be most valuable, it could be useful to take a look at which data is easiest to get into the public’s hands. Small, easy releases can act as the catalyst for larger behavioural change within organisations.
Be careful with this approach however. It may be the case that these small releases are of so little value that nothing is built from them. If this occurs, faith in the entire project could be undermined.
Observe Peers
Open data is a growing movement, and peer agencies elsewhere are likely already publishing data. Find out what they are releasing and formulate a list on that basis.
Template
The template for the common inventory is currently available as an Excel spreadsheet. It includes fields in three categories:
Basic Information
At-a-glance Assessment
- Quality
- Privacy
- Security
- Sensitivity
- Performance
- Priority
- Notes
Post Publication Tracking
- Date published
- Update period
- Date last updated
- Date of next update
Definitions
Here are the definitions as included in the template:
Quality
- Good: This dataset is used routinely, is managed in a way that makes it legible to users besides the owner, and is reasonably accurate
- Acceptable: This dataset is used routinely but not necessarily legible to non-owners and/or has some gaps or discrepancies that limit its usability
- Poor: This dataset exists but is not regularly used, updated, or otherwise managed in a way that makes it valuable
Privacy
This is a flag only; datasets will be reviewed for privacy prior to publication. Fields to flag include, but are not limited to:
- Name and initials in any combination
- Identification number (e.g., CNIC)
- Age
- Gender
- Home address
- Home telephone number
- Personal cellular, mobile or wireless number
- Personal e-mail address
- Drivers’ license number
- Information on medical or health conditions
- Financial information (credit card, billing info, account info)
- Health information
- Marital status
- Nationality
- Sexual behavior or sexual preference
- Physical characteristics
- Racial or ethnic origin
- Religious, philosophical or political beliefs
- Biometric data
- Household information
- Consumer purchase or billing history
- Unique device identifiers (IP/ MAC addresses)
- Location (e.g., GPS) info (including that provided by mobile devices)
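Before publication, flagged fields like those above can be removed or pseudonymised. The sketch below illustrates one possible approach; which columns to drop versus hash is a policy decision shown here as an assumption, and note that hashing alone does not guarantee full anonymisation:

```python
import hashlib

# Columns treated as personally identifiable, drawn from the flag list
# above; the split between dropping and hashing is an illustrative choice.
DROP_COLUMNS = {"name", "home_address", "cnic"}
HASH_COLUMNS = {"email"}

def redact(row: dict) -> dict:
    """Remove or pseudonymise PII fields before a dataset is published."""
    out = {}
    for key, value in row.items():
        if key in DROP_COLUMNS:
            continue  # drop the field outright
        if key in HASH_COLUMNS:
            # One-way hash: lets records be joined without exposing the value.
            out[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            out[key] = value
    return out

raw = {"name": "A. Khan", "cnic": "35202-1234567-1",
       "email": "a@example.com", "district": "Lahore"}
safe = redact(raw)
```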
Security
Public
Public information can be or currently is released to the public. It does not need protection from unauthorized disclosure, but does need integrity and availability protection controls. This would include general public information, published reference documents (within copyright restrictions), open source materials, approved promotional information and press releases.
Examples:
- Information provided on government websites
- Information for public distribution (e.g. budget documents after public release)
- GIS maps
- Meeting agendas and minutes
Sensitive
Sensitive information may not be specifically protected from disclosure by law and is for official use only. It is generally not released to the public unless specifically requested. Although most of this information is subject to disclosure laws, it still requires careful management and protection to preserve the integrity of business operations and to meet compliance requirements. It also includes data associated with internal email systems and user account activity information.
Examples:
- Work phone numbers
- Organizational charts
- Interdepartmental documents
- Policies, procedures, and standards
Confidential
Confidential information is information that is specifically protected, in whole or in part, from disclosure under the laws of the jurisdiction.
Examples:
- Personally Identifiable Information
- Information concerning employee personnel records
- Information regarding IT infrastructure and security of computer and telecommunications systems, information security plans
- Information related to law enforcement (e.g. witness protection information)
- Information related to minors (e.g. adoption and foster records)
Sensitivity
- High: This information is considered sensitive by stakeholders or leadership, such that it would likely require outreach to those groups prior to publication (e.g. labor, vendors, impacted populations)
- Medium: This information has an impact on stakeholders that should be taken into account, but is unlikely to be disruptive to ongoing processes
- Low: This information is already public in some form or does not contain data that would be surprising to stakeholders
Priority
- High: There is demand for this dataset by the department and/or the public, or it is relevant to high-profile work or objectives.
- Medium: The dataset should be published but is not an immediate priority
- Low: There is no apparent demand yet for this dataset
Apply an Open License
In most jurisdictions there are intellectual property rights in data that prevent third parties from using, reusing and redistributing it without explicit permission. Even in places where the existence of rights is uncertain, it is important to apply a license simply for the sake of clarity. Thus, if you are planning to make your data available you should put a license on it – and if you want your data to be open this is even more important.
What licenses can you use? We recommend that for ‘open’ data you use one of the licenses conformant with the Open Definition and marked as suitable for data. This list (along with instructions for usage) can be found at:
A short 1-page instruction guide to applying an open data license can be found on the Open Data Commons site:
Another option is to consult summaries of these licenses, written in plain and simple English:
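In practice, the chosen license is declared in the dataset's metadata so that both humans and machines can check the reuse terms. The following is a minimal sketch in Python; the dataset title and URL are hypothetical, while the license identifiers refer to real Open Definition-conformant licenses (e.g. ODC-BY, the Open Data Commons Attribution License).

```python
import json

# Hypothetical dataset record; only the license identifiers are real
# (ODC-BY is the Open Data Commons Attribution License).
dataset = {
    "title": "Street Light Locations",                  # hypothetical dataset
    "url": "https://example.gov.pk/streetlights.csv",   # hypothetical URL
    "license_id": "ODC-BY",
    "license_title": "Open Data Commons Attribution License",
}

def is_openly_licensed(record, open_ids=("ODC-BY", "ODC-ODbL", "CC-BY", "CC0")):
    """Check a metadata record against a short list of open license identifiers."""
    return record.get("license_id") in open_ids

print(json.dumps(dataset, indent=2))
print(is_openly_licensed(dataset))  # True
```

Declaring the license as a machine-readable field, rather than burying it in prose, lets catalog software and re-users verify openness automatically.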
Make Data Available (Technical Openness)
Open data needs to be technically open as well as legally open. Specifically, the data needs to be available in bulk in a machine-readable format.
Available
Data should be priced at no more than a reasonable cost of reproduction, preferably as a free download from the Internet. This pricing model works because making data available for download should impose little or no additional cost on your agency.
In Bulk
The data should be available as a complete set. If you have a register which is collected under statute, the entire register should be available for download. A web API or similar service may also be very useful, but it is not a substitute for bulk access.
In An Open, Machine-readable Format
Re-use of data held by the public sector should not be subject to patent restrictions. More importantly, making sure that you are providing machine-readable formats allows for greatest re-use. To illustrate this, consider statistics published as PDF (Portable Document Format) documents, often used for high quality printing. While these statistics can be read by humans, they are very hard for a computer to use. This greatly limits the ability for others to re-use that data.
Here are a few policies that will be of great benefit:
- Keep it simple,
- Move fast
- Be pragmatic.
In particular it is better to give out raw data now than perfect data in six months’ time.
There are many different ways to make data available to others. The most natural in the Internet age is online publication. There are many variations to this model. At its most basic, agencies make their data available via their websites and a central catalog directs visitors to the appropriate source. However, there are alternatives.
When connectivity is limited or the size of the data extremely large, distribution via other media can be warranted. This section also discusses alternatives which can keep prices very low.
Online Methods
Via Your Existing Website
The method which will be most familiar to your web content team is to provide files for download from web pages. Just as you currently provide access to discussion documents, data files can be made available in the same way.
One drawback of this approach is that it can be very difficult for an outsider to discover where to find updated information, which places some burden on the people building tools with your data.
Via 3rd Party Sites
Many repositories have become hubs of data in particular fields. For example, pachube.com is designed to connect people with sensors to those who wish to access data from them. Sites like Infochimps.com and Talis.com allow public sector agencies to store massive quantities of data for free.
Third party sites can be very useful. The main reason for this is that they have already pooled together a community of interested people and other sets of data. When your data is part of these platforms, a type of positive compound interest is created.
Wholesale data platforms already provide the infrastructure which can support the demand. They often provide analytics and usage information. For public sector agencies, they are generally free.
These platforms can have two costs. The first is independence. Your agency needs to be able to yield control to others. This is often politically, legally or operationally difficult. The second cost may be openness. Ensure that your data platform is agnostic about who can access it. Software developers and scientists use many operating systems, from smartphones to supercomputers. They should all be able to access the data.
Via FTP servers
A less fashionable method for providing access to files is the File Transfer Protocol (FTP). This may be suitable if your audience is technical, such as software developers and scientists. FTP predates HTTP and, unlike the web, is designed specifically for transferring files.
FTP has fallen out of favour, however. Browsing an FTP server feels like looking through folders on a computer rather than visiting a website, so even though it is fit for purpose, there is far less scope for web development firms to charge for customisation.
As Torrents
BitTorrent is a system which has become familiar to policy makers because of its association with copyright infringement. BitTorrent uses files called torrents, which work by splitting the cost of distributing files between all of the people accessing those files. Instead of servers becoming overloaded, supply increases as demand increases. This is why the system is so successful for sharing movies: it is a wonderfully efficient way to distribute very large volumes of data.
As an API
Data can be published via an Application Programming Interface (API). These interfaces have become very popular. They allow programmers to select specific portions of the data, rather than providing all of the data in bulk as a large file. APIs are typically connected to a database which is being updated in real-time. This means that making information available via an API can ensure that it is up to date.
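The contrast between the two access models can be sketched in a few lines: an API answers narrow queries against a live store, while bulk access hands over the whole set. A simplified illustration in Python, with the records and query field entirely hypothetical:

```python
# Hypothetical in-memory "database" standing in for a live data store.
records = [
    {"id": 1, "city": "Lahore", "pm25": 180},
    {"id": 2, "city": "Karachi", "pm25": 95},
    {"id": 3, "city": "Lahore", "pm25": 160},
]

def api_query(city):
    """API-style access: the caller selects a specific slice of the data."""
    return [r for r in records if r["city"] == city]

def bulk_download():
    """Bulk access: the complete dataset, suitable for archiving and re-use."""
    return list(records)

print(len(api_query("Lahore")))  # 2
print(len(bulk_download()))      # 3
```

An API is convenient for narrow questions, but only the bulk copy lets a re-user archive, transform, or redistribute the data independently of the provider.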
Publishing raw data in bulk should be the primary concern of all open data initiatives. There are a number of costs to providing an API:
- The price. They require much more development and maintenance than providing files.
- The expectations. In order to foster a community of users behind the system, it is important to provide certainty. When things go wrong, you will be expected to incur the costs of fixing them.
Access to bulk data ensures that:
- There is no dependency on the original provider of the data, meaning that if a restructure or budget cycle changes the situation, the data are still available.
- Anyone else can obtain a copy and redistribute it. This reduces the cost of distribution away from the source agency and means that there is no single point of failure.
- Others can develop their own services using the data, because they have certainty that the data will not be taken away from them.
Providing data in bulk allows others to use the data beyond its original purposes. For example, it allows it to be converted into a new format, linked with other resources, or versioned and archived in multiple places. While the latest version of the data may be made available via an API, raw data should be made available in bulk at regular intervals.
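Publishing raw bulk snapshots at regular intervals can be as simple as writing date-stamped files alongside the live API. A sketch of such a naming convention using Python's standard library; the directory, dataset name, and figures are hypothetical:

```python
import datetime
import pathlib
import tempfile

# Hypothetical export directory; a real deployment would use a web-served path.
outdir = pathlib.Path(tempfile.mkdtemp())

def write_snapshot(name, content, day):
    """Write a date-stamped bulk snapshot so older versions stay retrievable."""
    path = outdir / f"{name}-{day.isoformat()}.csv"
    path.write_text(content)
    return path

# Two daily snapshots of the same (invented) dataset:
write_snapshot("air-quality", "city,pm25\nLahore,180\n", datetime.date(2024, 1, 1))
write_snapshot("air-quality", "city,pm25\nLahore,160\n", datetime.date(2024, 1, 2))

snapshots = sorted(p.name for p in outdir.glob("air-quality-*.csv"))
print(snapshots)
```

Because the date is part of the filename, snapshots sort chronologically and old versions remain downloadable even as the live data changes.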
For example, the Eurostat statistical service has a bulk download facility offering over 4000 data files. It is updated twice a day, offers data in Tab-separated values (TSV) format, and includes documentation about the download facility as well as about the data files.
Another example is the District of Columbia Data Catalog, which allows data to be downloaded in CSV and XLS format in addition to live feeds of the data.
Make Data Discoverable
Open data is nothing without users. You need to be able to make sure that people can find the source material. This section will cover different approaches.
The most important thing is to provide a neutral space which can overcome both inter-agency politics and future budget cycles. Jurisdictional borders, whether sectorial or geographical, can make cooperation difficult. However, there are significant benefits in joining forces. The easier it is for outsiders to discover data, the faster new and useful tools will be built.
Existing Tools
There are a number of tools that are live on the web that are specifically designed to make data more discoverable.
One of the most prominent is the DataHub, a catalog and data store for datasets from around the world. The site makes it easy for individuals and organizations to publish material and for data users to find the material they need.
In addition, there are dozens of specialist catalogs for different sectors and places. Many scientific communities have created a catalog system for their fields, as data are often required for publication.
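Catalogs like these work because each dataset carries consistent, machine-readable metadata. Below is a simplified sketch of a catalog entry in the spirit of the DCAT vocabulary; all field values are hypothetical, and real catalogs use considerably richer schemas:

```python
# A simplified catalog entry loosely following DCAT-style fields.
# All values are hypothetical illustrations.
catalog = {
    "dataset": [
        {
            "title": "Public Transport Stops",
            "description": "Locations of bus stops (illustrative entry).",
            "keyword": ["transport", "geo"],
            "distribution": [
                {"format": "CSV", "downloadURL": "https://example.gov.pk/stops.csv"}
            ],
        }
    ]
}

def find_by_keyword(cat, kw):
    """Simple discovery: return titles of datasets tagged with a keyword."""
    return [d["title"] for d in cat["dataset"] if kw in d.get("keyword", [])]

print(find_by_keyword(catalog, "transport"))  # ['Public Transport Stops']
```

The point is that discovery is a metadata problem: once every dataset is described in a common structure, search, harvesting, and federation across catalogs become straightforward.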
For Government
Open Data Champions
Each department’s open data champion is responsible for managing their department’s participation in the program. They work with leadership to set priorities, oversee the publication and ongoing management of datasets for their department, and participate in ongoing activities with the network of champions. They also keep the open dataset inventory up-to-date and contribute to annual reports and plans.
Data Owners
Data owners across City government contribute datasets to the inventory and to the portal.
Department Leadership
Department directors and managers guide their department’s participation in the open data program, setting open data-related performance goals for their teams, updating policies and procedures to reflect the open data policy, and making sure their staff have the time and resources to participate in the program. As needed, they work with open data champions to set priorities, engage stakeholders, and work through any sensitivities in their datasets prior to publication. They consult with open data champions at the start of new projects and software implementation to ensure that they facilitate compliance with the open data policy.
The orthodox practice that has emerged is for a lead agency to create a catalog for the government’s data. When establishing a catalog, try to create a structure which allows many departments to easily keep their own information current.
Resist the urge to build the software to support the catalog from scratch. There are free and open source software solutions (such as CKAN) which have been adopted by many governments already. As such, investing in another platform may not be needed.
There are a few things that most open data catalogs miss. Your programme could consider the following:
- Providing an avenue to allow the private and community sectors to add their data. It may be worthwhile to think of the catalog as the region’s catalog, rather than the regional government’s.
- Facilitating improvement of the data by allowing derivatives of datasets to be cataloged. For example, someone may geocode addresses and may wish to share those results with everybody. If you only allow single versions of datasets, these improvements remain hidden.
- Be tolerant of your data appearing elsewhere. That is, content is likely to be duplicated to communities of interest. If you have river level monitoring data available, then your data may appear in a catalog for hydrologists.
- Ensure that access is equitable. Try to avoid creating a privileged level of access for officials or tenured researchers as this will undermine community participation and engagement.
For Civil Society
Be willing to create a supplementary catalog for non-official data.
It is very rare for governments to associate with unofficial or non-authoritative sources. Officials have often gone to great expense to ensure that there will not be political embarrassment or other harm caused by misuse of, or over-reliance on, data.
Moreover, governments are unlikely to be willing to support activities that mesh their information with information from businesses. Governments are rightfully skeptical of profit motives. Therefore, an independent catalog for community groups, businesses and others may be warranted.
What’s Next?
We’ve looked at how to make government information legally and technically reusable. The next step is to encourage others to make use of that data.
This section looks at additional things which can be done to promote data re-use.
Tell The World!
First and foremost, make sure that you promote the fact that you’ve embarked on a campaign to promote open data in your area of responsibility.
If you open up a bunch of datasets, it’s definitely worth spending a bit of time to make sure that people know (or at least can find out) that you’ve done so.
In addition to things like press releases, announcements on your website, and so on, you may consider:
- Contacting prominent organisations or individuals who work/are interested in this area
- Contacting relevant mailing lists or social networking groups
- Directly contacting prospective users who you know may be interested in this data
Understanding Your Audience
Like all public communication, engaging with the data community needs to be targeted. As with any stakeholder group, the right message can be wasted if it is directed at the wrong audience.
Digital communities tend to be very willing to share new information, yet they very rapidly consume it. Write as if your messages will be skimmed over, rather than critically examined in-depth.
Members of the tech community are less likely than the general public to use MS Windows. This means you should avoid publishing documents solely in MS Office formats. There are two reasons for this:
- The first is that those documents will be less accessible. Rather than the document you see on your screen, readers using other software may see an imperfect rendering.
- Secondly, your agency sends an implicit message that you are unwilling to take a step towards developers. Instead, you show that you are expecting the technology community to come to you.
Post Your Material On Third-party Sites
Many blogs have created a large readership in specialised topic areas. It may be worthwhile adding an article about your initiative on their site. These can be mutually beneficial. You receive more interest and they receive a free blog post in their topic area.
Making Your Communications More Social-media Friendly
It’s unrealistic to expect that officials should spend long periods of time engaging with social media. However, there are several things that you can do to make sure that your content can be easily shared between technical users. Some tips:
- Provide unique pages for each piece of content. When a message is shared with others, the recipient of the referral will want to find the relevant content quickly.
- Avoid making people download your press releases. Press releases are fine: they are concise messages about a particular point. However, if you require people to download the content and open it outside a web browser, fewer people will read it, search engines are less likely to index it, and people are less likely to click through.
- Consider using an open license for your content. Apart from assuring people who wish to share your content that this is permissible, you send a message that your agency understands openness. This is bound to leave a far more significant impression on proponents of open data than any specific sentence in your press release.
Social Media
It’s inefficient for cash-strapped agencies to spend hours on social media sites. The most significant way to make your voice heard in these fora is to ensure that blog posts are easily shareable, so the advice in the previous section matters here too. With that in mind, here are a few suggestions:
- Discussion fora. Twitter has emerged as the platform of choice for disseminating information rapidly; anything tagged with #opendata will be immediately seen by thousands. LinkedIn has a large selection of groups targeted towards open data. While Facebook is excellent for a general audience, it has not received a great deal of attention in the open data community.
- Link aggregators. Submit your content to the equivalent of newswires for geeks. Reddit and Hacker News are the two biggest in this arena at the moment; to a lesser extent, Slashdot and Digg are also useful. These sites have a tendency to drive significant traffic to interesting material, and they are heavily focused on topic areas.
Getting Folks In A Room: Unconferences And Meetups
Face-to-face events can be a very effective way to encourage others to use your data. Reasons that you may consider putting on an event include:
- Finding out more about prospective re-users
- Finding out more about demand for different datasets
- Finding out more about how people want to re-use your data
- Enabling prospective re-users to find out more about what data you have
- Enabling prospective users to meet each other (e.g. so they can collaborate)
- Exposing your data to a wider audience (e.g. from blog posts or media coverage that the event may help to generate)
There are also lots of different ways of running events, and different types of events, depending on what aim you want to achieve. As well as more traditional conference models, which will include things like pre-prepared formal talks, presentations and demonstrations, there are also various kinds of participant driven events, where those who turn up may:
- Guide or define the agenda for the event
- Introduce themselves, talk about what they’re interested in and what they’re working on, on an ad hoc basis
- Give impromptu micro-short presentations on something they are working on
- Lead sessions on something they are interested in
There is plenty of documentation online about how to run these kinds of events, which you can find by searching for things like: ‘unconference’, ‘meetup’, ‘speedgeek’, ‘lightning talk’, and so on. You may also find it worthwhile to contact people who have run these kinds of events in other countries, who will most likely be keen to help you out and to advise you on your event. It may be valuable to partner with another organisation (e.g. a civic society organisation, a news organisation or an educational institution) to broaden your base of participants and to increase your exposure.
Making Things! Hackdays, Prizes And Prototypes
A popular way to stimulate re-use is to run an application competition. The structure of these competitions is that a number of datasets are released and programmers then have a short time-frame, running from as little as 48 hours to a few weeks, to develop applications using the data. A prize is then awarded to the best application. Competitions have been held in a number of countries including the UK, the US, Norway, Australia, Spain, Denmark and Finland.
Examples For Competitions
Show us a better way was the first such competition in the world. It was initiated by the UK Government’s “The Power of Information Taskforce” headed by Cabinet Office Minister Tom Watson in March 2008. This competition asked “What would you create with public information?” and was open to programmers from around the world, with a tempting £80,000 prize for the five best applications.
Apps for Democracy, one of the first competitions in the United States, was launched in October 2008 by Vivek Kundra, at the time Chief Technology Officer (CTO) of the District of Columbia (DC) Government. Kundra had developed the groundbreaking DC data catalog, http://data.octo.dc.gov/, which included datasets such as real-time crime feeds, school test scores, and poverty indicators. It was at the time the most comprehensive local data catalog in the world. The challenge was to make it useful for citizens, visitors, businesses and government agencies of Washington, DC.
The creative solution was to create the Apps for Democracy contest. The strategy was to ask people to build applications using the data from the freshly launched data catalog. It included an online submission for applications, many small prizes rather than a few large ones, and several different categories as well as a “People’s Choice” prize. The competition was open for 30 days and cost the DC government $50,000. In return, a total of 47 iPhone, Facebook and web applications were developed with an estimated value in excess of $2,600,000 for the local economy.
The Abre Datos (Open Data) Challenge 2010. Held in Spain in April 2010, this contest invited developers to create open source applications making use of public data in just 48 hours. The competition had 29 teams of participants who developed applications that included a mobile phone programme for accessing traffic information in the Basque Country, and for accessing data on buses and bus stops in Madrid, which won the first and second prizes of €3,000 and €2,000 respectively.
Nettskap 2.0. In April 2010 the Norwegian Ministry for Government Administration held “Nettskap 2.0”. Norwegian developers – companies, public agencies or individuals – were challenged to come up with web-based project ideas in the areas of service development, efficient work processes, and increased democratic participation. The use of government data was explicitly encouraged. Though the application deadline was just a month later, on May 9, the Minister Rigmor Aasrud said the response was “overwhelming”. In total 137 applications were received, no less than 90 of which built on the re-use of government data. A total amount of NOK 2.5 million was distributed among the 17 winners; while the total amount applied for by the 137 applications was NOK 28.4 million.
Mashup Australia. The Australian Government 2.0 Taskforce invited citizens to show why open access to Australian government information would be positive for the country’s economy and social development. The contest ran from October 7th to November 13th 2009. The Taskforce released some datasets under an open license and in a range of reusable formats. The 82 applications that were entered into the contest are further evidence of the new and innovative applications which can result from releasing government data on open terms. Now GovHack runs in multiple locations across Australia and New Zealand each year.
Conferences and Hackdays
One of the more effective ways for Civil Society Organisations (CSOs) to demonstrate to governments the value of opening up their datasets is to show the multiple ways in which the information can be managed to achieve social and economic benefit. CSOs that promote re-use have been instrumental in countries where there have been advances in policy and law to ensure that datasets are both technically and legally open.
The typical activities which are undertaken as part of these initiatives normally include competitions, open government data conferences, “unconferences”, workshops and “hack days”. These activities are often organised by the user community with data that has already been published proactively or obtained using access to information requests. In other cases, civil society advocates have worked with progressive public officials to secure new releases of datasets that can be used by programmers to create innovative applications.
Tools and Technologies:
This section is a list of ready-to-use solutions and tools that will help those interested in jump-starting their open data efforts. These are real, implementable, coded solutions that were developed to significantly reduce the barrier to implementing open data initiatives.
Commonly used Open Data Platforms:
CKAN
CKAN is an open-source data catalog formally supported by the Open Knowledge Foundation, and can be installed on any Linux server, including cloud-hosted configurations. The Open Knowledge Foundation also offers hosting services for a monthly fee. CKAN is written in the Python programming language and designed for publishing and managing data either through a user interface or an API.
CKAN has a modular architecture through which additional or custom features may be added. For example, the DDI Importer extension (sponsored by the World Bank) provides support for the DDI metadata standard, including harvesting of metadata from microdata catalogs.
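CKAN exposes a JSON-over-HTTP "action" API; for example, the package_search action lists datasets matching a query. The sketch below builds a request URL and parses a response of the documented shape without making a live call. The host name is hypothetical; the endpoint path and response structure follow CKAN's API v3.

```python
import json
import urllib.parse

def package_search_url(base, query):
    """Build a CKAN API v3 package_search URL (the base host is hypothetical)."""
    return base + "/api/3/action/package_search?" + urllib.parse.urlencode({"q": query})

url = package_search_url("https://data.example.gov.pk", "air quality")
print(url)

# A response of the shape CKAN documents for package_search
# (the dataset names here are invented):
sample_response = json.loads("""
{"success": true,
 "result": {"count": 1,
            "results": [{"name": "air-quality-lahore", "title": "Air Quality, Lahore"}]}}
""")

names = []
if sample_response["success"]:
    names = [d["name"] for d in sample_response["result"]["results"]]
print(names)
```

Because the API is plain HTTP and JSON, any harvesting tool or script can enumerate a CKAN catalog this way, which is what makes federation between government portals practical.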
DKAN
DKAN is designed to be “feature compatible” with CKAN. This means that its underlying API is identical, so systems designed to be compatible with CKAN’s API should work equally well with DKAN. DKAN is also open source, but it is based on Drupal, a popular content management system written in PHP instead of Python. This may be more appealing to organizations that have already invested in Drupal-based websites. Drupal has its own modular architecture with thousands of modules available for download. It also has an option to customize modules and a large developer community.
Junar
Junar is a cloud-based SaaS Open Data platform, so data is typically managed within Junar’s infrastructure (the “all-in-one” model). Junar can provide either a complete data catalog or data via an API to a separate user catalog.
OpenDataSoft
OpenDataSoft is a cloud-based SaaS platform that offers a comprehensive suite of Open Data and visualization tools. The front end is fully open source. The platform supports common Open Data formats such as CSV, JSON and XML, along with geospatial formats such as KML, OSM and SHP. Search functionality is easy to use and the platform is available in multiple languages.
World Bank partners can freely access a version of OpenDataSoft here.
Semantic Media Wiki
Semantic MediaWiki is an extension of MediaWiki – the wiki application best known for powering Wikipedia. While traditional wikis contain only text, Semantic MediaWiki adds semantic annotations that allow a wiki to function as a collaborative database and data catalog. Semantic MediaWiki is an RDF implementation, meaning that both data and metadata are stored as linked data and are accessible via linked data interfaces such as SPARQL.
Socrata
Socrata is a cloud-based SaaS Open Data catalog platform that provides API, catalog and data manipulation tools. One distinguishing feature of Socrata is that it allows users to create views and visualizations based on published data and save them for others to use. Additionally, Socrata offers an open-source version of its API, intended to facilitate transitions for customers that decide to migrate away from the SaaS model.
Swirrl
Swirrl is a cloud-based SaaS Open Data platform built on linked data technologies (such as RDF and SPARQL) designed to achieve 100% compliance with the 5-star Open Data model. Swirrl, however, also makes data available via more conventional structures such as CSV.
Geospatial Data Platforms
ArcGIS Open Data
ArcGIS Open Data is a cloud-based SaaS platform where users can explore both spatial and non-spatial data in a consistent interface, allowing extraction of specific features and download in multiple open formats and APIs. It is included for free with ArcGIS Online, leverages ArcGIS services, and integrates with hundreds of open-source applications for mobile, web and desktop. ArcGIS Open Data uses Koop, an open-source ETL engine that automatically transforms web services into accessible formats.
GeoNode
GeoNode is an open source platform for developing geospatial information systems (GIS) and for deploying spatial data infrastructures. It is designed to be extended and modified, and can be integrated into existing platforms.
Examples
- UN World Food Programme
- Haiti Data
- GeoSINAGER (Bolivia)
Additional Reading
These links provide more information and background on technology options.
- Technology Options for Open Government Data Platforms (World Bank, January, 2014). This white paper discusses characteristics of several products and services provided by different organizations.
- Technical Assessment of Open Data Platforms for National Statistical Organisations (World Bank, October, 2014). This research report is intended to provide a better understanding and assessment of the technical issues related to data dissemination tools that NSOs use to distribute data to the public under an Open Data initiative.
- Open Data Checklist. This checklist of Open Data best practices provides a good reference for the typical requirements of an Open Data platform.
- Open Data portal requirements (Center for Government Excellence). This document contains a set of sample requirements to help governments evaluate, develop (or procure), deploy, and launch an open data web site (portal).
- ODI Presentation: How to choose the right Open Data platform for you (Open Data Institute, 2014). This slide deck presentation gives a thorough overview of the key considerations in choosing an Open Data platform, and includes a brief overview of several of the most prominent products.
- Recommendations for Open Data portals: from setup to sustainability. This report sets out how portals can move from setup to sustainability, with recommendations in the areas of governance, financing, architecture, operations, and metrics.
Case Studies/Use Cases
- https://strategy.data.gov/use-cases/
- Open 311 (all non-emergency issue reporting) Sample use cases: FixMyStreet UK, SeeClickFix, Chicago Works
- Open Referral (community resources and social services) Sample use cases: Purple Binder and mRelief
- Open Trails (public trails and related geographic data) Sample use cases: Totago
- Housefacts (residential buildings) Sample use case: Trulia
- Building & Land Development Specification (BLDS) (commercial buildings and permits) Sample use case: Seattle in Progress
- Open Contracting (public contracting) Sample use case: dgMarket
- General Transit Feed Specification (GTFS) (transit) Sample use cases: OneBusAway, TransitApp
- State Decoded (laws, codes, and statutes) Sample use case: LawHelp.org
- Federal Spending Transparency (budget information) Sample use case: USAspending.gov