Brands, journalists, and governments are all interacting with more data than ever before. What’s interesting about this recent trend is not just the opportunity to foster greater transparency in government and business, but also the chance to tell stories that no one else is telling.

For the uninitiated, data journalism can seem intimidating. Where do you even begin? As the founding data journalist at Percolate, I ask myself these 10 questions when approaching a data journalism project.

1) What is the story that we want to tell?

Good data journalism is more than slapping together a few charts and calling it a day. In fact, ideally data journalism doesn’t begin with the data. It begins with curiosity. How much money does a multi-channel marketing campaign for a Fortune 500 company actually cost? Is there a relationship between advertising spend and the success of a marketing campaign? Having a story arc in mind can keep you organized and focused on the right priorities as you start to dig in.

2) What datasets can we find?

Sometimes you start with the story. And sometimes you just come across an awesome dataset that no one has really touched yet. Looking at it, the whisper of a story begins to take hold. But where can you find data? Contrary to what many people think, a lot of great datasets are free and publicly available to use. For example, FiveThirtyEight, BuzzFeed, The Guardian Datablog, The Upshot, and The Huffington Post all have GitHub repos that you can access and use for your own journalism efforts.

3) What data do we have?

A good 50-70% of data journalism is just figuring out what stories the data in front of you can tell. It’s rare that the data is totally clean. These spreadsheets usually need to be tidied up – maybe you have extraneous columns, inconsistencies in the formatting of values, or new columns with new values that need to be added for grouping and analysis. Using tools like OpenRefine can make this easier, but sometimes you are just staring at an Excel spreadsheet and doing replacements. Find and replace is your friend.
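
The three tidy-up tasks described above (dropping extraneous columns, fixing inconsistent values, and adding a column for grouping) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names, the "$2.5M" shorthand, and the $1M threshold are all made up for the example, not taken from a real dataset.

```python
# Hypothetical rows, as they might look after a messy spreadsheet export.
raw_rows = [
    {"company": " Acme Corp. ", "spend": "$1,200,000", "notes": "n/a"},
    {"company": "acme corp", "spend": "950000", "notes": ""},
    {"company": "Globex", "spend": "$2.5M", "notes": "check"},
]


def clean_company(name):
    """Trim whitespace, drop a trailing period, and standardize the case."""
    return name.strip().rstrip(".").title()


def parse_spend(value):
    """Turn strings like '$1,200,000' or '$2.5M' into a number of dollars."""
    text = value.strip().replace("$", "").replace(",", "")
    multiplier = 1
    if text.upper().endswith("M"):
        multiplier = 1_000_000
        text = text[:-1]
    return float(text) * multiplier


def spend_band(dollars):
    """Assumed grouping rule: split rows at an arbitrary $1M threshold."""
    return "over $1M" if dollars > 1_000_000 else "$1M or less"


def clean_rows(rows):
    """Keep only the useful columns, normalize values, add a grouping column."""
    cleaned = []
    for row in rows:
        spend = parse_spend(row["spend"])
        cleaned.append({
            "company": clean_company(row["company"]),
            "spend": spend,
            "spend_band": spend_band(spend),
        })  # the extraneous "notes" column is deliberately left out
    return cleaned


cleaned_rows = clean_rows(raw_rows)
```

After cleaning, the two spellings of the same company collapse into one name, which is exactly the kind of inconsistency that otherwise splits a group in two during analysis.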

4) What data is missing?

At this point you probably have some idea of what you have in front of you. And it’s usually not the entire story you want to tell, or there is a richer story that can be told if you could just find some additional data. Having a larger story in mind when you begin data journalism is particularly helpful when considering what’s missing. Maybe you need to combine a dataset of initial public offering (IPO) data with Crunchbase data on funding rounds or investors to see if certain investors saw larger IPOs than others.

5) How can I combine these datasets?

Determining how to combine two different datasets is part art, part science. Combining datasets can feel a little bit like match-making. You may have two pieces of data that seem like they should go together at first blush, but in truth they just don’t match. A lot of the science here involves actually merging the two files and then cleaning the newly merged file. Knowing a programming language like R or Python can be valuable to finesse the data into an analyzable form. Even if you can’t program, tools like Data Wrangler or regular expressions can streamline this process.
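
To make the match-making concrete, here is a minimal sketch in Python of merging two files on a company name that is spelled differently in each. Everything in it is an assumption for illustration: the company names, the field names, the figures, and the list of suffixes stripped by the regular expressions. A real merge will need its own matching rules and a manual look at whatever fails to match.

```python
import re

# Hypothetical IPO and funding records with inconsistent company spellings.
ipos = [
    {"company": "Acme, Inc.", "ipo_raised": 120.0},
    {"company": "Globex Corp", "ipo_raised": 80.0},
    {"company": "Initech", "ipo_raised": 45.0},
]
funding = [
    {"company": "ACME Inc", "lead_investor": "Investor A"},
    {"company": "globex corp.", "lead_investor": "Investor B"},
]


def match_key(name):
    """Build a join key: lowercase, strip punctuation, drop common suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(inc|corp|llc|ltd)\b", "", key)
    return re.sub(r"\s+", " ", key).strip()


def merge_on_company(left, right):
    """Attach the investor to each IPO row; report names that found no match."""
    lookup = {match_key(row["company"]): row for row in right}
    merged, unmatched = [], []
    for row in left:
        other = lookup.get(match_key(row["company"]))
        if other is None:
            unmatched.append(row["company"])
        else:
            merged.append({**row, "lead_investor": other["lead_investor"]})
    return merged, unmatched


merged_rows, unmatched_names = merge_on_company(ipos, funding)
```

Returning the unmatched names alongside the merged rows is a deliberate choice: the rows that do not match are often where the cleaning work still left to do is hiding.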


What do data journalists do with data before you see it? Ex-Twitter, now Google data journalism editor Simon Rogers breaks the process down in this infographic from the Guardian Datablog.

6) What can we analyze?

You’ve done it. Row upon row of clean data, analyses just waiting to be run. As a data journalist, you are asking yourself: is there a story here? Sure, you began this whole process with something in mind, but at this point what you actually have may be a bit different from where you started. I like to think about the various ways I can group my data, and what those group differences say about the larger topic. For example, do LinkedIn influencers really write the top-performing posts on LinkedIn? Interestingly, the answer is no.
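
Grouping is simple to sketch in Python. The example below compares the median of a value across groups; the group labels and numbers are invented placeholders and say nothing about any real analysis, including the LinkedIn one mentioned above. The median is used here as an assumption, because a single viral outlier can distort an average.

```python
from collections import defaultdict
from statistics import median

# Made-up rows: a grouping column and a numeric value to compare.
rows = [
    {"group": "A", "views": 10},
    {"group": "A", "views": 30},
    {"group": "A", "views": 20},
    {"group": "B", "views": 5},
    {"group": "B", "views": 50},
]


def median_by_group(records, group_field, value_field):
    """Collect values per group, then return the median for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_field]].append(record[value_field])
    return {name: median(values) for name, values in groups.items()}


medians = median_by_group(rows, "group", "views")
```

Swapping the grouping column (by author type, by day of the week, by topic) and rerunning the same function is a quick way to try out several angles on the same data.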

7) Does our analysis make sense?

This is a sanity check. You should try to understand if your methodology and results make sense. The story that emerges may not be totally simple, but it should be accurate and thought through. One good way to test whether you really understand the results is to explain your findings to a friend or family member.
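
Part of the sanity check can be automated before you talk anyone through the findings. The sketch below, in Python, looks for three common red flags: a row count that changed unexpectedly, missing values, and negative values where none should exist. The function name, the field names, and the choice of checks are assumptions for illustration, and passing them does not prove the methodology is sound.

```python
def sanity_check(rows, value_field, expected_count):
    """Return a list of readable problems; an empty list means no red flags."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, found {len(rows)}")
    missing = [i for i, row in enumerate(rows) if row.get(value_field) is None]
    if missing:
        problems.append(f"missing {value_field} in rows {missing}")
    negatives = [
        i for i, row in enumerate(rows)
        if row.get(value_field) is not None and row[value_field] < 0
    ]
    if negatives:
        problems.append(f"negative {value_field} in rows {negatives}")
    return problems


# Hypothetical data with one missing and one negative value.
sample = [{"spend": 100.0}, {"spend": None}, {"spend": -20.0}]
issues = sanity_check(sample, "spend", expected_count=4)
```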

8) What’s the best approach to telling this story?

Edward Tufte, arguably the founder of modern data journalism, writes in his landmark reference book “The Visual Display of Quantitative Information” that “excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency.” Whatever your approach, your visualization should bring understanding and coherence to the data, avoid distortions or distractions, and finally reward careful study – allowing the viewer to think about the substance behind your work.

9) How can we engage our audience?

The best metric for quality data journalism is the extent to which people engage with what you’ve done. With social media, this is now happening faster than ever. Contributing to or creating conversations is why data journalism matters.

10) Can we make our data publicly available to use?

Statistics used to be something that would happen behind closed doors in government or business. Or, in the case of academia, you would need a doctorate to understand it. Not so anymore. Through the web we (and by “we” I mean all people) now have the tools and the training to crunch truly massive datasets. Open data means machine-readable data that anyone can use. The promise of open data is better information, greater accountability, and ultimately new and more interesting stories.