This article is based on my limited experience of data visualization, and I would really love to receive some feedback on it... Please don't hesitate to reach out to me if you have anything to say about it!
At qunb, we spent one year building a tool that tells meaningful stories from Google Analytics. These stories are actually a series of clear messages supported by relevant charts and visual representations.
With the exception of the books by Edward Tufte, I didn't have a lot of opportunities to learn about data visualization before joining qunb. Now that I'm a bit more familiar with this field, I have to say that I'm not really comfortable with the way most data visualizations (let's say dataviz for convenience) are created today. I often hear or read rules stating things like "continuous values should be displayed as a line chart and discrete values as a bar chart"...
That's fine, all these rules are very reasonable... but they're going beyond their role of simple common sense constraints, and they're overwhelming what's really at stake in a dataviz: the communication of an idea.
We build dataviz according to the nature of the data instead of the message that we want to communicate.
Consequently, a large part of the dataviz that we consume every day is really poorly designed and ends up being completely indigestible. These dataviz don't address a clear use case. They just follow existing patterns, because it's so easy to simply update data from an old Excel template, or because most of them are now automated, meaning that there is no real person trying to communicate a message (and we're still pretty bad at making programs that know what story to tell from a specific dataset).
So if we used a line chart to display our sales last year, why should we change this year? Even if the message that we want the user to receive is completely different today...
These dataviz are data driven. I'll try to explain why I think that's a bad approach, and how to switch to user driven representations.
Here are some data.
This dataset is expressed through numerical values in a table.
A numerical value is a great way to communicate quantities (doh!).
And a table is a good way to structure information.
In a way, this representation is already a kind of dataviz (in a very simple form).
But there is way more than structures and quantities in this dataset... there are also things that we can't see here (or barely see), things that lie between numbers and lines: relationships and patterns.
Dataviz is an abstraction of these patterns and relationships.
"Visual hardware" has high capacity. We can "sense" a lot of visual information.
But "cognitive bandwidth" is limited, and we can't really process all this visual information.
Selecting which information deserves attention can be tricky, and very often, the user is abandoned at this step.
In our table example, we can of course discover some patterns just by reading the numbers in the cells, but it's hard because it requires a lot of mental work to visualize these abstractions: "this number is greater than this one and this one and... and this one seems to be linked to this one, but that's not true for this dimension and...".
In the same way that numbers are symbols that wrap the complex abstractions that are quantities, we need something to help us to "preprocess" relevant information in this other kind of complex abstraction that is relationship between quantities. We need to be able to gain some knowledge from the data.
That's exactly the goal of a dataviz.
Actually, we can identify 2 ways to do that: data exploration and data explanation.
Data exploration is a process where the user pulls the knowledge from the representation.
Facilitating the extraction of a message is the primary use case.
Data explanation is a process where the representation pushes the knowledge to the user.
Delivering a message is the primary use case.
Neither of these 2 options is better than the other; it's just a matter of identifying which of them you're trying to address as your primary use case. Moreover, things are not just black or white, and we can find elements of both data exploration and data explanation in most representations. These two approaches are more like the two ends of a scale... but it's important to identify where a dataviz should stand on this scale, because it allows you to be bold, and thus clearer and more efficient.
Most of the dataviz that we use every day are about data exploration, not because this has been identified as the use case to follow, but simply because they are data driven, meaning that they have been designed according to the type of data that they need to show, and not according to the message that they want to communicate to the user. It's easier to fall on the data exploration side when you don't know the use case: just throw a bunch of data in the most common template, and let the user handle it.
Data explanation is almost always completely ignored.
Let's come back to our table example.
That's actually an interesting (and quite extreme) way to let the user explore a dataset. But as we just explained, data exploration is overused, and we would probably need to look in the direction of data explanation if we really had to design a representation to exploit this dataset.
Let's pretend that we have to send our client a report about their visits per country.
What representation would allow us to push a clear message to the user?
The common approach is to choose the representation depending on the type of the data. Here, we're dealing with locations, so we'll end up with a map.
Think about visits per country on Google Analytics, or any kind of geolocation based data... no matter what's interesting in the dataset, you'll often end up with a map.
But what kind of information does a map convey?
Well, like any kind of dataviz, it helps to see patterns. Location patterns of course. And a map is very bad for everything else.
So if we wanted to show a pattern about locations, like showing that visits increase as we get closer to Scandinavia for instance, a map would be great.
But let's be honest, that's not something that happens very often, yet we always use a map for location based data, no matter what message we want the user to get from it. Here, the data rules the representation.
This is a data driven approach.
If we come back to our "location based" dataset, we can notice a lot of interesting things to say. For instance, Germany seems to have performed very well compared to other countries in 2011.
That's not easy to see there:
That's not really better here:
A map is not a good representation to rank values.
So if we want to show that Germany performed very well in 2011, why not try to find a good representation to compare values?
Like a bar chart:
Here, the message that we're trying to convey to the user rules the representation.
This is a user driven approach.
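To make the difference concrete, here is a minimal sketch of a ranking oriented representation. The visit counts are made up for illustration (they are not the figures from the table above), and `ranked_bars` is just a hypothetical helper name. The idea is simply that once values are sorted and drawn as bars, the top performer is obvious at a glance.

```python
def ranked_bars(visits, width=30):
    """Return text bars, largest value first, scaled to `width` characters."""
    # Sort countries by value, descending, so the ranking is explicit.
    ranked = sorted(visits.items(), key=lambda item: item[1], reverse=True)
    top = ranked[0][1]
    lines = []
    for country, count in ranked:
        # Bar length is proportional to the value (at least one character).
        bar = "#" * max(1, round(count / top * width))
        lines.append(f"{country:<10} {bar} {count}")
    return lines


# Hypothetical visit counts, for illustration only.
visits_2011 = {"France": 1200, "Germany": 3400, "Spain": 900, "Italy": 1100}

for line in ranked_bars(visits_2011):
    print(line)
```

The sorting step is the part that carries the message: the same numbers placed on a map would keep their geographic order, which says nothing about who performed best.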
There are tons of ways to represent quantitative information... and people are really creative...
Yet, some common representations are always interesting. To help us to choose the right one or to create a new one, it's a good idea to try to define the type of message that we're trying to address.
Think about all these small questions that you ask yourself (sometimes unconsciously) when you're in front of a dataset: is this number greater than this one? And what about this one? Is this one linked to this one? What does this value represent? etc... they all help to highlight patterns and to shape your mental model of the dataset. These questions are your users' needs, and knowing about them, and thus how to help the user answer them, makes the design of a dataviz more efficient.
If you think about it, all these questions are always about comparison. Comparison is a kind of "primitive user need" when we design dataviz.
The goal is always to allow the user to compare values or series of values in order to:
Here again, things are not just black or white, and we'll often try to communicate several types of messages (and even several messages of the same type). We'll probably have to create more complex representations to do that (like composing several charts in one for instance). That's fine, we just have to keep in mind that the more messages we're trying to convey, the less understandable each of them will be.
Some rules of thumb can help to pick the right chart.
For instance, some charts use areas to convey more than one variable through a shape (using height to illustrate one variable, and width for another one for instance).
This kind of representation is incredibly hard to read, because it's almost impossible to compare areas.
But areas can be useful to illustrate the composition of an element because it's easier to imagine an element when it's pictured as a whole shape.
It's harder to compare each part of the whole, but it's easier to understand the abstraction of the whole, and the link between these parts.
That means that we can define 2 important properties that make a good representation:
Stephen Few wrote a very interesting paper about rules for encoding values in graphs. He talks about some kinds of basic visual entities that are used to encode information: points, lines, and bars.
These 3 visual elements are the most common ones, but if we step down one level of abstraction, we can try to define some properties that would help to understand how to convey information in any kind of visual representations.
Four visual properties seem interesting here: shape, color, dimension, and position.
Let's study all these properties through an example.
This chart is really interesting because it leverages all these visual properties to convey meaning.
The shape of a visual element is an amazing tool to create abstractions in many different ways.
Here, we can notice two interesting usages of shapes. First, this graph uses flowers to represent the number of deaths in wars. That's a pure iconic purpose, and that's the most important role of the shape: acting like a placeholder for a concept. That's exactly what happens when we draw countries on a map. The shape of each country helps the reader to create a match between the visual representation and the concept of "France" or "Spain".
We can find another interesting usage of shapes in the way lines are used to show the duration of the war. A line is a natural way to highlight a link between entities, and that's why we often use lines to illustrate evolutions (a link between two states in time). Here, lines are used to create a link between a starting date and an ending date.
Colors can be very hard to master. It's tempting to think that colors bring meaning by themselves, because we often associate a color with a feeling, or an object... but it's very easy to confuse a reader who is not able to understand this association. In those cases, colors can be harmful to readability.
In this graph, colors are used to create categories of elements. That's very efficient, because it's easy to notice color patterns, even when the elements are not close to each other.
But the most basic usage of color is probably to highlight elements. Again, color matching (and thus contrast) is amazingly powerful.
In this graph, the dimensions of the flowers are simply used to illustrate the number of deaths. That's the most basic usage of dimensions, exactly like when we encode a value in the height of the bars in a bar chart.
Finally, the position of the flowers and their lines tells us where we are in time. Again, that's a very common practice that we can see in most of the line charts.
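As a rough sketch of how these properties work together, the snippet below maps a record to a visual mark with a shape, a color, a size and a position. Everything in it is an assumption made for illustration: the `Mark` class, the `encode` function, the palette, the category names and the records are hypothetical and are not taken from the chart discussed above.

```python
from dataclasses import dataclass


@dataclass
class Mark:
    shape: str   # placeholder for a concept (here, a flower stands for a war)
    color: str   # category the element belongs to
    size: float  # dimension encoding a quantity
    x: float     # position encoding time


# Hypothetical palette: one color per category.
PALETTE = {"civil": "red", "international": "blue"}


def encode(record, max_deaths, max_size=40.0):
    """Map one data record to the four visual properties."""
    return Mark(
        shape="flower",
        color=PALETTE[record["category"]],
        # Size grows linearly with the quantity, capped at max_size.
        size=record["deaths"] / max_deaths * max_size,
        x=float(record["start_year"]),
    )


# Made-up records, for illustration only.
records = [
    {"name": "War A", "category": "civil", "deaths": 50_000, "start_year": 1950},
    {"name": "War B", "category": "international", "deaths": 200_000, "start_year": 1965},
]

max_deaths = max(r["deaths"] for r in records)
marks = [encode(r, max_deaths) for r in records]
for record, mark in zip(records, marks):
    print(record["name"], mark)
```

Writing the mapping down like this makes each design decision explicit: every visual property carries exactly one piece of information, and nothing is encoded by accident.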
When designing dataviz, identifying interesting messages in a dataset can be challenging.
One of the most famous examples of this issue is known as Anscombe's quartet.
This extreme example shows very well that extracting relevant facts from a dataset can be hard. Actually, the designer is in the shoes of a user alone in front of a raw dataset. This is a clear data exploration use case...
There is an interesting approach to adopt in order to frame a dataset and to reveal interesting patterns.
As described by Bret Victor in the article Up and Down the Ladder of Abstraction: "to understand a system, we must explore it".
In this article, Bret Victor explains that we can explore complex systems (like complex datasets) by constantly moving in the level of abstraction that we adopt to represent this system, in order to see and to manipulate different layers of concepts.
There are many ways to visually represent a dataset, and each of these representations stands on a specific level of abstraction: numbers are abstractions of quantities, mean and variance are abstractions of a series of numbers, a table is an abstraction of the structure of these numbers, a line chart is an abstraction of their evolution, a bar chart is an abstraction of their ranking... When we're trying to know what messages to get from a dataset, it's usually a good idea to explore the data through different representations, and thus different levels of abstraction, in order to reveal interesting patterns to address.
That's how you realise that a dataset like the one described in Anscombe's quartet is actually more complex than it looked at first sight.
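The quartet is easy to check by hand. The sketch below uses the values published by Anscombe and computes, for each of the four datasets, the summary statistics that sit at the "mean and variance" level of abstraction. The helper names `pearson` and `summarize` are mine, not part of any library.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """Summary statistics, rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),
        "mean_y": round(mean(ys), 2),
        "var_y": round(variance(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

At this level of abstraction the four datasets are nearly indistinguishable. It's only when you step down to the individual points, with a scatter plot for instance, that their very different shapes appear.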
Visual design is not the only way to convey information and you should use all the tools available on your medium to achieve this mission.
Animations for instance can be an amazing tool.
Let's take an example with this very bad representation.
Beyond the difficulty of comparing areas, the problem is that there are a lot of dimensions (a measure, some categories, some sub categories for each category...), and the relationship between each of them is not clear.
Let's see how animation could help us here with this very simple example:
By animating the way we show all the elements of this chart, we can help the user to understand the relationships between them: we start by showing main categories from the left to the right, and then we draw sub categories inside each category from the bottom to the top.
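The ordering described above can be sketched as a small function that produces the sequence of drawing steps. This is only an illustration under assumptions: `reveal_order` and the chart structure are hypothetical, and the lists are assumed to be already sorted left to right (categories) and bottom to top (sub categories).

```python
def reveal_order(chart):
    """Return the drawing steps: categories first, then their sub categories."""
    steps = []
    # First pass: main categories, in left-to-right order.
    for category in chart:
        steps.append(("category", category["name"]))
    # Second pass: sub categories inside each category, bottom to top.
    for category in chart:
        for sub in category["subs"]:
            steps.append(("sub", f'{category["name"]}/{sub}'))
    return steps


# Made-up chart structure, for illustration only.
chart = [
    {"name": "A", "subs": ["a1", "a2"]},
    {"name": "B", "subs": ["b1"]},
]

for step in reveal_order(chart):
    print(step)
```

Each step would then be handed to whatever animation tool the medium offers. The point is that the order of appearance mirrors the hierarchy of the data.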
Copywriting is also very important.
The words that you choose, the symbols that you use for your units, the way you truncate labels or round up values, the title, the labels and the values that you choose to display,... all these decisions are fundamental in the way the user will perceive your message.
Interactions are important too, particularly for data exploration. You can encourage users to discover interesting facts by allowing them to actually manipulate the dataset.
And what about your "user's flow"?
You don't always have to tell everything in one representation. Complicated problems can easily be split into small parts that often become easier to understand. You can spread representations either in space (show multiple representations on one page for instance), or in time (show a series of representations one after the other). You can also leverage the succession of these representations to create a journey that will also convey a meaning (diving step by step into the details of a value for instance).
Whatever you decide, it has to be intentional and meaningful.
There are many tools; the most important thing is to be conscious of them, and to leverage them as much as possible to achieve the primary use case that you defined: data exploration, or data explanation.
Because that should always be your first concern when designing something: understanding what you're trying to achieve. Just as we switched from a data driven to a user driven approach in web design 15 years ago (remember old email clients that were just mappings of databases? What a journey to get to something like Google Inbox!), we need to move away from meaningless and unintentional dataviz.
The type of data that we want to show should only be a constraint; the use case that we want to address, and thus the user, should always be the driver of a dataviz.
Don't hesitate to have a look at other case studies.