User driven dataviz

qunb - 2015

With the exception of Edward Tufte's books, I never had many opportunities to learn about data visualization before joining qunb (where we were building a tool that turned Google Analytics into meaningful "datastories"). Now that I'm slightly more familiar with the domain, I realise that I'm not completely comfortable with the way most data visualizations are created.
When I tried to learn more about data visualization good practices, I very often came across rules stating things like "continuous values should be displayed as a line chart and discrete values as a bar chart"... And that's fine, all these rules are very reasonable. So reasonable that it's almost common sense. But by focusing on these basic constraints and rules, we miss the core of the data visualization domain: the communication of an idea.

We design data visualizations based on the type of data that we want to show instead of the message that we want to communicate.
Consequently, a large part of the data visualizations that we consume every day are really poorly designed and end up being completely indigestible. They don't address a clear use case. They just (at best) follow existing patterns, because it's so easy to simply update the data in an old Excel template, or because most of these data visualizations are now automated, meaning that nobody is trying to think about the message that should be communicated.
So if we used a line chart to display our sales last year, why should we change this year? Even if the message that we want to communicate is completely different...
These data visualizations are data driven. I'll try to explain why that's a bad approach, and how to switch to user driven representations.

But first, what is a data visualization?

Here is a simple dataset.


This dataset is expressed through numerical values in a table.
A numerical value is a great way to communicate quantities (doh!).
And a table is a good way to structure information.
In some ways, this representation is already a kind of data visualization (in a very simple form).
But there is way more than structures and quantities in this dataset... there are also things that we can't see here (or can barely see), things that lie between the numbers and the lines: relationships and patterns.
A data visualization is an abstraction of these patterns and relationships.

We can already guess some relationships in this table... why go further?

Our "visual hardware" is quite powerful. We can "sense" a lot of visual information.
But our "cognitive bandwidth" is limited, and we can't really process all this visual information.
Selecting which information deserves attention can be tricky, and very often, the user is abandoned at this step.
In our table example, we can of course discover some patterns just by reading the numbers in the cells. But it's hard because it requires a lot of mental gymnastics: "this number is greater than this one and this one and... and the evolution of this one seems to be linked to this one, but that's not true for this dimension and...".
In the same way that numbers are symbols that wrap the complex abstraction that is a quantity, we need something to help us "preprocess" the relevant information in this other kind of complex abstraction that is the relationship between quantities. We need to be able to gain some knowledge from the data.
That's exactly the goal of a data visualization.

How to extract knowledge from data

Actually, we can identify 2 ways.

Data exploration is a process where the user pulls the knowledge from the representation.
Facilitating the extraction of a message is the first use case.
Data explanation is a process where the representation pushes the knowledge to the user.
Delivering a message is the first use case.

Neither of these 2 options is better than the other, it's just a matter of identifying which of them you're trying to address as a first use case. Moreover, things are not just black or white, and most of the time we'll need a bit of both. These two approaches are more like the two ends of a scale... but it's important to identify where a dataviz should stand on this scale, because it allows us to be bold and thus clearer and more efficient.

Most of the data visualizations that we use every day are about data exploration, not because this has been identified as the use case, but simply because they are data driven, meaning that they have been designed according to the type of data that they need to show, and not to the message that they want to communicate to the user. It's easier to fall on the data exploration side when you don't know the use case: just throw a bunch of data in the most common template, and let the user handle it.
Data explanation is almost always completely ignored.

Let's come back to our table example.


That's actually an interesting (and quite extreme) way to let the user explore a dataset. But as we just explained, data exploration is overused and we'll probably need to look in the direction of data explanation if we really have to design a meaningful representation of this dataset.
Let's pretend that we want to send a report about visits per country to a client.
What representation would convey the clearest message?

The common approach is to choose the representation depending on the type of the data. Here, we're dealing with locations, so we'll end up with a map.

A map

Think about visits per country on Google Analytics, or any kind of location based data... no matter what's interesting in the dataset, you'll often end up with a map.
But what kind of information does a map convey?
Like any kind of data visualization, it helps to see patterns. Location based patterns of course. And a map is very bad for everything else.

So if we wanted to show a pattern about location, like showing that the number of visits increases as we get closer to Scandinavia for instance, a map would be great.

A map is relevant to show location based patterns

But let's be honest, that's not something that happens very often, yet we always use a map for location based data, no matter what message we want the reader to get from it. Here, the data type drives the representation.
This is a data driven approach.

If we come back to our location based dataset, we can notice a lot of interesting things to say. For instance, Germany seems to have performed very well compared to other countries in 2011.
That's not easy to see there:

That's not really better here:

A map is not a good representation for ranking values.

So if we want to show that Germany performed very well in 2011 compared to other countries, why not try to find a good representation to compare values?
Like a simple bar chart:

Here, the message drives the representation.
This is a user driven approach.
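
To make this concrete, here is a minimal sketch of what "the message drives the representation" can look like in code. The visit counts below are invented purely for illustration (they are not the figures from the dataset above), and `ranked_bars` is a hypothetical helper. Since the message is a comparison between countries, the values are sorted and drawn as bars instead of being placed on a map.

```python
# Hypothetical visit counts, invented for illustration only
# (these are NOT the values from the article's dataset).
visits_2011 = {"France": 1200, "Germany": 3400, "Spain": 900, "Sweden": 1500}


def ranked_bars(values, width=40):
    """Return one text bar per entry, sorted from largest to smallest."""
    top = max(values.values())
    lines = []
    # Sorting by value is what turns the chart into a ranking.
    for name, count in sorted(values.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * round(count / top * width)
        lines.append(f"{name:<10} {bar} {count}")
    return lines


for line in ranked_bars(visits_2011):
    print(line)
```

The sort is the important design choice here: the ranking is readable at a glance because the representation was picked for the comparison, not for the fact that the rows happen to be countries.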

Fine, but that doesn't help to know how to choose a good representation...

There are tons of ways to represent quantitative information... and people are really creative...
Yet, there is a small set of common representations that are always useful. To help us choose the right one or create a new one, it's a good idea to try to define the type of message that we're trying to address.

Think about all these small questions that you ask yourself (sometimes unconsciously) when you're in front of a dataset: is this number greater than this one? Is this one linked to this one? What does this value represent? etc... They all help to highlight patterns and to shape your mental model of the dataset. These questions are your users' needs, and knowing about them, and thus how to help the user answer them, will help you design a more relevant data visualization.
If you think about it, all these questions are always about comparison. Comparison is a kind of "primitive user need" when we design data visualizations.

The goal is always to allow the user to compare values or series of values in order to:

Here again, things are not just black or white, and we'll often try to communicate several types of messages (and even several messages of the same type). We'll probably have to create more complex representations to do that (like a composition of several charts for instance). That's fine, we just have to keep in mind that the more messages we're trying to address, the less each of them will be understandable.

Some rules of thumb can help to pick the right chart.
For instance, some charts use areas to convey more than one variable through a shape (using height to illustrate one variable, and width for another one).

Areas are hard to compare

This kind of representation is incredibly hard to read, because it's almost impossible to compare areas.

But areas can be useful to illustrate the composition of an element because it's easier to imagine an element when it's pictured as a whole shape.

But areas are great to picture a concept

It's harder to compare each part of the whole, but it's easier to understand the abstraction of the whole, and the link between these parts.

That means that we can define 2 important properties that make a good representation:

And how does a visual representation actually convey information?

Stephen Few wrote a very interesting paper about rules for encoding values in graphs. He talks about some kinds of basic visual entities that are used to encode information: points, lines, and bars.

These 3 visual elements are the most common ones, but if we step down one level of abstraction, we can try to define the properties that would help to understand how to convey information in any kind of visual representation.
Four visual properties seem interesting here: shape, color, dimension and position.

Let's study all these properties through an example.

An interesting chart to study visual properties
Chart by Valentina D'Efilippo

This chart is really interesting because it leverages all these visual properties to convey a meaning.

The shape of a visual element is an amazing tool to create abstractions in many different ways.
Here, we can notice two interesting usages of shapes. First, this graph uses flowers to represent deaths in wars. That's a purely iconic purpose, and that's the most important role of the shape: acting as a placeholder for a concept. That's exactly what happens when we draw countries on a map. The shape of each country helps the reader to match the visual representation with the concept of "France" or "Spain".
We can find another interesting usage of shapes in how the lines are used to show the duration of the war. A line is a natural way to highlight a link between entities, and that's why we often use lines to illustrate evolutions (a link between two states in time). Here, lines are used to create a link between a starting date and an ending date.

Colors can be very hard to master. It's tempting to think that colors bring meaning by themselves, because we often associate a color with a feeling, or an object... but it's very easy to confuse a reader who is not able to understand this association. In those cases, colors can be harmful to readability.
In this graph, colors are used to create categories of elements. That's very efficient, because it's easy to notice color patterns, even when the elements are not close to each other.
But the most basic usage of color is probably to highlight elements. Again, color matching (and thus contrast) is amazingly powerful.

In this graph, the dimension of the flowers is simply used to illustrate the number of deaths. That's the most basic usage of dimension, exactly like when we encode a value in the height of bars in a bar chart.

Finally, the position of the flowers and their lines tells us where we are in time. Again, that's a very common practice that we can see in most line charts.

We know how to map a message to a representation, but how do we identify interesting messages in the first place?

When designing dataviz, identifying interesting messages in a dataset can be challenging.
One of the most famous examples of this issue is known as Anscombe's quartet.
This extreme example shows very well that extracting relevant facts from a dataset can be hard. Actually, the designer is in the shoes of a user / reader, alone in front of a raw dataset. This is a clear data exploration use case...

There is an interesting approach to adopt in order to frame a dataset and to reveal interesting patterns.
As described by Bret Victor in the article Up and Down the Ladder of Abstraction: "to understand a system, we must explore it".
In this article, Bret Victor explains that we can explore complex systems (like complex datasets) by constantly moving between the levels of abstraction that we adopt to represent the system, in order to see and to manipulate different layers of concepts.
There are a lot of ways to visually represent a dataset, and each of these representations stands on a specific level of abstraction: numbers are abstractions of quantities, mean and variance are abstractions of a series of numbers, a table is an abstraction of the structure of these numbers, a line chart is an abstraction of their evolution, a bar chart is an abstraction of their ranking... When we're trying to know what messages to get from a dataset, it's usually a good idea to explore the data through different representations, and thus different levels of abstraction, in order to reveal interesting patterns.
That's how you realise that a dataset like the one described in Anscombe's quartet is actually more complex than it initially looks.
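
As a rough illustration of why a single level of abstraction can be misleading, the sketch below computes a few summary statistics for the four datasets of the quartet, using the values published by Anscombe in 1973. The helper names (`pearson`, `summarize`) are mine, and this is only one possible way to look at the data.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973): four small x/y datasets.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, written out by hand."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def summarize(xs, ys):
    """One level of abstraction: a handful of summary statistics."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


for name, (xs, ys) in quartet.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

At this level of abstraction, the four datasets look nearly identical. It's only by stepping to another level (a scatter plot of each one, for instance) that their very different shapes become visible.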

So, if I know my use case, and if I know the type of message that I'm trying to address, the rest of the work is just about visual design, right?

Visual design is not the only way to convey information and you should use all the tools available on your medium to achieve this mission.

Animations for instance can be an amazing tool.
Let's take an example with this very bad representation.

This chart is really hard to understand

Beyond the difficulty of comparing areas, the problem is that there are a lot of dimensions (a measure, some categories, some sub categories for each category...), and the relationship between them is not clear.

Let's see how animation could help us here with this very simple example:

Animations convey a lot of information when they're used properly

By animating the way we show all the elements of this chart, we can help the user to understand relationships between them: we start by showing main categories from the left to the right, and then we draw sub categories inside each category from the bottom to the top.

Copywriting is also very important.
The words that you choose, the symbols that you use for your units, the way you truncate labels or round up values, the title, the labels and the values that you choose to display... all these decisions are fundamental in the way the user will perceive your message.

Interactions are important too, particularly for data exploration. You can engage users and push them to discover more by allowing direct manipulation on the dataset.

And what about your "user's flow"?
You don't always have to tell everything in one representation. Complicated problems can easily be broken down into smaller parts that often become easier to understand. You can spread representations either in space (show multiple representations in one page for instance), or in time (show a series of representations one after the other).
Whatever you decide, it has to be intentional and meaningful.

There are many tools, and the most important thing is to be conscious of them, and to leverage them as much as possible to achieve the first use case that you defined: data exploration, or data explanation.
Because that should always be your first concern when designing something: getting clarity on what you're trying to achieve. Just like the way we switched from a data driven to a user driven approach in web design 15 years ago (remember old email clients that were just mirroring the structure of databases? What a journey to come to something like Google Inbox!), we need to move away from meaningless and unintentional data visualizations.
The type of data that we want to show should only be a constraint. The use case that we want to address, and thus the user, should always be the driver of a data visualization.

Don't hesitate to have a look at other case studies.

Interesting links:
Journalism in the age of data
Perceptual edge
Narrative visualization : Telling stories with data
Amanda Cox