Wide vs. Long Data

Overview

Wide data and long data are two different formats for storing and organizing data. Long data is sometimes called narrow data, stacked data, or (when formatted appropriately) tidy data.

To understand the structure of these data formats, start by considering a sample dataset that stores, for a given year, the GDP per capita for a country (in this case, Germany):

| Date | GDP per capita |
| --- | --- |
| 1990 | 22,304 |
| 2020 | 46,749 |

When we decide to add the GDP per capita of a second country, there are two strategies available to us.

In wide data format, the additional country is added as a new column:

| Date | Germany | Sweden |
| --- | --- | --- |
| 1990 | 22,304 | 30,594 |
| 2020 | 46,749 | 52,838 |

In long data format, one row is added for each combination of year and country. The country labels and GDP values are stored in separate columns:

| Date | Country | GDP per capita |
| --- | --- | --- |
| 1990 | Germany | 22,304 |
| 1990 | Sweden | 30,594 |
| 2020 | Germany | 46,749 |
| 2020 | Sweden | 52,838 |
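The move from the wide table to the long table can be sketched in a few lines of plain Python. This is only an illustration of the reshaping idea, not how Mappica handles data internally; the function name `wide_to_long` and its parameters are hypothetical.

```python
# Wide table from the example: one row per year, one column per country.
wide_rows = [
    {"Date": 1990, "Germany": 22304, "Sweden": 30594},
    {"Date": 2020, "Germany": 46749, "Sweden": 52838},
]


def wide_to_long(rows, id_column, var_name, value_name):
    """Turn every non-id column into its own (id, label, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is repeated on each output row instead
            long_rows.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(wide_rows, "Date", "Country", "GDP per capita")
for row in long_rows:
    print(row)
```

Each of the two wide rows yields one long row per country, so the result has four rows. Data libraries usually ship this operation ready-made; in pandas, for example, it is `DataFrame.melt`.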

Choosing a Format

Wide and long data formats cater to varying needs and scenarios:

Wide data is more intuitive for public sharing. When datasets are presented in public-facing contexts, for instance as tables in news articles or reports, wide data formats are often preferred. They display categories as separate columns, making it easier for readers to quickly grasp comparisons and relationships without requiring advanced knowledge of data structures.

Long data is usually better for statistical software and advanced analysis. Long data formats are highly compatible with statistical software and programming languages, such as R or Python, which often require data in this structure for functions like grouping, filtering, or summarizing. This format makes it easier to handle multiple variables, apply consistent transformations, and perform complex analyses across categories.
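As a rough illustration of why grouping is convenient in long format, the sketch below groups the sample rows using only the Python standard library. The helper `group_values` is a hypothetical name introduced for this example; the point is that one function can group by any column because the category labels are stored as values rather than as column headers.

```python
from collections import defaultdict

# Long-format sample data: one row per (year, country) combination.
long_rows = [
    {"Date": 1990, "Country": "Germany", "GDP per capita": 22304},
    {"Date": 1990, "Country": "Sweden", "GDP per capita": 30594},
    {"Date": 2020, "Country": "Germany", "GDP per capita": 46749},
    {"Date": 2020, "Country": "Sweden", "GDP per capita": 52838},
]


def group_values(rows, key, value):
    """Collect the `value` entries of all rows that share the same `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


# The same helper groups by country or by year without any reshaping.
by_country = group_values(long_rows, "Country", "GDP per capita")
by_year = group_values(long_rows, "Date", "GDP per capita")

# A summary per group: the highest value recorded for each country.
peak = {country: max(values) for country, values in by_country.items()}
print(peak)
```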

In Mappica, you can build datasets using either wide or long data formats, though certain formats are better suited to specific situations. Here are several factors to consider:

1. The complexity of the data: Wide data is typically better suited to smaller datasets that contain only a few series (e.g., 2–5), since editing and managing data can be easier when columns are viewed side by side, without repeating the independent variable (the "Date" column in the examples above).

2. Selection of visual elements: Many elements in Mappica are capable of using either wide or long format, but some require a particular data format. The available data formats for a particular element are displayed in the right panel, under the Dataset section.

3. Filtering needs: When you plan to build intricate filtering into your visualization and need multiple elements to connect to the same filter controls, long data is often the better choice. Consider an updated version of the sample dataset that stores both "GDP per capita" and "Population" data for Germany and Sweden. In long format, it might look like this:

| Date | Country | GDP per capita | Population |
| --- | --- | --- | --- |
| 1990 | Germany | 22,304 | 79.43 |
| 1990 | Sweden | 30,594 | 8.56 |
| 2020 | Germany | 46,749 | 83.16 |
| 2020 | Sweden | 52,838 | 10.35 |

We can use this dataset to easily create a chart for GDP and another for population. We can also add filters for any of the variables. For instance, we could create a filter element that is tied to the country column and connect this to both charts. This filter lets the user toggle the visibility of countries in both charts.
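The shared filter can be pictured with a small Python sketch. It does not use any Mappica API; `chart_series` and `visible` are hypothetical names standing in for a chart's data query and the state of a filter control.

```python
# Long-format rows; GDP per capita figures as in the earlier two-country tables.
long_rows = [
    {"Date": 1990, "Country": "Germany", "GDP per capita": 22304, "Population": 79.43},
    {"Date": 1990, "Country": "Sweden", "GDP per capita": 30594, "Population": 8.56},
    {"Date": 2020, "Country": "Germany", "GDP per capita": 46749, "Population": 83.16},
    {"Date": 2020, "Country": "Sweden", "GDP per capita": 52838, "Population": 10.35},
]


def chart_series(rows, measure, visible_countries):
    """Return {country: [(date, value), ...]} for the countries switched on."""
    series = {}
    for row in rows:
        if row["Country"] in visible_countries:
            series.setdefault(row["Country"], []).append((row["Date"], row[measure]))
    return series


# One filter state, tied to the "Country" column, drives both charts.
visible = {"Sweden"}
gdp_chart = chart_series(long_rows, "GDP per capita", visible)
population_chart = chart_series(long_rows, "Population", visible)
print(gdp_chart)
print(population_chart)
```

Because both charts read the same "Country" column, changing `visible` in one place changes what both of them show.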

Now consider the wide data equivalent:

| Date | Germany GDP | Sweden GDP | Germany Pop | Sweden Pop |
| --- | --- | --- | --- | --- |
| 1990 | 22,304 | 30,594 | 79.43 | 8.56 |
| 2020 | 46,749 | 52,838 | 83.16 | 10.35 |

Once again we can create separate charts for both GDP and population. However, we can no longer simultaneously filter both charts using a single variable (e.g., country). In wide data format, relationships that were previously explicitly represented have been lost, and as a result the format is more limiting in terms of functionality.
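To make the lost relationship concrete, the sketch below shows what it takes to recover a "Country" column from the wide table in plain Python. It assumes every header follows the pattern "<country> <measure>" (for example "Sweden GDP"), which is a naming convention rather than something the data itself guarantees; `split_header` and `wide_to_long` are hypothetical helpers.

```python
# Wide rows; GDP per capita figures as in the earlier two-country tables.
wide_rows = [
    {"Date": 1990, "Germany GDP": 22304, "Sweden GDP": 30594,
     "Germany Pop": 79.43, "Sweden Pop": 8.56},
    {"Date": 2020, "Germany GDP": 46749, "Sweden GDP": 52838,
     "Germany Pop": 83.16, "Sweden Pop": 10.35},
]


def split_header(header):
    """Split '<country> <measure>' on the last space (assumed convention)."""
    country, measure = header.rsplit(" ", 1)
    return country, measure


def wide_to_long(rows, id_column="Date"):
    """Rebuild one record per (date, country) by parsing the column headers."""
    records = {}
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            country, measure = split_header(header)
            record = records.setdefault(
                (row[id_column], country),
                {id_column: row[id_column], "Country": country},
            )
            record[measure] = value
    return list(records.values())


restored = wide_to_long(wide_rows)
for record in restored:
    print(record)
```

The country only becomes filterable after the headers have been parsed, and a differently named column would break the split. In long format that information is already a column of its own.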
