## Drawing diagrams and figures for research articles and theses

Devising a clear, concise, easy-to-understand diagram for your research article can be a daunting task. When your readers first approach your article, the diagrams are likely to be the first thing that they see, and may lead them to make a positive or negative assessment of the worth of the paper.

I have put together here some recommendations based on my experiences with reviewing and supervising doctoral students. This text is currently in draft form—I still need to add in example images. If you have any comments, I’d really like to hear from you!

A good diagram conveys an idea or a set of ideas in a concise way. The first sketch that you make may not work as a good explanation of the idea, so be prepared to throw away the first version (or so). For this reason, it is much easier to sketch by hand before you start on the computer. Once you have a workable sketch, you’ll find it easier to lay out the text and graphics on the diagram neatly using a graphics package.

## Size your canvas appropriately before laying out

If you use a package such as Inkscape, it will, by default, give you an A4 page to draw upon. This leads you to draw a large diagram that covers the whole page, which is then shrunk to, say, 3 or 3.5 inches (roughly 8 to 9 cm) wide to fit into a two-column document (the format used by most conferences). You will end up with tiny text, large amounts of space between boxes, thin and spidery lines and arrows, and large amounts of padding within the boxes around any text.

To avoid this problem, start by sizing the canvas to fit your column width. You should aim to avoid any resizing of the diagram when importing it. When you adjust the canvas size, it is useful to leave a small margin (say 1 mm) around the outside of the figure: even if the graphical elements are positioned just inside the edge of the page, aliasing effects can cause them to spill slightly outside, and they will look cut off if they sit right on the edge.
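As a rough illustration of the idea, here is a minimal sketch that writes an empty SVG canvas at the final printed size, with a guide rectangle marking the safe area inside a 1 mm margin. The function name and the 85 mm width are hypothetical; take the real column width from your venue's template.

```python
def svg_canvas(width_mm, height_mm, margin_mm=1.0):
    """Return an empty SVG document sized in millimetres.

    The viewBox uses the same numbers as the physical size, so one user
    unit in the drawing equals 1 mm on the printed page.
    """
    inner_w = width_mm - 2 * margin_mm
    inner_h = height_mm - 2 * margin_mm
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width_mm}mm" height="{height_mm}mm" '
        f'viewBox="0 0 {width_mm} {height_mm}">\n'
        # Guide rectangle: keep all graphical elements inside this area
        f'  <rect x="{margin_mm}" y="{margin_mm}" '
        f'width="{inner_w}" height="{inner_h}" '
        f'fill="none" stroke="#cccccc" stroke-width="0.1"/>\n'
        f'</svg>\n'
    )


COLUMN_WIDTH_MM = 85.0  # hypothetical value; check your template
svg = svg_canvas(COLUMN_WIDTH_MM, 50.0)
print(svg)
```

Opening a file like this in your drawing package gives you a canvas that needs no rescaling when it is imported into the paper.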

What will happen if you don’t take this advice? You might still be able to use scaling to fit a diagram to your page. However, the font sizes won’t match up. One trick to get around this is to use a different font from the one used in the main body of the text. For example, use Helvetica in the diagram and Times Roman in the body text. This way, the font size mismatch will be less obvious.

Careful use of a scale transform applied to the whole diagram can also adjust an existing diagram to fit your target space. Make sure you scale horizontally and vertically by the same factor. You may need to correct font sizes slightly afterwards (for example, they might end up as 9.1 pt instead of 9 pt).
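The arithmetic behind this can be sketched as follows. The helper names and the example dimensions are made up for illustration: one function computes the single factor to apply to both axes, and the other shows what a font size becomes after scaling and which "round" size to correct it to.

```python
def uniform_scale(src_width, src_height, target_width):
    """Return (factor, new_height) for scaling to target_width.

    The same factor is applied to both axes so nothing is squashed.
    """
    factor = target_width / src_width
    return factor, src_height * factor


def snap_font_size(size_pt, factor, step=0.5):
    """Scale a font size, then round to the nearest multiple of `step` points."""
    return round(size_pt * factor / step) * step


# Hypothetical example: an A4-width drawing (210 mm) scaled to an 85 mm column
factor, new_height = uniform_scale(210.0, 148.0, 85.0)
print(f'scale factor: {factor:.4f}, new height: {new_height:.1f} mm')
print(f'22 pt text becomes {22 * factor:.2f} pt, corrected to {snap_font_size(22, factor)} pt')
```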

## Use a consistent sizing of fonts and lines

Avoid having some lines thicker than others unless it was your intent to convey extra information this way. In this case, be careful that the reader understands that extra information in the way that you think she does. Try to avoid allowing lines to be too thin or thick (1 pt should generally be considered a minimum).

Avoid small font sizes: some conferences and journals explicitly request nothing smaller than 8pt in graphics.

Avoid overly large font sizes. Scaling up a small diagram can enlarge the text; avoid this by turning off rescaling in LyX or LaTeX and by limiting the maximum font size to around 12 pt.

As a general rule, you should try to match the caption font size (typically 9pt).
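If your diagram is saved as SVG, a quick consistency check is possible with a few lines of script. The following is only a sketch: it assumes font sizes appear either in `style` strings or as `font-size` attributes, and the sample SVG text is invented for the example.

```python
import re
from collections import Counter

# Matches font-size in CSS style strings (font-size:9pt)
# and as a plain attribute (font-size="9pt")
FONT_SIZE = re.compile(r'font-size\s*[:=]\s*"?\s*([0-9.]+)\s*(pt|px|mm)?')


def font_sizes(svg_text):
    """Count how often each font size occurs in the SVG source text."""
    return Counter(m.group(1) + (m.group(2) or '')
                   for m in FONT_SIZE.finditer(svg_text))


# Invented sample; in practice read the text of your own .svg file
sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<text style="font-size:9pt">input</text>'
    '<text font-size="9pt">output</text>'
    '<text style="font-size:12px">stray label</text>'
    '</svg>'
)
for size, count in font_sizes(sample).most_common():
    print(size, count)
```

More than one or two distinct sizes in the output is a hint that some labels need attention.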

## Type consistency

Aim to give each type of graphical element (such as an arrow, box or circle) a consistent meaning across the whole diagram. For example, avoid using arrows in one place to indicate transfer of control and in another, transfer of information.

## Try to use standard diagrams

If you are representing data flows, use a data flow diagram (DFD). If you are representing class hierarchy, use a UML class diagram. If you can’t find a diagram style that suits, you may need to make your own but consider borrowing strong elements from existing formats.

## Be careful with resizing

If you resize text, it can make a fundamental alteration to the font – squashing horizontally and stretching vertically or vice versa. This text will look slightly wrong (but you probably won’t be able to say exactly why it’s wrong unless you look closely).

A similar problem occurs with many other things (such as the widths of lines, which will be altered by squeezing or stretching).

The solution is to completely avoid resizing using the stretch tool. Resizing of boxes and lines can be achieved by using the “edit paths” tool. Text should be resized by changing the font size.

If you have existing text that has been stretched or squashed, the simplest fix is to cut and paste the text to a new text box. You’ll probably be surprised how much it changes!

## Drawing arrows between shapes

The way that most computer programs (and most computer scientists) draw arrows in diagrams is not, in my view, aesthetically pleasing. They tend to use the rule: draw from the centre of an edge to the centre of the target edge.

However, I (and many other people) prefer that you draw arrows aligned to a line going through the centroid (or centre of mass).

Furthermore, the eye appreciates curves rather than straight lines, so you could keep to the centre-of-edge rule but curve the line.

But actually I think that this works poorly when there are many arrows and it is better to draw a curve centroid to centroid.

To construct this last one, you either need to use clipping (I think Inkscape might support this) or align your arrow with a curve, drawn centre to centre, that starts out going right and ends going right. The control points need to be placed by eye to make a line that you find appropriate.
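For the straight-line case, the clipping can be worked out directly. The sketch below (function names are my own, and it handles axis-aligned rectangles only) finds where the line joining two box centres crosses each box edge, which gives the start and end points of a centroid-aligned arrow. Curved arrows would still need their control points set by eye.

```python
def centre(box):
    """Centre of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return x + w / 2, y + h / 2


def clip_to_box(box, towards):
    """Point where the line from the box centre towards `towards` leaves the box."""
    _, _, w, h = box
    cx, cy = centre(box)
    dx, dy = towards[0] - cx, towards[1] - cy
    if dx == 0 and dy == 0:
        return cx, cy  # degenerate case: target is the centre itself
    # Fraction of the direction vector needed to reach a vertical / horizontal edge
    tx = (w / 2) / abs(dx) if dx else float('inf')
    ty = (h / 2) / abs(dy) if dy else float('inf')
    t = min(tx, ty)  # whichever edge is hit first
    return cx + t * dx, cy + t * dy


def centroid_arrow(box_a, box_b):
    """Start and end points of an arrow from box_a to box_b along the centre line."""
    return clip_to_box(box_a, centre(box_b)), clip_to_box(box_b, centre(box_a))


start, end = centroid_arrow((0, 0, 20, 10), (40, 30, 20, 10))
print(start, end)
```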

## Sizing boxes with text

Generally speaking, vertical and horizontal padding between text and the edge of a box surrounding it should be (a) even above and below, and left and right, and (b) roughly the same horizontally and vertically.

## Choose a good colour scheme

http://colorbrewer2.org/ provides a nice way to choose a colour scheme that is both pleasant and consistent. It also helps with producing diagrams that might also work if they are printed on a black and white printer or are viewed by people who have impaired colour vision.
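One way to get a feel for how a palette will survive black-and-white printing is to compare the relative luminance of its colours. The sketch below uses the standard sRGB luminance weights; the hex values are arbitrary examples to be replaced with your own scheme, and a small gap between two colours only suggests, rather than proves, that they will be hard to tell apart in greyscale.

```python
def relative_luminance(hex_colour):
    """Relative luminance (0 = black, 1 = white) of an sRGB colour such as '#1b9e77'."""
    hex_colour = hex_colour.lstrip('#')
    channels = [int(hex_colour[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma encoding to get linear light values
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def smallest_gap(palette):
    """Smallest luminance difference between neighbours once sorted by luminance."""
    lums = sorted(relative_luminance(c) for c in palette)
    return min(b - a for a, b in zip(lums, lums[1:]))


palette = ['#1b9e77', '#d95f02', '#7570b3']  # arbitrary example colours
for colour in sorted(palette, key=relative_luminance):
    print(colour, round(relative_luminance(colour), 3))
print('smallest gap:', round(smallest_gap(palette), 3))
```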

## Print out and review

Many small mistakes can be spotted by printing the graphic out in the correct size (i.e., the size that it will eventually be printed at) and examining carefully. There are several things to check for:

• have boxes or lines become pixellated?
• is the text readable (too small or large)?
• is it well balanced?
• are there extraneous artefacts (e.g. small graphical elements that are not supposed to be there)?

## Check for spelling errors

The print out and review stage is also a good time to make sure that the spelling is correct. Although Inkscape and many other graphical programs will check for errors, they can’t spot substitutions such as “through” to “trough” or “perform” to “preform”. The only way to be sure is to read through all the text carefully.

## A check-list for graphics

The following check-list should be used to ensure your graphics are of good quality:

1. Is the graphic sized correctly for the target column size (e.g., about 3.5 inches for double column)?
2. Are fonts consistent and sized so that they are readable?
3. Is padding around text minimal (not wasting too much space)?
4. Is the colour scheme appropriate for the use?
5. Have you printed out and reviewed on paper?
6. Spell checked?

## Key ideas

• DON’T manually transfer table values to LaTeX – DO put the data in a separate file that gets loaded during compilation.
• DON’T format your values and truncate decimal places manually – DO use a script to truncate values consistently.
• DON’T manually insert units or convert exponents – DO use siunitx to format numbers and units.
• DON’T end up with a jumble of scripts – DO tie your workflow together with a Makefile.

## Introduction

Developing reproducible research is a key element in producing robust results and good science.

To achieve research that is reproducible by others, we must first be able to reproduce it ourselves. That is, we should be able to come back to our source files in 6 months (or more!) and re-run any part of the analysis, reproduce any graph, check the values in the tables, and generally ensure that our results were not a fluke. How important reproducibility is to you will depend on your discipline, but no self-respecting researcher should publish a paper whose results cannot be reproduced even with access to the original data.

One element of reproducibility, that I want to focus on here, is to produce tables in such a way that manual error is avoided and that any value in the table can be recalculated.

Manual handling of numbers is a common source of error but it needn’t be. Once scripts are set up to automatically generate tables from the source data, they are easily modified to suit the next table, the next paper, or the next project. If you are not using a tool that supports you producing your tables directly, then this will be the biggest hurdle. However, without this step, it will be hard, not just to automate your tables, but to make your work completely reproducible.

Your tools will dictate, to some extent, how easy automation is. Try to avoid tools that encourage manual handling, such as Excel and Word, and switch instead to tools that make automation easier, like R and Python. I also recommend that you make use of GNU Make to tie everything together and make it easy to remember what to do when you come back in 6 months’ time.

## Background

I don’t want to provide a detailed literature review here but I do want to make a note of some important trends in science.

1. The open science movement is leading the way towards more transparent scientific practices. Scientists are starting to realise that the intellectual honesty that goes with open source software should also apply to their outputs. This means more than just making the output (or journal paper) freely available – it means making the ‘source code’ of that output available, including the original data used and analysis scripts.
2. A systematic study of biomedical research in 2005 by Ioannidis found that most published research findings are false. It seems likely that the problems identified for biomedicine are worse, not better, for other disciplines.
3. A 2016 survey of 1500 scientists asked whether there was a reproducibility crisis. More than half said ‘yes’, with another third saying that there was a slight crisis. Scientists reported trouble reproducing others’ experiments, and many said they even had trouble reproducing their own when they attempted to do so.

The key message is that reproducibility is of fundamental importance and that we all need to work harder at enabling it for our own research.

Most researchers will manually transcribe this data into a LaTeX (or Word) document, like so:

```latex
\begin{tabular}{...}
Policy & Avg. Reward & Comfort score & Energy use (Wh)\\
bang-bang-et & $-2.82$ & $-0.72$ & $1950$ \\
bang-bang-avg & $-2.27$ & $-0.87$ & $721$ \\
...
```


Note how a few things needed to be manually transformed in this process.

• Numbers need to be written in math mode (using `$` signs) to give a consistent font and to ensure that the minus signs look right.
• Some rows or columns may not be relevant (the 1.48E+12 is actually to do with the Unix clock time when the result was generated).
• The numbers need to be truncated or rounded appropriately. It’s a good question to ask “what’s appropriate?” here. If you have standard deviation or confidence intervals then round appropriately for that. Leaving in a large number of digits suggests that you don’t understand the uncertainty in your data.
• Exponents need to be translated into a printable form. Note that the siunitx package has a nice facility for doing this automatically.

With so many little details to be taken care of, automating looks hard. Fortunately, there are some cool tools to help.

If the job is a simple one (or can be made simple), try using csvsimple to load in the table. I won’t describe this here but there’s lots of help on the Internet.

The csvsimple package allows you to put extra commands, such as `\si{}` (from siunitx), around each table entry, but for specialist needs (such as truncating numbers) you may need to write your own script that writes a ‘tex’ file. This tex file will then need to be included in your main file with `\input`.
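A script of this kind might look like the following sketch. The CSV column names, the sample values, and the function name are all made up for illustration; the idea is simply to pick the relevant columns, round each value with a format spec, and wrap it in the `\num{}` macro from siunitx so that LaTeX handles the typesetting.

```python
import csv
import io


def tabular_rows(csv_file, columns):
    """Build LaTeX table rows from CSV data.

    columns: list of (csv column name, format spec or None for plain text).
    Columns that are not listed (e.g. timestamps) are simply dropped.
    """
    rows = []
    for record in csv.DictReader(csv_file):
        cells = []
        for name, spec in columns:
            if spec is None:
                cells.append(record[name])
            else:
                # Round with the format spec; siunitx typesets the result
                cells.append(r'\num{' + format(float(record[name]), spec) + '}')
        rows.append(' & '.join(cells) + r' \\')
    return rows


# Made-up sample data; in practice use open('results.csv', newline='')
sample = io.StringIO(
    'policy,reward,comfort,energy,timestamp\n'
    'bang-bang-et,-2.8213,-0.7190,1950.2,1.48E+12\n'
)
columns = [('policy', None), ('reward', '.2f'), ('comfort', '.2f'), ('energy', '.0f')]
rows = tabular_rows(sample, columns)
print('\n'.join(rows))
```

Writing the joined rows to a file, rather than printing them, gives you something that the main document can pull in with `\input` inside its `tabular` environment.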

### Truncating numbers appropriately

Quoting table values to 10 decimal places is clearly not appropriate. Most experiments, if tried again, will yield slightly different values. We should aim to express numbers in a way that appropriately reflects our uncertainty about the true value.

For example, imagine that we have an experiment that involves 10 trials and we record the mean measurement value from those trials. The standard deviation provides useful information about the likely precision of the mean. Confidence intervals can often be derived from the standard deviation given the sample size and assuming a normal distribution.
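As a minimal sketch of that last step, the following uses only the Python standard library and the normal approximation described above. The trial values are invented, and for a sample as small as 10 a Student's t interval would usually be somewhat wider than this.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(samples, level=0.95):
    """Confidence interval for the mean, assuming a normal distribution."""
    m = mean(samples)
    sem = stdev(samples) / sqrt(len(samples))   # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    return m - z * sem, m + z * sem


# Invented measurements from 10 trials
trials = [10.3, 9.8, 10.1, 10.4, 9.9, 10.2, 10.0, 10.5, 9.7, 10.1]
low, high = normal_ci(trials)
print(f'mean = {mean(trials):.2f}, 95% CI = [{low:.2f}, {high:.2f}]')
```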

As a rule of thumb, the standard deviation should be expressed to one significant figure unless the number is between 11 and 19 (times some power of ten) in which case you can use two significant figures.

The measurement value should be expressed to agree in terms of decimal places with the standard deviation.

For example, a value resulting from a spreadsheet calculation of an average and standard deviation might be 10.1298 ± 0.2595. This should be expressed as 10.1 ± 0.3 or 10.1 (0.3), where the number in parentheses is taken to be the estimated standard deviation. The estimate indicates that the value is only known to within three tenths of a unit of measurement. The figures beyond the tenths place are not informative to the reader and should be dropped.

Note that I’m glossing over the details here and it is worth reading more about measurement uncertainty.

The following code roughly obeys the above rules. The trick is to use a nice feature of the Python string formatter that allows the number of digits after the decimal point to be parameterised.

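```python
import numpy as np

def mean_string(m, s):
    # Decimal places needed to reach the leading digit of the s.d.
    # (assumes 0 < s < 1)
    sig = -int(np.floor(np.log10(s)))
    return '{:6.{sig}f} ± {:.1g}'.format(m, s, sig=sig)
```

```python
>>> mean_string(0.016933, 0.005105)
' 0.017 ± 0.005'
```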

The way this works is to work out the base 10 log of the s.d. This number will typically be negative (I haven’t dealt with s.d. > 1!). Taking the floor of this number will tell you how many digits after the decimal point need to be included to format the mean.

For this simple code, the s.d. is simply formatted with 1 significant figure. This might be improved by first finding out if the first two digits of the s.d. are between 11 and 19 inclusive and in that case formatting with 2 significant digits.
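One possible version of that improvement is sketched below, using only the standard library. The function name is my own. It applies the 11 to 19 rule and also copes with an s.d. of 1 or more, although for an s.d. of 10 or more it simply prints whole numbers rather than rounding to tens, and values sitting exactly on a rounding boundary may fall either way because of floating point.

```python
from math import floor, log10


def format_mean_sd(m, s):
    """Format mean ± s.d. with the s.d. to 1 significant figure,
    or 2 when its leading two digits lie between 11 and 19."""
    exponent = floor(log10(s))                     # position of the leading digit
    leading_two = round(s / 10 ** (exponent - 1))  # e.g. 0.1432 -> 14
    sig = 2 if 11 <= leading_two <= 19 else 1
    decimals = max(0, sig - 1 - exponent)          # digits after the decimal point
    return '{:.{d}f} ± {:.{d}f}'.format(m, s, d=decimals)


print(format_mean_sd(0.016933, 0.005105))
print(format_mean_sd(10.1298, 0.2595))
print(format_mean_sd(10.1298, 0.1432))
```

The mean is formatted with the same number of decimal places as the s.d., so the two always agree.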

## Makefiles to tie together and document

I always find that when I come back to a project after leaving it for a few weeks that I cannot remember what I’ve done. Some vague recollection exists, perhaps, of scripts that process one file into another but the details and ordering of the procedure have vanished from my memory.

In theory, one might document the process. However, this still leaves you with a manual process that may get out of step with the last version of the documentation.

A better approach is to weave the documentation and the code together into a master script. GNU Make provides a simple and effective method for doing this.

I should note that GNU Make simply does not handle spaces or special characters in filenames, so don’t even try. However, it is usually not such a burden to use hyphens or underscores.

An excellent tutorial on using Make for reproducible research is provided by Arnold et al. [1].

## Conclusions and next steps

Automating your research can seem like cycling up a steep hill. Progress appears slow and it’s much harder work than usual. However, once over the hump, you’ll find the going much easier and generally much more rewarding.

In closing, I’d like to point again to documentation provided by The Turing Way Community as a good general source of information on how to make your research more reproducible. You may also want to look at an online course on reproducible science.

[1] The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986

## Acknowledgements

This blog post was produced using Emacs, org-mode, and org2blog.