Science – James Brusey

June 19, 2022January 10, 2023

Drawing diagrams and figures for research articles and theses

Devising easy to understand, clear, concise diagrams for your research article can be a daunting task. On the one hand, expressing your ideas graphically may not come naturally. On the other hand, it can be difficult to get graphic editing tools to produce the look that you want.

However, diagrams are also an important part of doing great research. When your readers first approach your article, the diagrams are likely to act as a gateway to the rest of the content. A good diagram doesn’t just take up space—it can bring the subject matter to life. It can also add a lot of character to your work and provide a personal touch that is so often missing otherwise. The care and attention to detail that you put into your diagrams will hugely impact how your paper is received.

Given the importance of this topic, I have put together some recommendations on producing diagrams based on my experiences with reviewing research papers and supervising doctoral students. This text is currently in draft form—I still need to add in some more example images. If you have any comments, I’d really like to hear from you!

Start with a rough sketch

A good diagram conveys an idea or a set of ideas in a concise way. For this reason, the first sketch that you do may not work as a good explanation of the idea. Be prepared to throw away the first version (or so) and for this reason, it is much easier to sketch out by hand before you start. Once you have a workable sketch, you’ll find it easier to lay out the text and graphics on the diagram in a neat way using a graphics package.

Figure 1: A rough sketch involving different sorts of elements (a controller is different from a load forecast, which is different again from a solar panel) and different sorts of interconnections (such as, the connection between the load controller (LC) and the main controller versus the transmission of information from the PV forecast to the energy management system).

Size your canvas appropriately before laying out

If you use a package such as Inkscape, it will, by default, give you an A4 page to draw upon. This leads you to draw a large diagram that covers the whole page, which is then shrunk to, say, 3 or 3.5 cms to scale it to fit into a two column document (suitable for most conferences). However, you will thus end up with tiny text, large amounts of space between boxes, thin and spiderlike lines and arrows, and large amounts of padding within the boxes around any text.

To avoid this problem, start by sizing the diagram to fit to your column width. You should aim to avoid any resizing of the diagram when importing it. If you adjust the canvas size, it is useful to leave a small (say 1mm) space, around the outside of the figure since, even if the graphical elements are positioned only inside the edge of the paper, aliasing effects can cause them to flow slightly outside, and they will look cut off if they are right on the edge.

What will happen if you don’t take this advice? You might still be able to use scaling to fit a diagram to your page. However the font sizes won’t match up. One trick to get around this is to use a different font than is used in the main body of the text. For example, use Helvetica in the diagram and Times Roman in the body text. This way, the font size mismatch will be less obvious.

Careful use of the scale transform applied to the whole diagram can also be used to adjust an existing diagram to fit in your target space. Make sure you adjust horizontal and vertical proportionally. You may need to rectify font sizes slightly afterwards (they might be 9.1 pt instead of 9 pt, e.g.)

Figure 2: In Inkscape, you can find the option to resize the canvas under File / Document properties... and select the Page tab.

For line drawings, output vector graphics

When you drawing is shown on the final printed page, the available final resolution may be thousands of dots per inch (DPI). For this reason, a pixel graphic that works well on the screen may look pixelated and ugly on the printed page. Once you become aware of this issue, such eyesores stick out. To avoid the problem, make sure you can output vector graphics from your drawing tool before you start.

There are a few cases where it is still a good idea to output raster images. For example, if you have a scatter plot graph that contains many thousands of individual dots, then rendering this as a vector image can slow the PDF viewer down a lot when it shows your final page. In this case, I recommend using a PNG (or raster image) instead.

If your tool doesn’t support line drawings of the quality you desire, spend some time investigating some other tools.

Selecting a suitable drawing tool

There are many excellent drawing tools are available. All seem to have strengths and weaknesses. The main point here is to not to accept the default, or at least, not to accept it without some probing. Here are a few that I’ve found useful but there are plenty more and if you have other favourites, please let me know! Note that I deliberately exclude non-free software here. For example, Microsoft Powerpoint and Microsoft Visio are well regarded but if you lose access to the license, you will no longer be able to edit your image.

Tool	Strong at	Not so good for
Inkscape	Basic line drawings, shadows	Connected elements, generating from code, accuracy
tikz	Accuracy, relative positioning, generated from code	Quick sketches
graphviz	Drawing networks of connected items	Specifying where to draw ovals or boxes or how to route lines
dia	Electronics, data flow diagrams, UML, exporting to LaTeX (pgf) code	Precise diagrams
Google docs	Including in a google slides presentation	Precise diagrams

Use a consistent sizing of fonts and lines

Avoid having some lines thicker than others unless it was your intent to convey extra information this way. In this case, be careful that the reader understands that extra information in the way that you think she does. Try to avoid allowing lines to be too thin or thick (1 pt should generally be considered a minimum).

Avoid small font sizes: some conferences and journals explicitly request nothing smaller than 8pt in graphics.

Avoid overly large font sizes. Scaling of a small diagram can expand text – avoid this by turning off rescaling in LyX or LaTeX and by limiting the maximum font size to around 12 or less.

As a general rule, you should try to match the caption font size (typically 9pt).

Type consistency

Aim to have a particular type of graphical element (such as an arrow, box or circle) having a consistent meaning across the whole diagram. For example, avoid using arrows in one place to indicate transfer of control and in another, transfer of information.

Try to use standard diagrams

If you are representing data flows, use a data flow diagram (DFD). If you are representing class hierarchy, use a UML class diagram. If you can’t find a diagram style that suits, you may need to make your own but consider borrowing strong elements from existing formats.

Be careful with resizing

If you resize text, it can make a fundamental alteration to the font – squashing horizontally and stretching vertically or vice versa. This text will look slightly wrong (but you probably won’t be able to say exactly why it’s wrong unless you look closely).

A similar problem occurs with many other things (such as the widths of lines, which will be altered by squeezing or stretching).

The solution is to completely avoid resizing using the stretch tool. Resizing of boxes and lines can be achieved by using the “edit paths” tool. Text should be resized by changing the font size.

If you have existing text that has been stretched or squashed, the simplest fix is to cut and paste the text to a new text box. You’ll probably be surprised how much it changes!

Drawing arrows between shapes

The way that most computer programs (and most computer scientists) that draw diagrams is not, in my view, aesthetically pleasing. They tend to use the rule: draw from the centre of an edge to the centre of the target edge.

However, I (and many other people) prefer that you draw arrows aligned to a line going through the centroid (or centre of mass).

Furthermore, the eye appreciates curves rather than straight lines; so you could keep with centre edge but curve the line

But actually I think that this works poorly when there are many arrows and it is better to draw a curve centroid to centroid.

To construct this last one, you need to either use clipping (I think Inkscape might support this) or you align your arrow with a curve that is drawn centre to centre that starts out going right and ends going right. The control points need to be done by eye to make a line that you find appropriate.

Figure 3: An example diagram using GraphViz (specifically, the dot program). Note that arrows meet the surface of the oval so that the line of the arrow points to the centre of the oval. Some further improvements to make here are to ensure that text does not crash into the lines or arrowheads.

Sizing boxes with text

Generally speaking, vertical and horizontal padding between text and the edge of a box surrounding it should be (a) even above and below / left to right and (b) roughly the same between horizontal and vertical.

Choose a good colour scheme

http://colorbrewer2.org/ provides a nice way to choose a colour scheme that is both pleasant and consistent. It also helps with producing diagrams that might also work if they are printed on a black and white printer or are viewed by people who have impaired colour vision.

I must admit that I always found it a bit of a pain to carefully type in the hex codes for each individual item that I wanted to colour. So I was especially pleased when I discovered the export option in colorbrewer. Note that the resulting palette will be named something like `CB_qual_Paste1_5′ and you need to restart Inkscape after moving the file into your palette directory to get it to load.

Figure 4: Colorbrewer2.org also supports exporting a palette, which can be downloaded and inserted into the appropriate directory for Inkscape (see https://inkscape-manuals.readthedocs.io/en/latest/palette.html).

Print out and review

Many small mistakes can be spotted by printing the graphic out in the correct size (i.e., the size that it will eventually be printed at) and examining carefully. There are several things to check for:

have boxes or lines become pixellated?
is the text readable (too small or large)?
is it well balanced
are there extraneous artefacts (e.g. small graphical elements that are not supposed to be there)

Check for spelling errors

The print out and review stage is also a good time to make sure that the spelling is correct. Although Inkscape and many other graphical programs will check for errors, they can’t spot substitutions such as “through” to “trough” or “perform” to “preform”. The only way to be sure is to read through all the text carefully.

Make your figure caption informative

Captions help the reader understand a diagram. However, there seems to be a trend towards making captions short and rather cryptic. For example, “System architecture” doesn’t tell you anything. Probably there are some boxes and arrows—but what do the boxes and arrows stand for? Is a box a piece of software or a physical computer or a metaphorical entity that flows over multiple devices? What do arrows really mean? Flow of information, perhaps? Direction of control? Physical connection? If colour is used, or some element is made larger or bolder, was this just a slip of the mouse or in important part of the information that was intended to be conveyed. Thus a good figure caption can help to clarify those elements that are ambiguous.

A common idiom is to start a caption with a noun phrase that identifies the thing that you are looking at. However, you shouldn’t stop there. Aim for about 3 or 4 lines of text but use more or less depending on how easily your diagram can be explained.

A check-list for graphics

The following check-list should be used to ensure your graphics are of good quality:

Is the graphic sized correctly for the target column size (about 3.5 cm for double column, e.g.)?
Are fonts consistent and sized so that they are readable?
Is padding around text minimal (not wasting too much space)?
Is the colour scheme appropriate for the use?
Have you printed out and reviewed on paper?
Spell checked?

Some examples

Figure 5: In this original version of an architecture diagram, notice how the arrow heads are unclear. The blob on the right of the diagram is supposed to represent the finite element model of the structure (a subway terminal) but this also is unclear.

Figure 6: In the revised architecture diagram, the wireless sensors are more clearly shown with antennae that actually look like antennae. Clipart from a free clipart website was used here. The finite element model is also more clearly portrayed. Unfortunately, there is an error in the clipart used for showing the computer screen—the month of October has been skipped.

July 2, 2020July 4, 2020

Making tables reproducible

Key ideas

DON’T manually transfer table values to LaTeX – DO put the data in a separate file that gets loaded during compilation.
DON’T format your values and truncate decimal places manually – DO use a script to truncate values consistently.
DON’T manually insert units or convert exponents – DO use siunitx to format numbers and units.
DON’T end up with a jumble of scripts – DO tie your workflow together with a Makefile

Introduction

Developing reproducible research is a key element in producing robust results and good science.

To achieve research that is reproducible by others, we must first be able to reproduce it ourselves. That is, be able to come back to our source files in 6 months (or more!) and re-run any part of the analysis, reproduce any graph, check the values in the tables, and generally ensure that your results were not a fluke. How important reproducibility is to you will be discipline dependent but no self-respecting researcher should be publishing a paper that has results that cannot even be reproduced given access to the original data.

One element of reproducibility, that I want to focus on here, is to produce tables in such a way that manual error is avoided and that any value in the table can be recalculated.

Manual handling of numbers is a common source of error but it needn’t be. Once scripts are set up to automatically generate tables from the source data, they are easily modified to suit the next table, the next paper, or the next project. If you are not using a tool that supports you producing your tables directly, then this will be the biggest hurdle. However, without this step, it will be hard, not just to automate your tables, but to make your work completely reproducible.

Your tools will dictate, to some extent, how easy automation is to do. Try to avoid tools that encourage manual handling, such as Excel and Word, and switch instead to tools that make automation easier, like R and Python. I also recommend that you make use of GNU Make to tie everything together and make it easy to remember what to do when you come back in 6 months time.

Background

I don’t want to provide a detailed literature review here but I do want to make a note of some important trends in science.

The Open science movement is leading the way towards more transparent scientific practices. Scientists are starting to realise that the intellectual honesty that goes with open source software should also apply to their outputs. This means more than just making the output (or journal paper) freely available – it means making the `source code’ of that output available, including the original data used and analysis scripts.
A systematic study of biomedical research in 2005 by Ioannidis found that most published research findings are false. It seems likely that the problems identified for biomedicine are worse, not better, for other disciplines.
A 2016 survey of 1500 scientists asked if there was a reproducibility crisis? More than half said `yes’, with another third saying that there was a slight crisis. Scientists reported trouble reproducing others and many said they even had trouble reproducing their own experiments, when they attempted to do so.

The key message is that reproducibility is of fundamental importance and that we all need to work harder at enabling it for our own research.

Automating the process starting with table loading

Figure 1: Example results file to be converted into a table

Most researchers will manually transcribe this data into a LaTeX (or Word) document, like so:

\begin{tabular}{...}
Policy & Avg. Reward & Comfort score & Energy use (Wh)\\
bang-bang-et & $-2.82$ & $-0.72$ & $1950$ \\
bang-bang-avg & $-2.27$ & $-0.87$ & $721$ \\
...

Note how a few things needed to be manually transformed in this process.

Numbers need to be written in math-mode (using $ signs) to give a consistent font and to ensure that the minus signs look right.
Some rows or columns may not be relevant (the 1.48E+12 is actually to do with the Unix clock time when the result was generated).
The numbers need to be truncated or rounded appropriately. It’s a good question to ask “what’s appropriate?” here. If you have standard deviation or confidence intervals then round appropriately for that. Leaving in a large number of digits suggests that you don’t understand the uncertainty in your data.
Exponents need to be translated into a printable form. Note that the siunitx package has a nice facility for doing this automatically.

With so many little details to be taken care of, automating looks hard. Fortunately, there are some cool tools to help.

If the job is a simple one (or can be made simple), try using csvsimple to load in table. I won’t describe this here but there’s lots of help on the Internet.

The csvsimple package allows you to put extra commands, such as \si{} (from siunitx), around each table entry but for specialist needs (such as, truncating numbers) you may need to write your own script that writes a `tex’ file. This tex file will then need to be included into your main file with \input.

Truncating numbers appropriately

Quoting table values to 10 decimal places is clearly not appropriate. Most experiments, if tried again, will yield slightly different values. We should aim to express numbers in a way that appropriately reflects our uncertainty about the true value.

For example, imagine that we have an experiment that involves 10 trials and we record the mean measurement value from those trials. The standard deviation provides useful information about the likely precision of the mean. Confidence intervals can often be derived from the standard deviation given the sample size and assuming a normal distribution.

As a rule of thumb, the standard deviation should be expressed to one significant figure unless the number is between 11 and 19 (times some power of ten) in which case you can use two significant figures.

The measurement value should be expressed to agree in terms of decimal places with the standard deviation.

For example, a value resulting from a spreadsheet calculation of an average and standard deviation might be 10.1298 ± 0.2595. This should be expressed as 10.1 ± 0.3 or 10.1 (0.3) where the number in parenthesis is taken to be the estimated standard deviation. The estimate indicates that the value is only known to within three tenths of a unit of measurement. The figures beyond the tenths place are not informative to the reader and should be truncated.

Note that I’m glossing over the details here and it is worth reading more about measurement uncertainty.

The following code roughly obeys the above rules. The trick is to use a nice feature of the python string formatter that allows the number of digits after the decimal point to be parameterised.

def mean_string(m, s):
    return '{:6.{sig}f} ± {:.1g}'.format(m, s, sig=-int(np.floor(np.log10(s))))

>>> mean_string(0.016933, 0.005105)
' 0.017 ± 0.005'

The way this works is to work out the base 10 log of the s.d. This number will typically be negative (I haven’t dealt with s.d. > 1!). Taking the floor of this number will tell you how many digits after the decimal point need to be included to format the mean.

For this simple code, the s.d. is simply formatted with 1 significant figure. This might be improved by first finding out if the first two digits of the s.d. are between 11 and 19 inclusive and in that case formatting with 2 significant digits.

Makefiles to tie together and document

I always find that when I come back to a project after leaving it for a few weeks that I cannot remember what I’ve done. Some vague recollection exists, perhaps, of scripts that process one file into another but the details and ordering of the procedure have vanished from my memory.

In theory, one might document the process. However, this still leaves you with a manual process that may get out of step with the last version of the documentation.

A better approach is to weave the documentation and the code together into a master script. GNU Make provides a simple and effective method for doing this.

I should note that GNU Make simply does not handle having spaces or special characters in filenames—don’t even try. However, it is usually not such a burden to use hypens or underscores.

An excellent tutorial on using Make for reproducible research is provided by Arnold et al.

Conclusions and next steps

Automating your research can seem like cycling up a steep hill. Progress appears slow and it’s much harder work than usual. However, once over the hump, you’ll find the going much easier and generally much more rewarding.

In closing, I’d like to point again to documentation provided by The Turing Way Community as a good general source of information on how to make your research more reproducible. You may also want to look at an online course on reproducible science.

[1] The Turing Way Community, Becky Arnold, Louise Bowler, Sarah Gibson, Patricia Herterich, Rosie Higman, … Kirstie Whitaker. (2019, March 25). The Turing Way: A Handbook for Reproducible Data Science (Version v0.0.4). Zenodo. http://doi.org/10.5281/zenodo.3233986

Acknowledgements

This blog post was produced using Emacs, org-mode, and org2blog.