{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Single Cell Publication - Data Plotting\n\nby [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n\n**Time Estimation: 1H**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

We’ll use some ggplot2 and tidyverse code to plot the data we collected in part 1.

\n
\n
**Agenda**
\n

In this tutorial, we will cover:

\n
    \n
1. Data Cleaning
2. Plot: Lines added/removed by date/time period
3. Contributions over time
4. New X over Time
5. Pageviews
\n
\n

We’ll use the tidyverse (which includes magrittr’s %>% pipe and ggplot2) to load and plot our data. reshape2 provides the dcast and melt functions, which reshape datasets into formats that are easier to plot.

\n
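library(tidyverse)\nlibrary(reshape2)\nlibrary(lubridate)  # provides floor_date(); library(tidyverse) only attaches it from tidyverse 2.0.0 onwards\n# This path will probably need to be changed depending on where you downloaded that dataset.\ndata = read_tsv(\"sc.tsv\")\n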
\n

# Data Cleaning

\n

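Let’s start by cleaning our data a bit. We’ve got a couple of problems with it, which the pipeline below handles: rows without a mergedAt value are dropped, additions and deletions are summed per num, class and merge date, deletions are negated so that they plot below the axis, and the ignore and image classes are excluded.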

\n\n
clean = data %>%\nselect(num, class, additions, deletions, mergedAt) %>%\nfilter(!is.na(mergedAt)) %>%\ngroup_by(num, class, mergedAt, month=floor_date(mergedAt, 'month'), quarter=floor_date(mergedAt, 'quarter')) %>%\nsummarise(additions=sum(additions), deletions=-sum(deletions)) %>%\nfilter(class != \"ignore\") %>%\nfilter(class != \"image\") %>%\narrange(mergedAt) %>%\nas_tibble()\n
\n
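
To see what the month and quarter columns end up containing, here is a minimal sketch of floor_date applied to two made-up timestamps (the values are illustrative and not taken from the dataset):

```r
library(lubridate)

# Two hypothetical merge timestamps, for illustration only.
merged = as.POSIXct(c('2024-02-14 10:30:00', '2024-05-03 08:00:00'), tz = 'UTC')

# Round each timestamp down to the start of its month and of its quarter.
by_month = floor_date(merged, 'month')      # 2024-02-01 and 2024-05-01
by_quarter = floor_date(merged, 'quarter')  # 2024-01-01 and 2024-04-01
```

Every row merged within the same month (or quarter) therefore shares one x value, which is what lets the bars stack per period later on.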

We’ll set up a ‘theme’ for our plots that mainly consists of the black and white theme, which is quite elegant and readable, with some font sizes made a wee bit larger:

\n
theme = theme_bw() + theme(\naxis.text=element_text(size=14),\nplot.title=element_text(size=18),\naxis.title=element_text(size=16),\nlegend.title=element_text(size=14),\nlegend.text=element_text(size=14))\n
\n

We want all of our plots to look the same, which is why we use this trick: it keeps a consistent aesthetic without having to re-type the configuration every single time.

\n

Let’s get to plotting!

\n

# Plot: Lines added/removed by date/time period

\n

We have a dataset that looks like:

\n
clean %>% head(10)\n
\n

We’ll want to plot the changes against time (either quarter or month), and then colour them by the class of the addition/removal. That translates into an aesthetics statement like aes(x=time, y=additions, fill=class)

\n

In ggplot2 you can supply data in a few different ways. If you provide the dataset up front, e.g. data %>% ggplot(aes(x=a, y=b)) + geom_something(), then geom_something takes its data from the ggplot call. However, if you want to plot multiple series, you can also provide the data directly to the geom_* functions, like so:

\n
ggplot() +\ngeom_col(data=clean, aes(x=quarter, y=additions, fill=class)) +\ngeom_col(data=clean, aes(x=quarter, y=deletions, fill=class)) + scale_fill_brewer(palette = \"Paired\") +\ngeom_point(data=clean, aes(x=mergedAt, y=0), shape=3, alpha=0.3, color=\"black\") +\ntheme +\nxlab(\"Quarter\") + ylab(\"Lines added or removed\") + guides(fill=guide_legend(title=\"Category\")) +\nggtitle(\"Lines added or removed by file type across GTN Single Cell Learning Materials\")\nggsave(\"sc-lines-by-quarter.png\", width=12, height=5)\nggplot() +\ngeom_col(data=clean, aes(x=month, y=additions, fill=class)) +\ngeom_col(data=clean, aes(x=month, y=deletions, fill=class)) + scale_fill_brewer(palette = \"Paired\") +\ntheme +\ngeom_point(data=clean, aes(x=mergedAt, y=0), shape=3, alpha=0.3, color=\"black\") +\nxlab(\"Month\") + ylab(\"Lines added or removed\") + guides(fill=guide_legend(title=\"Category\")) +\nggtitle(\"Lines added or removed by file type across GTN Single Cell Learning Materials\")\nggsave(\"sc-lines-by-month.png\", width=12, height=5)\n
\n
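
Stripped down to its essentials, the per-layer pattern looks roughly like the sketch below. It uses a small made-up data frame (toy, with hypothetical values) instead of clean, so it can be run on its own:

```r
library(ggplot2)

# Hypothetical data: one row per quarter, deletions already negated.
toy = data.frame(
  quarter = as.Date(c('2024-01-01', '2024-04-01')),
  additions = c(120, 80),
  deletions = c(-30, -10)
)

# ggplot() is called without data; each geom_col layer brings its own
# data and aesthetics, so additions rise above zero and deletions fall below.
p = ggplot() +
  geom_col(data = toy, aes(x = quarter, y = additions)) +
  geom_col(data = toy, aes(x = quarter, y = deletions))
```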


\n

In this case we plotted the data for additions and deletions separately, and we additionally added points based on the actual date of the PRs to visualise their density.

\n

Let’s also produce a running sum (cumulative) plot. We’ll start by reshaping our data. Currently we have data that looks like:


| time      | variable | value |
|-----------|----------|-------|
| today     | measure1 | 1     |
| today     | measure2 | 10    |
| yesterday | measure1 | 30    |
| yesterday | measure2 | 5     |

\n

And we’ll re-shape this to look like this, which will make it easier to calculate changes over the course of a specific series:


| time      | measure1 | measure2 |
|-----------|----------|----------|
| today     | 1        | 10       |
| yesterday | 30       | 5        |

\n

We’ll use the dcast function to do that:

\n
clean %>% select(month, class, additions)  %>% dcast(month ~ class, value.var=\"additions\") %>% head()\n
\n

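That doesn’t look quite right: when several rows fall into the same cell and no aggregation function is given, dcast defaults to counting them (length). Let’s change how the data is aggregated: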

\n
# Sum the additions in each month/class cell instead of counting rows\nclean %>% select(month, class, additions)  %>% dcast(month ~ class, value.var=\"additions\", fun.aggregate = sum)\n
\n
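
The effect of the aggregation function is easiest to see on a tiny example. The sketch below uses made-up values modelled on the small table above, with one time/variable pair deliberately duplicated:

```r
library(reshape2)

# Hypothetical long-format data; today/measure1 appears twice.
long = data.frame(
  time = c('today', 'today', 'today', 'yesterday', 'yesterday'),
  variable = c('measure1', 'measure1', 'measure2', 'measure1', 'measure2'),
  value = c(1, 4, 10, 30, 5)
)

# No fun.aggregate: dcast falls back to length, so cells hold row counts.
counts = dcast(long, time ~ variable, value.var = 'value')

# fun.aggregate = sum: the values in each cell are added together.
sums = dcast(long, time ~ variable, value.var = 'value', fun.aggregate = sum)
```

Here the today/measure1 cell is 2 in counts (two rows) but 5 in sums (1 + 4), which is the behaviour we want for lines added.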

Let’s do it for real now:

\n
cumulative = clean %>% select(month, class, additions)  %>%\ndcast(month ~ class, value.var=\"additions\", fun.aggregate = sum) %>%\nmutate(across(bibliography:workflows, cumsum)) %>%\nreshape2::melt(id.var=\"month\")\ncumulative %>% ggplot(aes(x=month, y=value, color=variable)) + geom_line() +\ntheme_bw() + theme +\nxlab(\"Month\") + ylab(\"Lines added\") + guides(color=guide_legend(title=\"Category\")) +\nggtitle(\"Cumulative lines added by category, across GTN Single Cell materials\")\nggsave(\"sc-lines-cumulative.png\", width=12, height=8)\n
\n
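
The running sum itself comes from applying cumsum to every category column. A minimal sketch, assuming a small wide table with hypothetical faqs and tutorials columns that is already sorted by month:

```r
library(dplyr)

# Hypothetical wide-format data: lines added per month and category.
wide = data.frame(
  month = as.Date(c('2024-01-01', '2024-02-01', '2024-03-01')),
  faqs = c(1, 0, 2),
  tutorials = c(3, 1, 0)
)

# across() applies cumsum to each column in the faqs:tutorials range,
# turning per-month counts into running totals.
running = wide %>% mutate(across(faqs:tutorials, cumsum))
# running$faqs is 1, 1, 3 and running$tutorials is 3, 4, 4
```

Because cumsum works row by row, the rows need to be in date order first; dcast already returns them sorted by the left-hand side of the formula.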


\n

# Contributions over time

\n

Let’s again start with some cleaning, namely removing all rows with a count of 0 and removing future records (at the time of writing).

\n
roles = read_tsv(\"sc-roles.tsv\")\nw = roles %>%\nfilter(count != 0) %>%\nfilter(!grepl(\"2025\", date)) %>%\nfilter(!grepl(\"2024-12-01\", date))\nw %>%\nggplot(aes(x=date, y=count, color=area)) +\ntheme +\nxlab(\"Date\") + ylab(\"Unique Contributors\") + guides(color=guide_legend(title=\"Contribution Area\")) +\nggtitle(\"Contributors over time to GTN Single Cell Materials\") +\ngeom_line()\nggsave(\"sc-contribs.png\", width=12, height=6)\n
\n


\n

# New X over Time

\n

Let’s plot all of the new single cell things added over time: all the new FAQs, tutorials, slides, etc.

\n
added_by_time = read_tsv(\"single-cell-over-time.tsv\")\nadded_by_time %>% dcast(date ~ `type`) %>%\narrange(date) %>%\nmutate(across(event:workflow, cumsum)) %>%\nmelt(id.var=\"date\") %>%\nas_tibble() %>% arrange(date) %>%\nggplot(aes(x=date, y=value, color=variable)) + geom_line() +\ntheme_bw() + theme +\nxlab(\"Date\") + ylab(\"New Single Cell Items\") + guides(color=guide_legend(title=\"Contribution Type\")) +\nggtitle(\"New Single Cell events, FAQs, news, slides, tutorials, videos and workflows in the GTN\")\nggsave(\"sc-files-cumulative.png\", width=6, height=6)\n
\n

We may also want to know how many changes there have been since the start date of this study, e.g. October 1st, 2020:

\n
since_oct = added_by_time %>% dcast(date ~ `type`) %>%\narrange(date) %>%\nmutate(across(event:workflow, cumsum)) %>%\nfilter(date > as.Date(\"2020-10-01\"))\n# Pull out the first/last date as our start and end\nstart_date = (since_oct %>% head(n=1))$date\nend_date = (since_oct %>% tail(n=1))$date\n# Table of our changes.\nsince_oct %>%\nfilter(date == start_date | date == end_date) %>%  # Just those rows\nmutate(date = case_when(date == start_date ~ 'start', date == end_date ~ 'end')) %>% # Relabel the start/end as literal string start and end\npivot_longer(-date) %>% pivot_wider(names_from=date, values_from=value) %>% # Transpose the data\nmutate(increase=end - start) %>% # And calculate our increase\nselect(name, increase) %>% arrange(-increase)\n
\n

# Pageviews

\n

GTN uses the Galaxy Europe Plausible server for collecting metrics (you can change your preferences in your GTN privacy preferences).\nWe can download the data from the server and plot it, filtering by our preferred start/end dates and filters (namely that page includes /topics/single-cell/).\nUnfortunately the data is downloaded as a zip file which we’ll then need to extract data from:

\n
system(\"wget 'https://plausible.galaxyproject.eu/training.galaxyproject.org/export?period=custom&date=2024-11-07&from=2022-08-01&to=2024-11-07&filters=%7B%22page%22%3A%22~%2Ftopics%2Fsingle-cell%2F%22%7D&with_imported=true&interval=date' -O sc-stats.zip\")\n
\n

We can use the unzip function to extract a single file from the zip; it returns the path of the extracted file, which we pass straight to read_csv:

\n
views = read_csv(unzip(\"sc-stats.zip\", \"visitors.csv\"))\n
\n

With that we’re ready to plot. We’re going to use a new feature for our plot, annotate. Annotation allows you to draw arbitrary features atop your plot. In this case we’re going to draw rectangles to indicate outages and events that might have affected our data.

\n
y = 3300\nxoff = 3\nviews %>% group_by(date=floor_date(date, 'week')) %>%\nsummarise(visitors=sum(visitors), pageviews=sum(pageviews)) %>%  # date is the grouping column, so it is kept automatically\nfilter(visitors != 0 | pageviews != 0) %>%\nmelt(id.var=\"date\")  %>%\nggplot(aes(x=date, y=value, color=variable)) + geom_line() +\nannotate(\"rect\", xmin = as.Date(\"2023-10-01\"), xmax = as.Date(\"2023-10-27\"), ymin = 0, ymax = y,  alpha = .2) + # Outage\nannotate(\"rect\", xmin = as.Date(\"2023-05-22\") - xoff, xmax = as.Date(\"2023-05-26\") - xoff, ymin = 0, ymax = y,  alpha = .2) + # Smorg3\nannotate(\"rect\", xmin = as.Date(\"2024-10-07\") - xoff, xmax = as.Date(\"2024-10-11\") - xoff, ymin = 0, ymax = y,  alpha = .2) + # GTA\nannotate(\"rect\", xmin = as.Date(\"2024-09-16\") - xoff, xmax = as.Date(\"2024-09-20\") - xoff, ymin = 0, ymax = y,  alpha = .2) + # Bootcamp\ntheme_bw() + theme +\nscale_y_continuous(expand = c(0, 0), limits = c(0, NA)) +\nxlab(\"Date\") + ylab(\"Count\") + guides(color=guide_legend(title=\"Metric\")) +\nggtitle(\"GTN Single Cell Visits\")\nggsave(\"sc-pageviews.png\", width=14, height=4)\n
\n

And that gives us our lovely pageview plot!

\n


\n
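
For reference, the annotate pattern on its own can be sketched as follows. The data frame and the shaded date range are made up for illustration, and ymax = Inf is used here (instead of a fixed height like y above) so the rectangle always spans the full panel:

```r
library(ggplot2)

# Hypothetical daily counts over ten days.
toy = data.frame(
  date = as.Date('2024-01-01') + 0:9,
  value = c(5, 7, 2, 1, 1, 6, 8, 9, 7, 10)
)

# annotate() draws a single rectangle from fixed coordinates rather than
# from a data frame; alpha keeps the line visible underneath it.
p = ggplot(toy, aes(x = date, y = value)) +
  geom_line() +
  annotate('rect',
           xmin = as.Date('2024-01-03'), xmax = as.Date('2024-01-05'),
           ymin = 0, ymax = Inf, alpha = 0.2)
```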
\n
\n

Discuss this with us and we can perhaps generalise this analysis, to reduce the amount of data processing you need to do, and make it more accessible for everyone!

\n
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/contributing/tutorials/meta-analysis-plot/tutorial.html#feedback) and check there for further resources!\n" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "r" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "4.1.0" } } }