{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
Scientific analyses often consist of a number of tools that run one after the other, in order to go from the raw data to scientific insight. Between these specialized tools, simple data manipulation steps are often needed as a kind of “glue” between tools. For example, the output of tool A may produce a file that contains all the information needed as input for tool B, but tool B expects the columns in a different order. Or in genomic data analysis, some tools expect chromosome X to be listed as chrX
, while others simply expect X
. In these situations, extra data manipulation steps are needed to prepare files for input to analysis tools.
Galaxy has a large collection of tools to perform such basic data manipulation tasks, and becoming familiar with these operations will allow to perform your analysis more easily in Galaxy (and outside).
\n\n\nAgenda\nIn this tutorial, we will cover:
\n\n
\n- Cheatsheet
\n- Background
\n- Preamble
\n
Here is an overview table of the different data manipulations in this tutorial, with links to the tools in Galaxy.
\nOperation | \nDescription | \nGalaxy Tool | \n
---|---|---|
Compute on rows | \nto derive new column values from existing ones | \nSELECT x * 2 FROM y | \n
Concatenate datasets | \none after the other | \nSELECT * FROM x; union all; SELECT * FROM y | \n
Counting | \nCount occurrences of values in a column | \nSELECT count(x) FROM y where x = 'value' | \n
Cut Columns | \nBy header name | \nSELECT x, y, z FROM a | \n
Filter | \nRemove rows based on values in one or more columns | \n... WHERE x = 'value' | \n
Find and Replace | \nin a specific column | \nREPLACE() , regexp_replace in postgresql | \n
Group on a column | \nAnd perform simple operations (count, mean, min, max etc) | \n... GROUP BY x ... | \n
Join two Datasets | \nside by side on a specified field | \nSELECT * FROM x, y JOIN x.id = y.id | \n
Select First lines | \nGood for finding top 10s or saving header lines | \n... LIMIT 10 | \n
Sort on a column | \nChange the order of the rows based on values in one or more columns | \n... ORDER BY x ASC | \n
Unique | \nRemove duplicate rows | \nSELECT DISTINCT x FROM y | \n
In this tutorial, these functions are explained in more detail, and we provide some exercises for you to practice.
\nIn this tutorial, we will use as our dataset a table with results from the Olympics, from the games in Athens in 1896 until Tokyo in 2020. The objective is to familiarize you with a large number of the most important data manipulation tools in Galaxy. Much like the Olympics, there are many different disciplines (types of operations), and for each operation there are often multiple techniques (tools) available to athletes (data analysts, you) that are great for achieving the goal.
\n\nWe will show you many of these commonly needed data manipulation operations, and some examples of how to perform them in Galaxy. We also provide many exercises so that you can train your skills and become a data manipulation Olympian!
\n