{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
\n\n# Data Manipulation Olympics - SQL\n\nby [Saskia Hiltemann](https://training.galaxyproject.org/hall-of-fame/shiltemann/), [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- How can I do basic data manipulation in SQL?\n- Which functions are available to convert, reformat, filter, sort etc my data stored in a database?\n\n**Objectives**\n\n- Familiarize yourself with data manipulation in SQL\n- Perform basic SQL query tasks in Galaxy\n- Reason about the expected outcome of tools\n\n**Time Estimation: 1h**\n
\n", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-0", "source": "

Scientific analyses often consist of a number of tools that run one after the other, in order to go from the raw data to scientific insight. Between these specialized tools, simple data manipulation steps are often needed as a kind of “glue” between tools. For example, the output of tool A may produce a file that contains all the information needed as input for tool B, but tool B expects the columns in a different order. Or in genomic data analysis, some tools expect chromosome X to be listed as chrX, while others simply expect X. In these situations, extra data manipulation steps are needed to prepare files for input to analysis tools.

\n\n

Galaxy has a large collection of tools to perform such basic data manipulation tasks, and becoming familiar with these operations will allow to perform your analysis more easily in Galaxy (and outside).

\n
\n
Agenda
\n

In this tutorial, we will cover:

\n
    \n
  1. Cheatsheet
  2. \n
  3. Background
  4. \n
  5. Preamble
  6. \n
\n
\n

Cheatsheet

\n

Here is an overview table of the different data manipulations in this tutorial, with links to the tools in Galaxy.

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
OperationDescriptionGalaxy Tool
Compute on rowsto derive new column values from existing onesSELECT x * 2 FROM y
Concatenate datasetsone after the otherSELECT * FROM x; union all; SELECT * FROM y
CountingCount occurrences of values in a columnSELECT count(x) FROM y where x = 'value'
Cut ColumnsBy header nameSELECT x, y, z FROM a
FilterRemove rows based on values in one or more columns... WHERE x = 'value'
Find and Replacein a specific columnREPLACE(), regexp_replace in postgresql
Group on a columnAnd perform simple operations (count, mean, min, max etc)... GROUP BY x ...
Join two Datasetsside by side on a specified fieldSELECT * FROM x, y JOIN x.id = y.id
Select First linesGood for finding top 10s or saving header lines... LIMIT 10
Sort on a columnChange the order of the rows based on values in one or more columns... ORDER BY x ASC
UniqueRemove duplicate rowsSELECT DISTINCT x FROM y
\n

In this tutorial, these functions are explained in more detail, and we provide some exercises for you to practice.

\n

Background

\n

In this tutorial, we will use as our dataset a table with results from the Olympics, from the games in Athens in 1896 until Tokyo in 2020. The objective is to familiarize you with a large number of the most important data manipulation tools in Galaxy. Much like the Olympics, there are many different disciplines (types of operations), and for each operation there are often multiple techniques (tools) available to athletes (data analysts, you) that are great for achieving the goal.

\n

\"image

\n

We will show you many of these commonly needed data manipulation operations, and some examples of how to perform them in Galaxy. We also provide many exercises so that you can train your skills and become a data manipulation Olympian!

\n

Preamble

\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "# This preamble sets up the sql \"magic\" for jupyter. Use %%sql in your cells to write sql!\n", "!python3 -m pip install ipython-sql sqlalchemy\n", "!wget -c https://zenodo.org/record/6803028/files/olympics.db\n", "import sqlalchemy\n", "engine = sqlalchemy.create_engine(\"sqlite:///olympics.db\")\n", "%load_ext sql\n", "%sql sqlite:///olympics.db\n", "%config SqlMagic.displaycon=False" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ "