{
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5,
  "cells": [
    {
      "id": "metadata",
      "cell_type": "markdown",
      "source": "<div style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em; padding: 0.5em;\">\n\n# Introduction to SQL\n\nby [The Carpentries](https://training.galaxyproject.org/hall-of-fame/carpentries/), [Helena Rasche](https://training.galaxyproject.org/hall-of-fame/hexylena/), [Donny Vrins](https://training.galaxyproject.org/hall-of-fame/dirowa/), [Bazante Sanders](https://training.galaxyproject.org/hall-of-fame/bazante1/), [Avans Hogeschool](https://training.galaxyproject.org/hall-of-fame/avans-atgm/)\n\nCC-BY licensed content from the [Galaxy Training Network](https://training.galaxyproject.org/)\n\n**Objectives**\n\n- How can I get data from a database?\n- How can I sort a query's results?\n- How can I remove duplicate values from a query's results?\n- How can I select subsets of data?\n- How can I calculate new values on the fly?\n- How do databases represent missing information?\n- What special handling does missing information require?\n\n**Objectives**\n\n- Explain the difference between a table, a record, and a field.\n- Explain the difference between a database and a database manager.\n- Write a query to select all values for specific fields from a single table.\n- Write queries that display results in a particular order.\n- Write queries that eliminate duplicate values from data.\n- Write queries that select records that satisfy user-specified conditions.\n- Explain the order in which the clauses in a query are executed.\n- Write queries that calculate new values for each selected record.\n- Explain how databases represent missing information.\n- Explain the three-valued logic databases use when manipulating missing information.\n- Write queries that handle missing information correctly.\n\n**Time Estimation: 3H**\n</div>\n",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-0",
      "source": "<p>This tutorial will introduce you to Structured Query Language (SQL) which can be used to query databases!</p>\n<blockquote class=\"comment\" style=\"border: 2px solid #ffecc1; margin: 1em 0.2em\">\n<div class=\"box-title comment-title\" id=\"comment\"><i class=\"far fa-comment-dots\" aria-hidden=\"true\" ></i> Comment</div>\n<p>This tutorial is <strong>significantly</strong> based on <a href=\"https://carpentries.org\">the Carpentries</a> <a href=\"https://github.com/swcarpentry/sql-novice-survey/\">Databases and SQL</a> lesson, which is licensed CC-BY 4.0.</p>\n<p>Abigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.”  Version 2017.08, August 2017,\n<a href=\"https://github.com/swcarpentry/sql-novice-survey\">github.com/swcarpentry/sql-novice-survey</a>, <a href=\"https://doi.org/10.5281/zenodo.838776\">https://doi.org/10.5281/zenodo.838776</a></p>\n<p>Adaptations have been made to make this work better in a GTN/Galaxy environment.</p>\n</blockquote>\n<blockquote class=\"agenda\" style=\"border: 2px solid #86D486;display: none; margin: 1em 0.2em\">\n<div class=\"box-title agenda-title\" id=\"agenda\">Agenda</div>\n<p>In this tutorial, we will cover:</p>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-1",
      "source": [
        "# This preamble sets up the sql \"magic\" for jupyter. Use %%sql in your cells to write sql!\n",
        "!python3 -m pip install ipython-sql sqlalchemy\n",
        "!wget -c http://swcarpentry.github.io/sql-novice-survey/files/survey.db\n",
        "import sqlalchemy\n",
        "engine = sqlalchemy.create_engine(\"sqlite:///survey.db\")\n",
        "%load_ext sql\n",
        "%sql sqlite:///survey.db\n",
        "%config SqlMagic.displaycon=False"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-2",
      "source": "<h1 id=\"selecting-data\">Selecting Data</h1>\n<p>A relational database\nis a way to store and manipulate information.\nDatabases are arranged as table.\nEach table has columns (also known as fields) that describe the data,\nand rows (also known as records) which contain the data.</p>\n<p>When we are using a spreadsheet,\nwe put formulas into cells to calculate new values based on old ones.\nWhen we are using a database,\nwe send commands\n(usually called queries)\nto a database manager:\na program that manipulates the database for us.\nThe database manager does whatever lookups and calculations the query specifies,\nreturning the results in a tabular form\nthat we can then use as a starting point for further queries.</p>\n<p>Queries are written in a language called <abbr title=\"Structured Query Language\">SQL</abbr>,\nSQL provides hundreds of different ways to analyze and recombine data.\nWe will only look at a handful of queries,\nbut that handful accounts for most of what scientists do.</p>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-changing-database-managers\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-changing-database-managers\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Changing Database Managers</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>Many database managers — Oracle,\nIBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite —  understand\nSQL but each stores data in a different way,\nso a database created with one cannot be used directly by another.\nHowever, every database manager\ncan import and export data in a variety of formats like .csv, SQL,\nso it <em>is</em> possible to move information from one to another.</p>\n</blockquote>\n<p>Before we get into using <abbr title=\"Structured Query Language\">SQL</abbr> to select the data, let’s take a look at the tables of the database we will use in our examples:</p>\n<p><strong>Person</strong>: people who took readings.</p>\n<table>\n<thead>\n<tr>\n<th>id</th>\n<th>personal</th>\n<th>family</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>dyer</td>\n<td>William</td>\n<td>Dyer</td>\n</tr>\n<tr>\n<td>pb</td>\n<td>Frank</td>\n<td>Pabodie</td>\n</tr>\n<tr>\n<td>lake</td>\n<td>Anderson</td>\n<td>Lake</td>\n</tr>\n<tr>\n<td>roe</td>\n<td>Valentina</td>\n<td>Roerich</td>\n</tr>\n<tr>\n<td>danforth</td>\n<td>Frank</td>\n<td>Danforth</td>\n</tr>\n</tbody>\n</table>\n<p><strong>Site</strong>: locations where readings were taken.</p>\n<table>\n<thead>\n<tr>\n<th>name</th>\n<th>lat</th>\n<th>long</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>DR-1</td>\n<td>-49.85</td>\n<td>-128.57</td>\n</tr>\n<tr>\n<td>DR-3</td>\n<td>-47.15</td>\n<td>-126.72</td>\n</tr>\n<tr>\n<td>MSK-4</td>\n<td>-48.87</td>\n<td>-123.4</td>\n</tr>\n</tbody>\n</table>\n<p><strong>Visited</strong>: when readings were taken at specific sites.</p>\n<table>\n<thead>\n<tr>\n<th>id</th>\n<th>site</th>\n<th>dated</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>619</td>\n<td>DR-1</td>\n<td>1927-02-08</td>\n</tr>\n<tr>\n<td>622</td>\n<td>DR-1</td>\n<td>1927-02-10</td>\n</tr>\n<tr>\n<td>734</td>\n<td>DR-3</td>\n<td>1930-01-07</td>\n</tr>\n<tr>\n<td>735</td>\n<td>DR-3</td>\n<td>1930-01-12</td>\n</tr>\n<tr>\n<td>751</td>\n<td>DR-3</td>\n<td>1930-02-26</td>\n</tr>\n<tr>\n<td>752</td>\n<td>DR-3</td>\n<td>None</td>\n</tr>\n<tr>\n<td>837</td>\n<td>MSK-4</td>\n<td>1932-01-14</td>\n</tr>\n<tr>\n<td>844</td>\n<td>DR-1</td>\n<td>1932-03-22</td>\n</tr>\n</tbody>\n</table>\n<p><strong>Survey</strong>: the actual readings.  The field <code style=\"color: inherit\">quant</code> is short for quantitative and indicates what is being measured.  Values are <code style=\"color: inherit\">rad</code>, <code style=\"color: inherit\">sal</code>, and <code style=\"color: inherit\">temp</code> referring to ‘radiation’, ‘salinity’ and ‘temperature’, respectively.</p>\n<table>\n<thead>\n<tr>\n<th>taken</th>\n<th>person</th>\n<th>quant</th>\n<th>reading</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>619</td>\n<td>dyer</td>\n<td>rad</td>\n<td>9.82</td>\n</tr>\n<tr>\n<td>619</td>\n<td>dyer</td>\n<td>sal</td>\n<td>0.13</td>\n</tr>\n<tr>\n<td>622</td>\n<td>dyer</td>\n<td>rad</td>\n<td>7.8</td>\n</tr>\n<tr>\n<td>622</td>\n<td>dyer</td>\n<td>sal</td>\n<td>0.09</td>\n</tr>\n<tr>\n<td>734</td>\n<td>pb</td>\n<td>rad</td>\n<td>8.41</td>\n</tr>\n<tr>\n<td>734</td>\n<td>lake</td>\n<td>sal</td>\n<td>0.05</td>\n</tr>\n<tr>\n<td>734</td>\n<td>pb</td>\n<td>temp</td>\n<td>-21.5</td>\n</tr>\n<tr>\n<td>735</td>\n<td>pb</td>\n<td>rad</td>\n<td>7.22</td>\n</tr>\n<tr>\n<td>735</td>\n<td>None</td>\n<td>sal</td>\n<td>0.06</td>\n</tr>\n<tr>\n<td>735</td>\n<td>None</td>\n<td>temp</td>\n<td>-26.0</td>\n</tr>\n<tr>\n<td>751</td>\n<td>pb</td>\n<td>rad</td>\n<td>4.35</td>\n</tr>\n<tr>\n<td>751</td>\n<td>pb</td>\n<td>temp</td>\n<td>-18.5</td>\n</tr>\n<tr>\n<td>751</td>\n<td>lake</td>\n<td>sal</td>\n<td>0.1</td>\n</tr>\n<tr>\n<td>752</td>\n<td>lake</td>\n<td>rad</td>\n<td>2.19</td>\n</tr>\n<tr>\n<td>752</td>\n<td>lake</td>\n<td>sal</td>\n<td>0.09</td>\n</tr>\n<tr>\n<td>752</td>\n<td>lake</td>\n<td>temp</td>\n<td>-16.0</td>\n</tr>\n<tr>\n<td>752</td>\n<td>roe</td>\n<td>sal</td>\n<td>41.6</td>\n</tr>\n<tr>\n<td>837</td>\n<td>lake</td>\n<td>rad</td>\n<td>1.46</td>\n</tr>\n<tr>\n<td>837</td>\n<td>lake</td>\n<td>sal</td>\n<td>0.21</td>\n</tr>\n<tr>\n<td>837</td>\n<td>roe</td>\n<td>sal</td>\n<td>22.5</td>\n</tr>\n<tr>\n<td>844</td>\n<td>roe</td>\n<td>rad</td>\n<td>11.25</td>\n</tr>\n</tbody>\n</table>\n<p>Notice that three entries — one in the <code style=\"color: inherit\">Visited</code> table,\nand two in the <code style=\"color: inherit\">Survey</code> table — don’t contain any actual\ndata, but instead have a special <code style=\"color: inherit\">None</code> entry:\nwe’ll return to these missing values.</p>\n<p>For now,\nlet’s write an SQL query that displays scientists’ names.\nWe do this using the SQL command <code style=\"color: inherit\">SELECT</code>,\ngiving it the names of the columns we want and the table we want them from.\nOur query and its output look like this:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-3",
      "source": [
        "%%sql\n",
        "SELECT family, personal FROM Person;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-4",
      "source": "<p>The semicolon at the end of the query\ntells the database manager that the query is complete and ready to run.\nWe have written our commands in upper case and the names for the table and columns\nin lower case,\nbut we don’t have to:\nas the example below shows,\nSQL is case insensitive.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-5",
      "source": [
        "%%sql\n",
        "SeLeCt FaMiLy, PeRsOnAl FrOm PeRsOn;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-6",
      "source": "<p>You can use SQL’s case insensitivity to your advantage. For instance,\nsome people choose to write SQL keywords (such as <code style=\"color: inherit\">SELECT</code> and <code style=\"color: inherit\">FROM</code>)\nin capital letters and <strong>field</strong> and <strong>table</strong> names in lower\ncase. This can make it easier to locate parts of an SQL statement. For\ninstance, you can scan the statement, quickly locate the prominent\n<code style=\"color: inherit\">FROM</code> keyword and know the table name follows.  Whatever casing\nconvention you choose, please be consistent: complex queries are hard\nenough to read without the extra cognitive load of random\ncapitalization.  One convention is to use UPPER CASE for SQL\nstatements, to distinguish them from tables and column names. This is\nthe convention that we will use for this lesson.</p>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-is-a-personal-and-family-name-column-a-good-design\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Is a personal and family name column a good design?</div>\n<p>If you were tasked with designing a database to store this same data, is storing the name data in\nthis way the best way to do it? Why or why not?</p>\n<p>Can you think of any names that would be difficult to enter in such a schema?</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>No, it is generally not. There are a lot of <a href=\"https://shinesolutions.com/2018/01/08/falsehoods-programmers-believe-about-names-with-examples/\">falsehoods that programmers believe about names</a>.\nThe situation is much more complex as you can read in that article, but names vary wildly and\ngenerally placing constraints on how names are entered is only likely to frustrate you or your\nusers later on when they need to enter data into that database.</p>\n<p>In general you should consider using a single text field for the name and allowing users to\nspecify them as whatever they like (if it is a system with registration), or asking what they\nwish to be recorded (if you are doing this sort of data collection).</p>\n<p>If you are doing scientific research, you might know that names are generally very poor\nidentifiers of a single human, and in that case consider recording their\n<a href=\"https://orcid.org/\">ORCiD</a> which will help you reference that individual when you are\npublishing later.</p>\n<p>This is also a good time to consider what data you really <em>need</em> to collect. If you are working\nin the EU under GDPR, do you really need their full legal name? Is that necessary? Do you have a\nplan for ensuring that data is correct when publishing, if any part of their name has changed\nsince?</p>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-7",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-8",
      "source": "<p>While we are on the topic of SQL’s syntax, one aspect of SQL’s syntax\nthat can frustrate novices and experts alike is forgetting to finish a\ncommand with <code style=\"color: inherit\">;</code> (semicolon).  When you press enter for a command\nwithout adding the <code style=\"color: inherit\">;</code> to the end, it can look something like this:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT id FROM Person\n...&gt;\n...&gt;\n</code></pre></div></div>\n<p>This is SQL’s prompt, where it is waiting for additional commands or\nfor a <code style=\"color: inherit\">;</code> to let SQL know to finish.  This is easy to fix!  Just type\n<code style=\"color: inherit\">;</code> and press enter!</p>\n<p>Now, going back to our query,\nit’s important to understand that\nthe rows and columns in a database table aren’t actually stored in any particular order.\nThey will always be <em>displayed</em> in some order,\nbut we can control that in various ways.\nFor example,\nwe could swap the columns in the output by writing our query as:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-9",
      "source": [
        "%%sql\n",
        "SELECT personal, family FROM Person;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-10",
      "source": "<p>or even repeat columns:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-11",
      "source": [
        "%%sql\n",
        "SELECT id, id, id FROM Person;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-12",
      "source": "<p>As a shortcut,\nwe can select all of the columns in a table using <code style=\"color: inherit\">*</code>:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-13",
      "source": [
        "%%sql\n",
        "SELECT * FROM Person;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-14",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-selecting-site-names\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Selecting Site Names</div>\n<p>Write a query that selects only the <code style=\"color: inherit\">name</code> column from the <code style=\"color: inherit\">Site</code> table.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-1\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-1\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT name FROM Site;\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>name</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>DR-1</td>\n</tr>\n<tr>\n<td>DR-3</td>\n</tr>\n<tr>\n<td>MSK-4</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-15",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-16",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-query-style\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Query Style</div>\n<p>Many people format queries as:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT personal, family FROM person;\n</code></pre></div>  </div>\n<p>or as:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">select Personal, Family from PERSON;\n</code></pre></div>  </div>\n<p>What style do you find easiest to read, and why?</p>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-17",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-18",
      "source": "<h1 id=\"sorting-and-removing-duplicates\">Sorting and Removing Duplicates</h1>\n<p>In beginning our examination of the Antarctic data, we want to know:</p>\n<ul>\n<li>what kind of quantity measurements were taken at each site;</li>\n<li>which scientists took measurements on the expedition;</li>\n</ul>\n<p>To determine which measurements were taken at each site,\nwe can examine the <code style=\"color: inherit\">Survey</code> table.\nData is often redundant,\nso queries often return redundant information.\nFor example,\nif we select the quantities that have been measured\nfrom the <code style=\"color: inherit\">Survey</code> table,\nwe get this:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-19",
      "source": [
        "%%sql\n",
        "SELECT quant FROM Survey;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-20",
      "source": "<p>This result makes it difficult to see all of the different types of\n<code style=\"color: inherit\">quant</code> in the Survey table.  We can eliminate the redundant output to\nmake the result more readable by adding the <code style=\"color: inherit\">DISTINCT</code> keyword to our\nquery:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-21",
      "source": [
        "%%sql\n",
        "SELECT DISTINCT quant FROM Survey;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-22",
      "source": "<p>If we want to determine which visit (stored in the <code style=\"color: inherit\">taken</code> column)\nhave which <code style=\"color: inherit\">quant</code> measurement,\nwe can use the <code style=\"color: inherit\">DISTINCT</code> keyword on multiple columns.\nIf we select more than one column,\ndistinct <em>sets</em> of values are returned\n(in this case <em>pairs</em>, because we are selecting two columns):</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-23",
      "source": [
        "%%sql\n",
        "SELECT DISTINCT taken, quant FROM Survey;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-24",
      "source": "<p>Notice in both cases that duplicates are removed\neven if the rows they come from didn’t appear to be adjacent in the database table.</p>\n<p>Our next task is to identify the scientists on the expedition by looking at the <code style=\"color: inherit\">Person</code> table.\nAs we mentioned earlier,\ndatabase records are not stored in any particular order.\nThis means that query results aren’t necessarily sorted,\nand even if they are,\nwe often want to sort them in a different way,\ne.g., by their identifier instead of by their personal name.\nWe can do this in SQL by adding an <code style=\"color: inherit\">ORDER BY</code> clause to our query:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-25",
      "source": [
        "%%sql\n",
        "SELECT * FROM Person ORDER BY id;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-26",
      "source": "<table>\n<thead>\n<tr>\n<th>id</th>\n<th>personal</th>\n<th>family</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>danfort</td>\n<td>Frank</td>\n<td>Danforth</td>\n</tr>\n<tr>\n<td>dyer</td>\n<td>William</td>\n<td>Dyer</td>\n</tr>\n<tr>\n<td>lake</td>\n<td>Anderson</td>\n<td>Lake</td>\n</tr>\n<tr>\n<td>pb</td>\n<td>Frank</td>\n<td>Pabodie</td>\n</tr>\n<tr>\n<td>roe</td>\n<td>Valentina</td>\n<td>Roerich</td>\n</tr>\n</tbody>\n</table>\n<p>By default, when we use <code style=\"color: inherit\">ORDER BY</code>,\nresults are sorted in ascending order of the column we specify\n(i.e.,\nfrom least to greatest).</p>\n<p>We can sort in the opposite order using <code style=\"color: inherit\">DESC</code> (for “descending”):</p>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-a-note-on-ordering\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-a-note-on-ordering\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: A note on ordering</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>While it may look that the records are consistent every time we ask for them in this lesson, that is because no one has changed or modified any of the data so far. Remember to use <code style=\"color: inherit\">ORDER BY</code> if you want the rows returned to have any sort of consistent or predictable order.</p>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-27",
      "source": [
        "%%sql\n",
        "SELECT * FROM person ORDER BY id DESC;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-28",
      "source": "<p>(And if we want to make it clear that we’re sorting in ascending order,\nwe can use <code style=\"color: inherit\">ASC</code> instead of <code style=\"color: inherit\">DESC</code>.)</p>\n<p>In order to look at which scientist measured quantities during each visit,\nwe can look again at the <code style=\"color: inherit\">Survey</code> table.\nWe can also sort on several fields at once.\nFor example,\nthis query sorts results first in ascending order by <code style=\"color: inherit\">taken</code>,\nand then in descending order by <code style=\"color: inherit\">person</code>\nwithin each group of equal <code style=\"color: inherit\">taken</code> values:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-29",
      "source": [
        "%%sql\n",
        "SELECT taken, person, quant FROM Survey ORDER BY taken ASC, person DESC;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-30",
      "source": "<p>This query gives us a good idea of which scientist was involved in which visit,\nand what measurements they performed during the visit.</p>\n<p>Looking at the table, it seems like some scientists specialized in\ncertain kinds of measurements.  We can examine which scientists\nperformed which measurements by selecting the appropriate columns and\nremoving duplicates.</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-31",
      "source": [
        "%%sql\n",
        "SELECT DISTINCT quant, person FROM Survey ORDER BY quant ASC;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-32",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-finding-distinct-dates\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Finding Distinct Dates</div>\n<p>Write a query that selects distinct dates from the <code style=\"color: inherit\">Visited</code> table.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-2\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-2\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT DISTINCT dated FROM Visited;\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>dated</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>1927-02-08</td>\n</tr>\n<tr>\n<td>1927-02-10</td>\n</tr>\n<tr>\n<td>1930-01-07</td>\n</tr>\n<tr>\n<td>1930-01-12</td>\n</tr>\n<tr>\n<td>1930-02-26</td>\n</tr>\n<tr>\n<td> </td>\n</tr>\n<tr>\n<td>1932-01-14</td>\n</tr>\n<tr>\n<td>1932-03-22</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-33",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-34",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-displaying-full-names\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Displaying Full Names</div>\n<p>Write a query that displays the full names of the scientists in the <code style=\"color: inherit\">Person</code> table,\nordered by family name.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-3\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-3\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT personal, family FROM Person ORDER BY family ASC;\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>personal</th>\n<th>family</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>Frank</td>\n<td>Danforth</td>\n</tr>\n<tr>\n<td>William</td>\n<td>Dyer</td>\n</tr>\n<tr>\n<td>Anderson</td>\n<td>Lake</td>\n</tr>\n<tr>\n<td>Frank</td>\n<td>Pabodie</td>\n</tr>\n<tr>\n<td>Valentina</td>\n<td>Roerich</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-35",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-36",
      "source": "<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-is-sorting-names-useful\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-is-sorting-names-useful\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Is sorting names useful?</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>If you are someone with a name which falls at the end of the alphabet, you’ve likely been\npenalised for this your entire life. Alphabetically sorting names should always be looked at\ncritically and through a lens to whether you are fairly reflecting everyone’s contributions,\nrather than just the default sort order.</p>\n<p>There are many options, either by some metric of contribution that everyone could agree on, or\nbetter, consider random sorting, like the GTN uses with our <a href=\"{% link hall-of-fame.md %}\">Hall of Fame</a>\npage where we intentionally order randomly to tell contributors that no one persons\ncontributions matter more than anothers.</p>\n<blockquote class=\"quote\">\n<p>The evidence provided in a variety of studies leaves no doubt that an\nalphabetical author ordering norm disadvantages researchers with\nlast names toward the end of the alphabet. There is furthermore con-\nvincing evidence that researchers are aware of this and that they\nreact strategically to such alphabetical discrimination, for example\nwith their choices of who to collaborate with. See {% cite Weber_2018 %} for more.</p>\n</blockquote>\n</blockquote>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-name-collation\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-name-collation\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Name collation</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>When you are sorting things in SQL, you need to be aware of something called collation which can\naffect your results if you have values that are not the letters A-Z. Collating is the process of\nsorting values, and this affects many human languages when storing data in a database.</p>\n<p>Here is a Dutch example. In the old days their alphabet contained a <code style=\"color: inherit\">ÿ</code> which was later replaced\nwith <code style=\"color: inherit\">ĳ</code>, a digraph of two characters squished together. This is commonly rendered as <code style=\"color: inherit\">ij</code>\nhowever, two separate characters, due to the internet and widespread use of keyboards featuring\nmainly ascii characters. However, it is still the 25th letter of their alphabet.</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">sqlite&gt; create table nl(value text);\nsqlite&gt; insert into nl values ('appel'), ('beer'), ('index'), ('ijs'), ('jammer'), ('winkel'), ('zon');\nsqlite&gt; select * from nl order by value;\nappel\nbeer\nindex\nijs\njammer\nwinkel\nzon\n</code></pre></div>  </div>\n<p>Find a dutch friend and ask them if this is the correct order for this list. Unfortunately it\nisn’t. Even though it is <code style=\"color: inherit\">ij</code> as two separate characters, it should be sorted as if it was <code style=\"color: inherit\">ĳ</code> or\n<code style=\"color: inherit\">ÿ</code>, before <code style=\"color: inherit\">z</code>. Like so: appel, beer, index, jammer, winkel, ijs, zon</p>\n<p>While there is not much you can do about it now (you’re just beginning!) it is something you\nshould be aware of. When you later need to know about this, you will find the term ‘collation’\nuseful, and you’ll find the procedure is different for every database engine.</p>\n</blockquote>\n<h1 id=\"filtering\">Filtering</h1>\n<p>One of the most powerful features of a database is\nthe ability to filter data,\ni.e.,\nto select only those records that match certain criteria.\nFor example,\nsuppose we want to see when a particular site was visited.\nWe can select these records from the <code style=\"color: inherit\">Visited</code> table\nby using a <code style=\"color: inherit\">WHERE</code> clause in our query:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-37",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE site = 'DR-1';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-38",
      "source": "<p>The database manager executes this query in two stages.\nFirst,\nit checks at each row in the <code style=\"color: inherit\">Visited</code> table\nto see which ones satisfy the <code style=\"color: inherit\">WHERE</code>.\nIt then uses the column names following the <code style=\"color: inherit\">SELECT</code> keyword\nto determine which columns to display.</p>\n<p>This processing order means that\nwe can filter records using <code style=\"color: inherit\">WHERE</code>\nbased on values in columns that aren’t then displayed:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-39",
      "source": [
        "%%sql\n",
        "SELECT id FROM Visited WHERE site = 'DR-1';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-40",
      "source": "<p>![SQL Filtering in Action]`(../../images/carpentries-sql/sql-filter.svg)</p>\n<p>We can use many other Boolean operators to filter our data.\nFor example,\nwe can ask for all information from the DR-1 site collected before 1930:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-41",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE site = 'DR-1' AND dated < '1930-01-01';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-42",
      "source": "<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-date-types\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-date-types\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Date Types</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>Most database managers have a special data type for dates.\nIn fact, many have two:\none for dates,\nsuch as “May 31, 1971”,\nand one for durations,\nsuch as “31 days”.\nSQLite doesn’t:\ninstead,\nit stores dates as either text\n(in the ISO-8601 standard format “YYYY-MM-DD HH:MM:SS.SSSS”),\nreal numbers\n(<a href=\"https://en.wikipedia.org/wiki/Julian_day\">Julian days</a>, the number of days since November 24, 4714 BCE),\nor integers\n(<a href=\"https://en.wikipedia.org/wiki/Unix_time\">Unix time</a>, the number of seconds since midnight, January 1, 1970).\nIf this sounds complicated,\nit is,\nbut not nearly as complicated as figuring out\n<a href=\"https://en.wikipedia.org/wiki/Swedish_calendar\">historical dates in Sweden</a>.</p>\n</blockquote>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-30-1930-or-2030\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-30-1930-or-2030\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: '30: 1930 or 2030?</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>Storing the year as the last two digits causes problems in databases, and is part of what caused\n<a href=\"https://en.wikipedia.org/wiki/Year_2000_problem\">Y2K</a>. Be sure to use the databases’ built in\nformat for storing dates, if it is available as that will generally avoid any major issues.</p>\n<p>Similarly there is a <a href=\"https://en.wikipedia.org/wiki/Year_2000_problem#Year_2038_problem\">“Year 2038 problem”</a>,\nas the timestamps mentioned above that count seconds since Jan 1, 1970 were running out of space\non 32-bit machines. Many systems have since migrated to work around this with 64-bit timestamps.</p>\n</blockquote>\n<p>If we want to find out what measurements were taken by either Lake or Roerich,\nwe can combine the tests on their names using <code style=\"color: inherit\">OR</code>:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-43",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE person = 'lake' OR person = 'roe';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-44",
      "source": "<p>Alternatively,\nwe can use <code style=\"color: inherit\">IN</code> to see if a value is in a specific set:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-45",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE person IN ('lake', 'roe');"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-46",
      "source": "<p>We can combine <code style=\"color: inherit\">AND</code> with <code style=\"color: inherit\">OR</code>,\nbut we need to be careful about which operator is executed first.\nIf we <em>don’t</em> use parentheses,\nwe get this:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-47",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE quant = 'sal' AND person = 'lake' OR person = 'roe';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-48",
      "source": "<p>which is salinity measurements by Lake,\nand <em>any</em> measurement by Roerich.\nWe probably want this instead:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-49",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE quant = 'sal' AND (person = 'lake' OR person = 'roe');"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-50",
      "source": "<p>We can also filter by partial matches.  For example, if we want to\nknow something just about the site names beginning with “DR” we can\nuse the <code style=\"color: inherit\">LIKE</code> keyword.  The percent symbol acts as a\nwildcard, matching any characters in that\nplace.  It can be used at the beginning, middle, or end of the string\n<a href=\"https://www.w3schools.com/sql/sql_wildcards.asp\">See this page on wildcards</a> for more information:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-51",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE site LIKE 'DR%';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-52",
      "source": "<p>Finally,\nwe can use <code style=\"color: inherit\">DISTINCT</code> with <code style=\"color: inherit\">WHERE</code>\nto give a second level of filtering:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-53",
      "source": [
        "%%sql\n",
        "SELECT DISTINCT person, quant FROM Survey WHERE person = 'lake' OR person = 'roe';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-54",
      "source": "<p>But remember:\n<code style=\"color: inherit\">DISTINCT</code> is applied to the values displayed in the chosen columns,\nnot to the entire rows as they are being processed.</p>\n<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-growing-queries\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-growing-queries\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Growing Queries</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>What we have just done is how most people “grow” their <abbr title=\"Structured Query Language\">SQL</abbr> queries.\nWe started with something simple that did part of what we wanted,\nthen added more clauses one by one,\ntesting their effects as we went.\nThis is a good strategy — in fact,\nfor complex queries it’s often the <em>only</em> strategy — but\nit depends on quick turnaround,\nand on us recognizing the right answer when we get it.</p>\n<p>The best way to achieve quick turnaround is often\nto put a subset of data in a temporary database\nand run our queries against that,\nor to fill a small database with synthesized records.\nFor example,\ninstead of trying our queries against an actual database of 20 million Australians,\nwe could run it against a sample of ten thousand,\nor write a small program to generate ten thousand random (but plausible) records\nand use that.</p>\n</blockquote>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-fix-this-query\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Fix This Query</div>\n<p>Suppose we want to select all sites that lie within 48 degrees of the equator.\nOur first query is:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Site WHERE (lat &gt; -48) OR (lat &lt; 48);\n</code></pre></div>  </div>\n<p>Explain why this is wrong,\nand rewrite the query so that it is correct.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-4\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-4\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>Because we used <code style=\"color: inherit\">OR</code>, a site on the South Pole for example will still meet\nthe second criteria and thus be included. Instead, we want to restrict this\nto sites that meet <em>both</em> criteria:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Site WHERE (lat &gt; -48) AND (lat &lt; 48);\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-55",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-56",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-finding-outliers\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Finding Outliers</div>\n<p>Normalized salinity readings are supposed to be between 0.0 and 1.0.\nWrite a query that selects all records from <code style=\"color: inherit\">Survey</code>\nwith salinity values outside this range.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-5\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-5\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Survey WHERE quant = 'sal' AND ((reading &gt; 1.0) OR (reading &lt; 0.0));\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>taken</th>\n<th>person</th>\n<th>quant</th>\n<th>reading</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>752</td>\n<td>roe</td>\n<td>sal</td>\n<td>41.6</td>\n</tr>\n<tr>\n<td>837</td>\n<td>roe</td>\n<td>sal</td>\n<td>22.5</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-57",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-58",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-matching-patterns\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question:  Matching Patterns</div>\n<p>Which of these expressions are true?</p>\n<ol>\n<li><code style=\"color: inherit\">'a' LIKE 'a'</code></li>\n<li><code style=\"color: inherit\">'a' LIKE '%a'</code></li>\n<li><code style=\"color: inherit\">'beta' LIKE '%a'</code></li>\n<li><code style=\"color: inherit\">'alpha' LIKE 'a%%'</code></li>\n<li><code style=\"color: inherit\">'alpha' LIKE 'a%p%'</code></li>\n</ol>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-6\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-6\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<ol>\n<li>True because these are the same character.</li>\n<li>True because the wildcard can match <em>zero</em> or more characters.</li>\n<li>True because the <code style=\"color: inherit\">%</code> matches <code style=\"color: inherit\">bet</code> and the <code style=\"color: inherit\">a</code> matches the <code style=\"color: inherit\">a</code>.</li>\n<li>True because the first wildcard matches <code style=\"color: inherit\">lpha</code> and the second wildcard matches zero characters (or vice versa).</li>\n<li>True because the first wildcard matches <code style=\"color: inherit\">l</code> and the second wildcard matches <code style=\"color: inherit\">ha</code>.</li>\n</ol>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-59",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-60",
      "source": "<blockquote class=\"tip\" style=\"border: 2px solid #FFE19E; margin: 1em 0.2em\">\n<div class=\"box-title tip-title\" id=\"tip-case-insensitive-matching\"><button class=\"gtn-boxify-button tip\" type=\"button\" aria-controls=\"tip-case-insensitive-matching\" aria-expanded=\"true\"><i class=\"far fa-lightbulb\" aria-hidden=\"true\" ></i> <span>Tip: Case-insensitive matching</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>But what about if you don’t care about if it’s <code style=\"color: inherit\">ALPHA</code> or <code style=\"color: inherit\">alpha</code> in the database, and you are\nusing a language that has a notion of case (unlike e.g. Chinese, Japenese)?</p>\n<p>Then you can use the <code style=\"color: inherit\">ILIKE</code> operator for ‘case Insensitive LIKE’.\nfor example the following are true:</p>\n<ul>\n<li><code style=\"color: inherit\">'a' ILIKE 'A'</code></li>\n<li><code style=\"color: inherit\">'AlPhA' ILIKE '%lpha'</code></li>\n</ul>\n</blockquote>\n<h1 id=\"calculating-new-values\">Calculating New Values</h1>\n<p>After carefully re-reading the expedition logs,\nwe realize that the radiation measurements they report\nmay need to be corrected upward by 5%.\nRather than modifying the stored data,\nwe can do this calculation on the fly\nas part of our query:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-61",
      "source": [
        "%%sql\n",
        "SELECT 1.05 * reading FROM Survey WHERE quant = 'rad';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-62",
      "source": "<p>When we run the query,\nthe expression <code style=\"color: inherit\">1.05 * reading</code> is evaluated for each row.\nExpressions can use any of the fields,\nall of usual arithmetic operators,\nand a variety of common functions.\n(Exactly which ones depends on which database manager is being used.)\nFor example,\nwe can convert temperature readings from Fahrenheit to Celsius\nand round to two decimal places:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-63",
      "source": [
        "%%sql\n",
        "SELECT taken, round(5 * (reading - 32) / 9, 2) FROM Survey WHERE quant = 'temp';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-64",
      "source": "<p>As you can see from this example, though, the string describing our\nnew field (generated from the equation) can become quite unwieldy. <abbr title=\"Structured Query Language\">SQL</abbr>\nallows us to rename our fields, any field for that matter, whether it\nwas calculated or one of the existing fields in our database, for\nsuccinctness and clarity. For example, we could write the previous\nquery as:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-65",
      "source": [
        "%%sql\n",
        "SELECT taken, round(5 * (reading - 32) / 9, 2) as Celsius FROM Survey WHERE quant = 'temp';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-66",
      "source": "<p>We can also combine values from different fields,\nfor example by using the string concatenation operator <code style=\"color: inherit\">||</code>:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-67",
      "source": [
        "%%sql\n",
        "SELECT personal || ' ' || family FROM Person;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-68",
      "source": "<p>But of course that can also be solved by simply having a single name field which avoids other\nissues.</p>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-fixing-salinity-readings\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question:  Fixing Salinity Readings</div>\n<p>After further reading,\nwe realize that Valentina Roerich\nwas reporting salinity as percentages.\nWrite a query that returns all of her salinity measurements\nfrom the <code style=\"color: inherit\">Survey</code> table\nwith the values divided by 100.</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-7\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-7\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal';\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>taken</th>\n<th>reading / 100</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>752</td>\n<td>0.416</td>\n</tr>\n<tr>\n<td>837</td>\n<td>0.225</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-69",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-70",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-unions\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Unions</div>\n<p>The <code style=\"color: inherit\">UNION</code> operator combines the results of two queries:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Person WHERE id = 'dyer' UNION SELECT * FROM Person WHERE id = 'roe';\n</code></pre></div>  </div>\n<table>\n<thead>\n<tr>\n<th>id</th>\n<th>personal</th>\n<th>family</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>dyer</td>\n<td>William</td>\n<td>Dyer</td>\n</tr>\n<tr>\n<td>roe</td>\n<td>Valentina</td>\n<td>Roerich</td>\n</tr>\n</tbody>\n</table>\n<p>The <code style=\"color: inherit\">UNION ALL</code> command is equivalent to the <code style=\"color: inherit\">UNION</code> operator,\nexcept that <code style=\"color: inherit\">UNION ALL</code> will select all values.\nThe difference is that <code style=\"color: inherit\">UNION ALL</code> will not eliminate duplicate rows.\nInstead, <code style=\"color: inherit\">UNION ALL</code> pulls all rows from the query\nspecifics and combines them into a table.\nThe <code style=\"color: inherit\">UNION</code> command does a <code style=\"color: inherit\">SELECT DISTINCT</code> on the results set.\nIf all the records to be returned are unique from your union,\nuse <code style=\"color: inherit\">UNION ALL</code> instead, it gives faster results since it skips the <code style=\"color: inherit\">DISTINCT</code> step.\nFor this section, we shall use UNION.</p>\n<p>Use <code style=\"color: inherit\">UNION</code> to create a consolidated list of salinity measurements\nin which Valentina Roerich’s, and only Valentina’s,\nhave been corrected as described in the previous challenge.\nThe output should be something like:</p>\n<table>\n<thead>\n<tr>\n<th>taken</th>\n<th>reading</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>619</td>\n<td>0.13</td>\n</tr>\n<tr>\n<td>622</td>\n<td>0.09</td>\n</tr>\n<tr>\n<td>734</td>\n<td>0.05</td>\n</tr>\n<tr>\n<td>751</td>\n<td>0.1</td>\n</tr>\n<tr>\n<td>752</td>\n<td>0.09</td>\n</tr>\n<tr>\n<td>752</td>\n<td>0.416</td>\n</tr>\n<tr>\n<td>837</td>\n<td>0.21</td>\n</tr>\n<tr>\n<td>837</td>\n<td>0.225</td>\n</tr>\n</tbody>\n</table>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-8\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-8\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT taken, reading FROM Survey WHERE person != 'roe' AND quant = 'sal' UNION SELECT taken, reading / 100 FROM Survey WHERE person = 'roe' AND quant = 'sal' ORDER BY taken ASC;\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-71",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-72",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-selecting-major-site-identifiers\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Selecting Major Site Identifiers</div>\n<p>The site identifiers in the <code style=\"color: inherit\">Visited</code> table have two parts\nseparated by a ‘-‘:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT DISTINCT site FROM Visited;\n</code></pre></div>  </div>\n<table>\n<thead>\n<tr>\n<th>site</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>DR-1</td>\n</tr>\n<tr>\n<td>DR-3</td>\n</tr>\n<tr>\n<td>MSK-4</td>\n</tr>\n</tbody>\n</table>\n<p>Some major site identifiers (i.e. the letter codes) are two letters long and some are three.\nThe “in string” function <code style=\"color: inherit\">instr(X, Y)</code>\nreturns the 1-based index of the first occurrence of string Y in string X,\nor 0 if Y does not exist in X.\nThe substring function <code style=\"color: inherit\">substr(X, I, [L])</code>\nreturns the substring of X starting at index I, with an optional length L.\nUse these two functions to produce a list of unique major site identifiers.\n(For this data,\nthe list should contain only “DR” and “MSK”).</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-9\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-9\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT DISTINCT substr(site, 1, instr(site, '-') - 1) AS MajorSite FROM Visited;\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-73",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-74",
      "source": "<h1 id=\"missing-data\">Missing Data</h1>\n<p>Real-world data is never complete — there are always holes.\nDatabases represent these holes using a special value called <code style=\"color: inherit\">null</code>.\n<code style=\"color: inherit\">null</code> is not zero, <code style=\"color: inherit\">False</code>, or the empty string;\nit is a one-of-a-kind value that means “nothing here”.\nDealing with <code style=\"color: inherit\">null</code> requires a few special tricks\nand some careful thinking.</p>\n<p>By default, the Python SQL interface does not display NULL values in its output, instead it shows <code style=\"color: inherit\">None</code>.</p>\n<p>To start,\nlet’s have a look at the <code style=\"color: inherit\">Visited</code> table.\nThere are eight records,\nbut #752 doesn’t have a date — or rather,\nits date is null:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-75",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-76",
      "source": "<p>Null doesn’t behave like other values.\nIf we select the records that come before 1930:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-77",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated < '1930-01-01';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-78",
      "source": "<p>we get two results,\nand if we select the ones that come during or after 1930:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-79",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated >= '1930-01-01';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-80",
      "source": "<p>we get five,\nbut record #752 isn’t in either set of results.\nThe reason is that\n<code class=\"language-plaintext highlighter-rouge\">null&lt;'1930-01-01'</code>\nis neither true nor false:\nnull means, “We don’t know,”\nand if we don’t know the value on the left side of a comparison,\nwe don’t know whether the comparison is true or false.\nSince databases represent “don’t know” as null,\nthe value of <code style=\"color: inherit\">null&lt;'1930-01-01'</code>\nis actually <code style=\"color: inherit\">null</code>.\n<code style=\"color: inherit\">null&gt;='1930-01-01'</code> is also null\nbecause we can’t answer to that question either.\nAnd since the only records kept by a <code style=\"color: inherit\">WHERE</code>\nare those for which the test is true,\nrecord #752 isn’t included in either set of results.</p>\n<p>Comparisons aren’t the only operations that behave this way with nulls.\n<code style=\"color: inherit\">1+null</code> is <code style=\"color: inherit\">null</code>,\n<code style=\"color: inherit\">5*null</code> is <code style=\"color: inherit\">null</code>,\n<code style=\"color: inherit\">log(null)</code> is <code style=\"color: inherit\">null</code>,\nand so on.\nIn particular,\ncomparing things to null with = and != produces null:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-81",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated = NULL;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-82",
      "source": "<p>produces no output, and neither does:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-83",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated != NULL;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-84",
      "source": "<p>To check whether a value is <code style=\"color: inherit\">null</code> or not,\nwe must use a special test <code style=\"color: inherit\">IS NULL</code>:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-85",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated IS NULL;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-86",
      "source": "<p>or its inverse <code style=\"color: inherit\">IS NOT NULL</code>:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-87",
      "source": [
        "%%sql\n",
        "SELECT * FROM Visited WHERE dated IS NOT NULL;"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-88",
      "source": "<p>Null values can cause headaches wherever they appear.\nFor example,\nsuppose we want to find all the salinity measurements\nthat weren’t taken by Lake.\nIt’s natural to write the query like this:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-89",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE quant = 'sal' AND person != 'lake';"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-90",
      "source": "<p>but this query filters omits the records\nwhere we don’t know who took the measurement.\nOnce again,\nthe reason is that when <code style=\"color: inherit\">person</code> is <code style=\"color: inherit\">null</code>,\nthe <code style=\"color: inherit\">!=</code> comparison produces <code style=\"color: inherit\">null</code>,\nso the record isn’t kept in our results.\nIf we want to keep these records\nwe need to add an explicit check:</p>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-91",
      "source": [
        "%%sql\n",
        "SELECT * FROM Survey WHERE quant = 'sal' AND (person != 'lake' OR person IS NULL);"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-92",
      "source": "<p>We still have to decide whether this is the right thing to do or not.\nIf we want to be absolutely sure that\nwe aren’t including any measurements by Lake in our results,\nwe need to exclude all the records for which we don’t know who did the work.</p>\n<p>In contrast to arithmetic or Boolean operators, aggregation functions\nthat combine multiple values, such as <code style=\"color: inherit\">min</code>, <code style=\"color: inherit\">max</code> or <code style=\"color: inherit\">avg</code>, <em>ignore</em>\n<code style=\"color: inherit\">null</code> values. In the majority of cases, this is a desirable output:\nfor example, unknown values are thus not affecting our data when we\nare averaging it. Aggregation functions will be addressed in more\ndetail in <a href=\"#\">the next section</a>.</p>\n<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-sorting-by-known-date\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Sorting by Known Date</div>\n<p>Write a query that sorts the records in <code style=\"color: inherit\">Visited</code> by date,\nomitting entries for which the date is not known\n(i.e., is null).</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-10\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-10\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Visited WHERE dated IS NOT NULL ORDER BY dated ASC;\n</code></pre></div>    </div>\n<table>\n<thead>\n<tr>\n<th>id</th>\n<th>site</th>\n<th>dated</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>619</td>\n<td>DR-1</td>\n<td>1927-02-08</td>\n</tr>\n<tr>\n<td>622</td>\n<td>DR-1</td>\n<td>1927-02-10</td>\n</tr>\n<tr>\n<td>734</td>\n<td>DR-3</td>\n<td>1930-01-07</td>\n</tr>\n<tr>\n<td>735</td>\n<td>DR-3</td>\n<td>1930-01-12</td>\n</tr>\n<tr>\n<td>751</td>\n<td>DR-3</td>\n<td>1930-02-26</td>\n</tr>\n<tr>\n<td>837</td>\n<td>MSK-4</td>\n<td>1932-01-14</td>\n</tr>\n<tr>\n<td>844</td>\n<td>DR-1</td>\n<td>1932-03-22</td>\n</tr>\n</tbody>\n</table>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-93",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-94",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-null-in-a-set\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: NULL in a Set</div>\n<p>What do you expect the following query to produce?</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Visited WHERE dated IN ('1927-02-08', NULL);\n</code></pre></div>  </div>\n<p>What does it actually produce?</p>\n<br/><details style=\"border: 2px solid #B8C3EA; margin: 1em 0.2em;padding: 0.5em; cursor: pointer;\"><summary>👁 View solution</summary>\n<div class=\"box-title solution-title\" id=\"solution-11\"><button class=\"gtn-boxify-button solution\" type=\"button\" aria-controls=\"solution-11\" aria-expanded=\"true\"><i class=\"far fa-eye\" aria-hidden=\"true\" ></i> <span>Solution</span><span class=\"fold-unfold fa fa-minus-square\"></span></button></div>\n<p>You might expect the above query to return rows where dated is either ‘1927-02-08’ or NULL.\nInstead it only returns rows where dated is ‘1927-02-08’, the same as you would get from this\nsimpler query:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Visited WHERE dated IN ('1927-02-08');\n</code></pre></div>    </div>\n<p>The reason is that the <code style=\"color: inherit\">IN</code> operator works with a set of <em>values</em>, but NULL is by definition\nnot a value and is therefore simply ignored.</p>\n<p>If we wanted to actually include NULL, we would have to rewrite the query to use the IS NULL condition:</p>\n<div class=\"language-plaintext highlighter-rouge\"><div><pre style=\"color: inherit; background: transparent\"><code style=\"color: inherit\">SELECT * FROM Visited WHERE dated = '1927-02-08' OR dated IS NULL;\n</code></pre></div>    </div>\n</details>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-95",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-96",
      "source": "<blockquote class=\"question\" style=\"border: 2px solid #8A9AD0; margin: 1em 0.2em\">\n<div class=\"box-title question-title\" id=\"question-pros-and-cons-of-sentinels\"><i class=\"far fa-question-circle\" aria-hidden=\"true\" ></i> Question: Pros and Cons of Sentinels</div>\n<p>Some database designers prefer to use\na sentinel value\nto mark missing data rather than <code style=\"color: inherit\">null</code>.\nFor example,\nthey will use the date “0000-00-00” to mark a missing date,\nor -1.0 to mark a missing salinity or radiation reading\n(since actual readings cannot be negative).\nWhat does this simplify?\nWhat burdens or risks does it introduce?</p>\n</blockquote>\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "id": "cell-97",
      "source": [
        "%%sql\n",
        "-- Try solutions here!"
      ],
      "cell_type": "code",
      "execution_count": null,
      "outputs": [],
      "metadata": {
        "attributes": {
          "classes": [
            "> <comment-title></comment-title>"
          ],
          "id": ""
        }
      }
    },
    {
      "id": "cell-98",
      "source": "\n",
      "cell_type": "markdown",
      "metadata": {
        "editable": false,
        "collapsed": false
      }
    },
    {
      "cell_type": "markdown",
      "id": "final-ending-cell",
      "metadata": {
        "editable": false,
        "collapsed": false
      },
      "source": [
        "# Key Points\n\n",
        "- A relational database stores information in tables, each of which has a fixed set of columns and a variable number of records.\n",
        "- A database manager is a program that manipulates information stored in a database.\n",
        "- We write queries in a specialized language called SQL to extract information from databases.\n",
        "- Use SELECT... FROM... to get values from a database table.\n",
        "- SQL is case-insensitive (but data is case-sensitive).\n",
        "- The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY.\n",
        "- The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.\n",
        "- Use WHERE to specify conditions that records must meet in order to be included in a query's results.\n",
        "- Use AND, OR, and NOT to combine tests.\n",
        "- Filtering is done on whole records, so conditions can use fields that are not actually displayed.\n",
        "- Write queries incrementally.\n",
        "- Queries can do the usual arithmetic operations on values.\n",
        "- Use UNION to combine the results of two or more queries.\n",
        "- Databases use a special value called NULL to represent missing information.\n",
        "- Almost all operations on NULL produce NULL.\n",
        "- Queries can test for NULLs using IS NULL and IS NOT NULL.\n",
        "\n# Congratulations on successfully completing this tutorial!\n\n",
        "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-basic/tutorial.html#feedback) and check there for further resources!\n"
      ]
    }
  ]
}