{ "metadata": {}, "nbformat": 4, "nbformat_minor": 5, "cells": [ { "id": "metadata", "cell_type": "markdown", "source": "
This tutorial will introduce you to accessing a SQL database from within Python. Experience with both SQL and Python is a pre-requisite.
\n\n\nComment\nThis tutorial is significantly based on the Carpentries Databases and SQL lesson, which is licensed CC-BY 4.0.
\nAbigail Cabunoc and Sheldon McKay (eds): “Software Carpentry: Using Databases and SQL.” Version 2017.08, August 2017,\ngithub.com/swcarpentry/sql-novice-survey, https://doi.org/10.5281/zenodo.838776
\nAdaptations have been made to make this work better in a GTN/Galaxy environment.
\n
\n\nAgenda\nIn this tutorial, we will cover:
\n
For this tutorial we need to download a database that we will use for the queries.
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-1", "source": [ "!wget -c http://swcarpentry.github.io/sql-novice-survey/files/survey.db" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ ">Let’s have a look at how to access a database from\na general-purpose programming language like Python.\nOther languages use almost exactly the same model:\nlibrary and function names may differ,\nbut the concepts are the same.
\nHere’s a short Python program that selects latitudes and longitudes\nfrom an SQLite database stored in a file called survey.db
:
The program starts by importing the sqlite3
library.\nIf we were connecting to MySQL, DB2, or some other database,\nwe would import a different library,\nbut all of them provide the same functions,\nso that the rest of our program does not have to change\n(at least, not much)\nif we switch from one database to another.
Line 2 establishes a connection to the database.\nSince we’re using SQLite,\nall we need to specify is the name of the database file.\nOther systems may require us to provide a username and password as well.\nLine 3 then uses this connection to create a cursor.\nJust like the cursor in an editor,\nits role is to keep track of where we are in the database.
\nOn line 4, we use that cursor to ask the database to execute a query for us.\nThe query is written in SQL,\nand passed to cursor.execute
as a string.\nIt’s our job to make sure that SQL is properly formatted;\nif it isn’t,\nor if something goes wrong when it is being executed,\nthe database will report an error.
The database returns the results of the query to us\nin response to the cursor.fetchall
call on line 5.\nThis result is a list with one entry for each record in the result set;\nif we loop over that list (line 6) and print those list entries (line 7),\nwe can see that each one is a tuple\nwith one element for each field we asked for.
Finally, lines 8 and 9 close our cursor and our connection,\nsince the database can only keep a limited number of these open at one time.\nSince establishing a connection takes time,\nthough,\nwe shouldn’t open a connection,\ndo one operation,\nthen close the connection,\nonly to reopen it a few microseconds later to do another operation.\nInstead,\nit’s normal to create one connection that stays open for the lifetime of the program.
\nQueries in real applications will often depend on values provided by users.\nFor example,\nthis function takes a user’s ID as a parameter and returns their name:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-5", "source": [ "import sqlite3\n", "\n", "def get_name(database_file, person_id):\n", " query = \"SELECT personal || ' ' || family FROM Person WHERE id='\" + person_id + \"';\"\n", "\n", " connection = sqlite3.connect(database_file)\n", " cursor = connection.cursor()\n", " cursor.execute(query)\n", " results = cursor.fetchall()\n", " cursor.close()\n", " connection.close()\n", "\n", " return results[0][0]\n", "\n", "print(\"Full name for dyer:\", get_name('survey.db', 'dyer'))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ ">We use string concatenation on the first line of this function\nto construct a query containing the user ID we have been given.\nThis seems simple enough,\nbut what happens if someone gives us this string as input?
\ndyer'; DROP TABLE Survey; SELECT '\n
It looks like there’s garbage after the user’s ID,\nbut it is very carefully chosen garbage.\nIf we insert this string into our query,\nthe result is:
\nSELECT personal || ' ' || family FROM Person WHERE id='dyer'; DROP TABLE Survey; SELECT '';\n
If we execute this,\nit will erase one of the tables in our database.
\nThis is called an SQL injection attack,\nand it has been used to attack thousands of programs over the years.\nIn particular,\nmany web sites that take data from users insert values directly into queries\nwithout checking them carefully first.\nA very relevant XKCD that explains the\ndangers of using raw input in queries a little more succinctly:
\n\nSince a villain might try to smuggle commands into our queries in many different ways,\nthe safest way to deal with this threat is\nto replace characters like quotes with their escaped equivalents,\nso that we can safely put whatever the user gives us inside a string.\nWe can do this by using a prepared statement\ninstead of formatting our statements as strings.\nHere’s what our example program looks like if we do this:
\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "id": "cell-7", "source": [ "import sqlite3\n", "\n", "def get_name(database_file, person_id):\n", " query = \"SELECT personal || ' ' || family FROM Person WHERE id=?;\"\n", "\n", " connection = sqlite3.connect(database_file)\n", " cursor = connection.cursor()\n", " cursor.execute(query, [person_id])\n", " results = cursor.fetchall()\n", " cursor.close()\n", " connection.close()\n", "\n", " return results[0][0]\n", "\n", "print(\"Full name for dyer:\", get_name('survey.db', 'dyer'))" ], "cell_type": "code", "execution_count": null, "outputs": [], "metadata": { "attributes": { "classes": [ ">The key changes are in the query string and the execute
call.\nInstead of formatting the query ourselves,\nwe put question marks in the query template where we want to insert values.\nWhen we call execute
,\nwe provide a list\nthat contains as many values as there are question marks in the query.\nThe library matches values to question marks in order,\nand translates any special characters in the values\ninto their escaped equivalents\nso that they are safe to use.
We can also use sqlite3
’s cursor to make changes to our database,\nsuch as inserting a new name.\nFor instance, we can define a new function called add_name
like so:
Note that in versions of sqlite3 >= 2.5, the get_name
function described\nabove will fail with an IndexError: list index out of range
,\neven though we added Mary’s\nentry into the table using add_name
.\nThis is because we must perform a connection.commit()
before closing\nthe connection, in order to save our changes to the database.
\n\nQuestion: Filling a Table vs. Printing Values\nWrite a Python program that creates a new database in a file called\n
\noriginal.db
containing a single table calledPressure
, with a\nsingle field calledreading
, and inserts 100,000 random numbers\nbetween 10.0 and 25.0. How long does it take this program to run?\nHow long does it take to run a program that simply writes those\nrandom numbers to a file?\n👁 View solution
\n\n\nimport sqlite3\n# import random number generator\nfrom numpy.random import uniform\n\nrandom_numbers = uniform(low=10.0, high=25.0, size=100000)\n\nconnection = sqlite3.connect(\"original.db\")\ncursor = connection.cursor()\ncursor.execute(\"CREATE TABLE Pressure (reading float not null)\")\nquery = \"INSERT INTO Pressure (reading) VALUES (?);\"\n\nfor number in random_numbers:\n cursor.execute(query, [number])\n\ncursor.close()\nconnection.commit() # save changes to file for next exercise\nconnection.close()\n
For comparison, the following program writes the random numbers\ninto the file
\nrandom_numbers.txt
:\nfrom numpy.random import uniform\n\nrandom_numbers = uniform(low=10.0, high=25.0, size=100000)\nwith open('random_numbers.txt', 'w') as outfile:\n for number in random_numbers:\n # need to add linebreak \\n\n outfile.write(\"{}\\n\".format(number))\n
\n\n\nQuestion: Filtering in SQL vs. Filtering in Python\nWrite a Python program that creates a new database called\n
\nbackup.db
with the same structure asoriginal.db
and copies all\nthe values greater than 20.0 fromoriginal.db
tobackup.db
.\nWhich is faster: filtering values in the query, or reading\neverything into memory and filtering in Python?👁 View solution
\n\nThe first example reads all the data into memory and filters the\nnumbers using the if statement in Python.
\n\nimport sqlite3\n\nconnection_original = sqlite3.connect(\"original.db\")\ncursor_original = connection_original.cursor()\ncursor_original.execute(\"SELECT * FROM Pressure;\")\nresults = cursor_original.fetchall()\ncursor_original.close()\nconnection_original.close()\n\nconnection_backup = sqlite3.connect(\"backup.db\")\ncursor_backup = connection_backup.cursor()\ncursor_backup.execute(\"CREATE TABLE Pressure (reading float not null)\")\nquery = \"INSERT INTO Pressure (reading) VALUES (?);\"\n\nfor entry in results:\n # number is saved in first column of the table\n if entry[0] > 20.0:\n cursor_backup.execute(query, entry)\n\ncursor_backup.close()\nconnection_backup.commit()\nconnection_backup.close()\n
In contrast the following example uses the conditional
\nSELECT
statement\nto filter the numbers in SQL.\nThe only lines that changed are in line 5, where the values are fetched\nfrom
original.db
and the for loop starting in line 15 used to insert\nthe numbers intobackup.db
.\nNote how this version does not require the use of Python’s if statement.\nimport sqlite3\n\nconnection_original = sqlite3.connect(\"original.db\")\ncursor_original = connection_original.cursor()\ncursor_original.execute(\"SELECT * FROM Pressure WHERE reading > 20.0;\")\nresults = cursor_original.fetchall()\ncursor_original.close()\nconnection_original.close()\n\nconnection_backup = sqlite3.connect(\"backup.db\")\ncursor_backup = connection_backup.cursor()\ncursor_backup.execute(\"CREATE TABLE Pressure (reading float not null)\")\nquery = \"INSERT INTO Pressure (reading) VALUES (?);\"\n\nfor entry in results:\n cursor_backup.execute(query, entry)\n\ncursor_backup.close()\nconnection_backup.commit()\nconnection_backup.close()\n
\n\n", "cell_type": "markdown", "metadata": { "editable": false, "collapsed": false } }, { "cell_type": "markdown", "id": "final-ending-cell", "metadata": { "editable": false, "collapsed": false }, "source": [ "# Key Points\n\n", "- General-purpose languages have libraries for accessing databases.\n", "- To connect to a database, a program must use a library specific to that database manager.\n", "- These libraries use a connection-and-cursor model.\n", "- Programs can read query results in batches or all at once.\n", "- Queries should be written using parameter substitution, not string formatting.\n", "\n# Congratulations on successfully completing this tutorial!\n\n", "Please [fill out the feedback on the GTN website](https://training.galaxyproject.org/training-material/topics/data-science/tutorials/sql-python/tutorial.html#feedback) and check there for further resources!\n" ] } ] }Question: Generating Insert Statements\nOne of our colleagues has sent us a\nCSV\nfile containing\ntemperature readings by Robert Olmstead, which is formatted like\nthis:
\n\nTaken,Temp\n619,-21.5\n622,-15.5\n
Write a small Python program that reads this file in and prints out\nthe SQL
\nINSERT
statements needed to add these records to the\nsurvey database. Note: you will need to add an entry for Olmstead\nto thePerson
table. If you are testing your program repeatedly,\nyou may want to investigate SQL’sINSERT or REPLACE
command.