What is a database?#

A database is an organised collection of (usually structured) data.

We have created a database with two tables, Star and Planet, which store information about stars and their exoplanets, sampled from the NASA exoplanet archive.

The following schematic visualises the organisation in these tables. They each have a set of attributes that identify and characterise the astronomical object.

For reference, we’ve listed the physical descriptions of all the attributes above so that later on you can come back to this slide if you’re unsure what they mean.

Attribute

Description

Kepler ID

Unique target identification number for stars

KOI name

String identifier for the Kepler Object of Interest (KOI)

Teff (K)

Effective temperature of a star in Kelvin

Radius

Radius of stars and planets in units of solar radius/earth radius respectively

Kepler name

Unique string identifier for a confirmed exoplanet in the planet table

Period

Orbital period of a planet in units of days

Status

Status of a discovered KOI in the planet table, e.g. “confirmed” or “false positive”

Teq

Equilibrium temperature of a planet in Kelvin

Kepler is a mission that measures the brightness variation of about 200,000 stars. You can read more about this groundbreaking survey and access its data products at https://archive.stsci.edu/missions-and-data/kepler

import sqlalchemy
engine_name = "sqlite:///foo.db"
sqlalchemy.create_engine(engine_name)
Engine(sqlite:///foo.db)
%load_ext sql
%sql sqlite:///foo.db

Databases and SQL#

SQL (Structured Query Language) is the most widely used database query language. SQL is designed especially for data management in relational databases. In this course, we’ll use the SQLite implementation of SQL (Grok exercises use Postgresql, but it shouldn’t matter for thess exercises).

Let’s use SQL to query the database from the previous slide. Every query takes one or more tables as arguments and returns a new table with the requested data. A simple example is shown below:

%%sql

SELECT * FROM Planet;
UsageError: Cell magic `%%sql` not found.

This query returns all the entries in the Planet table. The SELECT keyword starts the query. The next part of the statement defines the attributes to be returned, here a * means all attributes are to be included. Lastly, the FROM clause specifies which table(s) to query. Every SQL statement ends with a semicolon.

Replace Planet above with Star to look at the other table.

Cases in SQL

SQL keywords are case insensitive, i.e. Select or select have the same effect than the capitalised version.

However, comparisons are not necessarily case insensitive.

SQL queries#

In the previous example, we requested everything from the Planet table. Now let’s try extracting specific information.

  1. Specifying the attributes (or columns) Rather than retrieving all (*) attributes from a table, we can select specific columns right after the SELECT keyword:

SELECT koi_name, radius FROM Planet;

This query returns only the KOI names and the radii of the planets in our Planet table.

  1. Formulating conditions SQL queries are typically conditional – you only request data which meets certain conditions. In simple cases, we can make a conditional query using the WHERE clause with comparisons like:

SELECT koi_name, radius FROM Planet
WHERE radius < 2;

The < operator returns a true or false (Boolean) for each planet radius. This query returns all the planets which have a radius smaller than two (measured in earth radii).

Exercise: Large stars#

Write an SQL query to find the radius and temperature of the stars in the Star table that are larger than our sun.

The output of your query should look like this:

+--------+-------+
| radius | t_eff |
+--------+-------+
|  3.523 |  6335 |
|  1.965 |  8782 |
|  1.032 |  6319 |
| 27.384 |  3789 |
+--------+-------+
(4 rows)

Hint

The radii in these tables are specified in units of the radius of our sun, e.g. a radius of 2 means a star is twice as big as the sun.

%%sql

# Write your SQL query here.
UsageError: Cell magic `%%sql` not found.

Conditional queries#

As you’ve seen previously, we can use comparison operators like < and > for conditional queries. In general, SQL includes a standard set of Boolean operators, e.g. AND or NOT.

To experiment with the different operators, we can use them directly in the SELECT statement like this:

SELECT 2 > 3;
SELECT NOT 2 > 3;
SELECT 2 = 3;

The output of every query, regardless of whether a table is queried or not, is a table. The t and f here stand for true and false.

The numerical comparison operators also work on strings, which are declared using single quotation marks:

SELECT 'abc' < 'abcd';
SELECT 'abc' = 'ABC';

Note that while SQL syntax is case insensitive, strings are not. If you’re unsure what capitalisation is used in an attribute, you can use the UPPER or LOWER function to convert the attribute and then perform the comparison:

SELECT UPPER('aBc') = 'ABC';
SELECT LOWER('aBc') = 'abc';

Combining conditions#

We can also combine conditions in a WHERE clause using for example the logical and:

SELECT 0 < 1 AND 1 < 2;
+----------+
| ?column? |
+----------+
| t        |
+----------+
(1 row)

This allows us to, for example, perform range queries over a set of data. Suppose we want to find all the stars in our Star table whose radius lies between one and two earth radii, we could write:

SELECT radius FROM Star
WHERE radius >= 1 AND radius <= 2;

Since range queries are quite common, SQL provides its own syntax for these queries using the BETWEEN statement. With this, we can rewrite the query from above in a more readable form as:

SELECT radius FROM  Star
WHERE radius BETWEEN 1 AND 2;

BETWEEN comparisons are inclusive on both end points, so the two queries above are equivalent.

Exercise: A range of hot stars#

Your task is to write a range query which returns the kepler_id and the t_eff attributes of all those stars in the Star table whose temperature lies between 5000 and 6000 Kelvin (inclusive).

Your results on the example table should look like this:

+-----------+-------+
| kepler_id | t_eff |
+-----------+-------+
|   3836450 |  5160 |
|   6590362 |  5926 |
|   8106973 |  5810 |
+-----------+-------+
(3 rows)
+-----------+-------+
| kepler_id | t_eff |
+-----------+-------+
|   3836450 |  5160 |
|   6590362 |  5926 |
|   8106973 |  5810 |
+-----------+-------+
(3 rows)
%%sql

# Write your SQL query here

Looking at a database structure#

To query a table, we need to know what columns it contains. If we haven’t got any information about the table, we can use the \d statement to get a description of the table’s columns:

\d Planet;
              Table "public.planet"
+-------------+-----------------------+-----------+
|   Column    |         Type          | Modifiers |
+-------------+-----------------------+-----------+
| kepler_id   | integer               | not null  |
| koi_name    | character varying(20) | not null  |
| kepler_name | character varying(20) |           |
| status      | character varying(20) |           |
| period      | double precision      |           |
| radius      | real                  |           |
| t_eq        | integer               |           |
+-------------+-----------------------+-----------+
Indexes:
    "planet_pkey" PRIMARY KEY, btree (koi_name)

\d returns a description each column: its name, data type and optional modifiers. The data types above are mostly types that we’ve seen before. The type character varying(20) indicates that this column can hold up to 20 characters. We will have a closer look at these types in later activities.

Some columns have a not null modifier. This means that when adding data to this table, these attributes must be specified. Other attributes can be left blank. The concept of a NULL value in SQL is quite important, so we will have a closer look at that.

NULL values in SQL#

In our Planet table, not all exoplanets have a Kepler name because they aren’t all confirmed planets. In SQL, missing values are represented by NULL values, which are equivalent to an empty field.

A NULL is not equivalent to an empty string:

SELECT  '' = NULL; 
+----------+
| ?column? |
+----------+
|          |
+----------+
(1 row)

What’s this? The query returns an empty field, or, to be precise, it returns a NULL value.

This is a peculiarity of SQL databases – comparison operators cannot return true or false here since the NULL value is unknown. If we want to find out whether an attribute is actually a NULL value, we have to use the IS comparison:

SELECT NULL IS NULL;
+----------+
| ?column? |
+----------+
| t        |
+----------+
(1 row)

With this, you can also check that the first query with the = operator actually returns a NULL value:

SELECT ('' = NULL) IS NULL;

Exercise: Confirmed Exoplanets#

In this question you should write a query to find the kepler_name and radius of each planet in the Planet table which is a confirmed exoplanet, meaning that their kepler_name is not NULL, or, equivalently, whose status is 'CONFIRMED'.

Restrict your results to those planets whose radius lies between one and three earth radii, and remember that the radius of the planets is relative to the earth radius.

Your query should produce a table like this:

+--------------+--------+
| kepler_name  | radius |
+--------------+--------+
| Kepler-10 b  |   1.45 |
| Kepler-106 c |   2.35 |
| Kepler-52 d  |    1.8 |
| Kepler-239 b |   2.36 |
| Kepler-239 c |   2.19 |
+--------------+--------+
(5 rows)
%%sql

# Write your SQL query here
UsageError: Cell magic `%%sql` not found.

Counting table entries#

As databases increase in size, printing out a full table of all the requested data might not be instructive anymore or even feasible. In these cases, we can in principle make the query more specific to limit the results, but if we don’t know a priori what results to expect, we need a better way of characterising the size and statistics of the data and to limit the overall results.

As opposed to returning the requested data in table format, we can use the COUNT function to only return the number of rows in the table we requested. To count, for example, all planets in the Planet table, we can use it like this:

SELECT COUNT(*) FROM Planet;

From here on, we’re going to use larger tables; have a look at how many stars and planets there are now.

We can combine counting with all of the other SQL statements that we have seen before. It effectively replaces the attributes that you request right after the SELECT statement. Counting only the confirmed planets would for example work like this:

SELECT COUNT(*) FROM Planet
WHERE kepler_name IS NOT NULL;

Data statistics and aggregate functions#

The COUNT function that we’ve seen on the last slide belongs to SQL’s aggregate functions. An aggregate function returns a single value which was calculated based on one or more columns.

SQL offers a range of aggregate functions which are very valuable for extracting some statistical information from databases. Knowing what data, or what range of data to expect can help to make a query more specific. A few useful functions are for example:

Function

Returns

COUNT

Number of rows returned

MIN

Minimum value of a column

MAX

Maximum value of a column

SUM

Sum over the column

AVG

Mean (average) value of a column

STDDEV

Standard deviation of a column

Let’s have a look at a few examples:

SELECT MIN(radius), MAX(radius), AVG(radius)
FROM Planet;

We could also calculate an average ourselves using the division operator / in the SELECT clause and compare it to the built-in AVG:

SELECT SUM(t_eff)/COUNT(*), AVG(t_eff)
FROM Star;

Look at the two results this query returns. Why are they different? The reason here is that the attribute t_eff is an integer and SQL then automatically performs integer division upon the / operator.

Integer division

Keep in mind that SQL performs integer division on integers, i.e. only an integer value is returned. For example:

  SELECT 1/2, 1/2.0;

Sometimes this might not be what you want so keep an eye out.

Ordering results and limiting output#

The rows in a table may not be sorted in a useful order. In our tables, they appear in the order in which the stars and planets were added to the database.

We can use an ORDER BY clause to change the sort order. To sort the planets by their radius in descending order, we can write:

SELECT koi_name, radius FROM Planet
ORDER BY radius DESC;

The DESC stands for descending order; for ascending order we use ASC or put nothing.

In addition to sorting, we can restrict the number of rows returned using the LIMIT clause. If we are only interested in the five largest planets we could write:

SELECT koi_name, radius FROM Planet
ORDER BY radius DESC
LIMIT 5;

This limits the results to the first five rows (after sorting). The ability to limit results becomes especially valuable in large databases, where the data volume can severely impact performance.

Exercise: Planet statistics#

Let’s analyse the size of the unconfirmed exoplanets.

Your task is to write a query that calculates the:

  • minimum radius;

  • maximum radius;

  • average radius; and

  • standard deviation of the radii

of unconfirmed planets (with a NULL value in kepler_name) in the Planet table. It should look like this on the example data:

+------+---------+------------------+------------------+
| min  |   max   |       avg        |      stddev      |
+------+---------+------------------+------------------+
| 0.65 | 3462.25 | 275.517333333333 | 888.709924452829 |
+------+---------+------------------+------------------+
(1 row)

On a side note …

Remember what you’ve learned about the mean as a statistic. Is this a good choice for the planet data we have here?

%%sql

# Write your SQL query here.

Managing data subsets#

What if we want to work with subsets of the table simultaneously?

But first of all, what do we mean by working with different subsets simultaneously? Imagine the following problem: we want to find out if there are multiple planets within the Planet table that have the same size, as measured by their radius. How can we do that? We can write a simple query in which we order the results such that we can see multiple occurences of the same radius:

SELECT radius FROM Planet
ORDER BY radius asc;

But this is not really useful to look at. Much better would be, if we could count the radii which occur multiple times in the table.

For situations like this, we can use the GROUP BY clause, which arranges identical data in groups on which we can perform aggregate functions, like COUNT. To solve our problem from above, we can query the table like this:

SELECT radius, COUNT(koi_name)
FROM Planet
GROUP BY radius;

This query returns a table which is grouped by the planet radii, and contains the radii and the number of koi_names belonging to each radius. We decided to count the koi_name attribute since this attribute is unique for each planet.

This solution still produces a somewhat lenghty table, but it’s a lot better than our first approach. Let’s have a look at how we can refine this.

Limited counting#

If we want to restrict the results from the last slide to only those where the count is larger than one, we can add a condition for this value. Previously, we’ve added conditions using the WHERE clause, but if we try the following:

SELECT radius, COUNT(koi_name)
FROM Planet
WHERE COUNT(koi_name) > 1
GROUP BY radius;

The query fails because WHERE gets processed before the ``GROUP BY clause and does therefore not have access to aggregate functions.

In order to use aggregate functions in our conditions, we require the HAVING statement instead of WHERE, which gets called after the GROUP BY clause:

SELECT radius, COUNT(koi_name)
FROM Planet
GROUP BY radius
HAVING COUNT(koi_name) > 1;
+--------+-------+
| radius | count |
+--------+-------+
|    3.1 |     2 |
|   2.71 |     2 |
|   3.07 |     2 |
|   3.62 |     3 |
|    2.7 |     3 |
|   2.33 |     2 |
|   2.72 |     2 |
+--------+-------+
(7 rows)

Other attributes can still be used in WHERE conditions, just as we saw before:

SELECT radius, COUNT(koi_name) 
FROM Planet 
WHERE t_eq BETWEEN 500 AND 1000
GROUP BY radius
HAVING COUNT(koi_name) > 1;

Exercise: Planets in multi-planet systems#

You need to find out how many planets in the Planet database are in a multi-planet system. Planets sharing the same star will have the same kepler_id, but different koi_name values.

Your query should return a table in which each row contains the kepler_id of the star and the number of planets orbiting that star (i.e. that share this kepler_id).

Limit your results to counts above one and order the rows in descending order based on the number of planets.

The results should look like this:

+-----------+-------+
| kepler_id | count |
+-----------+-------+
|   4139816 |     4 |
|   8395660 |     4 |
|  10910878 |     3 |
|  10872983 |     3 |
|   5358241 |     3 |
|  10601284 |     3 |
|   3832474 |     3 |
|  11754553 |     3 |
|   9579641 |     3 |
|  12366084 |     2 |
|  11018648 |     2 |
|  11918099 |     2 |
|  12066335 |     2 |
|  10187017 |     2 |
|  10797460 |     2 |
|   3342970 |     2 |
|  12470844 |     2 |
|   5456651 |     2 |
+-----------+-------+
(18 rows)

Hint

You will need to group the results based on the kepler_id attribute.

%%sql

# Write your SQL query here

Congratulations, you’ve finished this set of activities.