Data Science Applications to Astronomy
Week 5: Data Wrangling:
Databases & Queries
Databases
Simplest form
Store data
Retrieve data
Commonly implemented as a set of tables
Columns contain different fields (e.g., ID, magnitude, RA, Dec)
Rows contain entries (e.g., 51 Pegasi, Kepler-10, HAT-P-13,... )
Value-added features
Return subset of data efficiently
Many strategies for how to filter data (e.g., order of operations)
Database server can use heuristics to pick good strategy
Allow for transactions to update database
Fundamental properties of databases
Atomicity: All part of a transaction succeed, or the database is rolled back to its previous state
Consistency: Data in database always satisfies its validation rules
Isolation: Even if multiple transactions are made concurrently, there is no interference between transactions
Durability: Once a transaction is committed, it will remain committed
SQL-based Database Servers
Open-source:
MySQL
PostgreSQL & Greenplum
Commercial:
Microsoft SQL Server:
IBM DB2
Oracle Database
...
Continuing innovation in database systems
SciDB (array & disk based database)
MonetDB (column store)
JuliaDB (pure Julia, for persistent data)
When selecting a database for a project, consider:
How much data is to be stored?
How frequent/large will transactions be?
Are there specific hardware or OS requirements?
Does the team have someone dedicated to supporting database?
Database Clients
One database server many clients simultaneously
Different clients can use different interfaces
Command line
Webpage
URL-based
Custom Graphical user interface (GUI)
TopCat
Queries
Query: An expression that requests database to return a specific subset of data.
Query languages:
Structured Query Language (SQL): Dated, but by far the most common
Astronomical Data Query Language (ADQL): Astronomy-specific
Language Integrated Query (LINQ): Microsoft-supported
Many more
SQL essentials
Selecting (columns)
Filtering (for rows)
Joining (multiple tables)
Aggregating (rows within a table)
SQL programming
Variables
Functions
Procedures
Data management
Transactions
Virtual Observatory (VO)
Defines standards that help astronomers to collaborate effectively, emphasizing working with multiple data sources.
Astronomical Data Query Language (ADQL)
Table Access Protocol (TAP)
Astronomy-specific functions in ADQL
AREA
BOX
CENTROID
CIRCLE
CONTAINS
COORD1
COORD2
COORDSYS
DISTANCE
INTERSECTS
POINT
POLYGON
REGION
Example Bad SQL Query
Do not send
select * from SomeTableWhy?
Example SQL Queries
Take a quick peak two columns data for first few entries
select top 10 X, Y from SomeTable Find extreme values of X
select top 10 X, Y
from SomeTable
order by XFilter which rows are returned using expression
select top 10 x,y
from SomeTable
where x*x+y*y between 0 and 1
order by x Check how many rows are in a table
select COUNT(designation) as N from gaiadr2.gaia_sourceCheck how many rows satisfy a filter
select COUNT(designation) as n, AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave
from gaiadr2.gaia_source
where phot_g_mean_mag < 14Grouping data to make a histogram
select COUNT(designation) as N,
AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave,
AVG(phot_g_mean_mag) as phot_g_mean_mag_ave,
ROUND(phot_g_mean_mag,1) as bin
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
group by bin
order by binTable Access Protocol (TAP)
Start with ADQL
SELECT <column list> FROM <table> WHERE <constraints>but transform it into a url, by
Prepend a base service url
convert spaces to
+'sDeal with other special characters (e.g., +, quotes)
Optionally, specify format for results
e.g.,
https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,pl_masse,ra,dec+from+psurl_ex1 = make_tap_query_url(nexsci_query_base_url, "ps", select_cols="pl_name,gaia_id,sy_kepmag,ra,dec", where="default_flag=1")
"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,gaia_id,sy_kepmag,ra,dec+from+ps+where+default_flag=1&format=tsv"
df_ex1 = query_to_df(url_ex1)
| pl_name | gaia_id | sy_kepmag | ra | dec | |
|---|---|---|---|---|---|
| 1 | "OGLE-TR-10 b" | "Gaia DR2 4056443366649948160" | missing | 267.868 | -29.8765 |
| 2 | "BD-08 2823 c" | "Gaia DR2 3770419611540574080" | missing | 150.197 | -9.51657 |
| 3 | "HD 110014 b" | "Gaia DR2 3676091134604409728" | missing | 189.811 | -7.99567 |
| 4 | "GJ 667 C f" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 5 | "GJ 667 C g" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 6 | "GJ 667 C c" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 7 | "Kepler-227 b" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 |
| 8 | "HATS-61 b" | "Gaia DR2 4890849134501995392" | missing | 61.6578 | -25.3497 |
| 9 | "HATS-63 b" | "Gaia DR2 4891362198412001408" | missing | 67.4186 | -28.1972 |
| 10 | "HD 108236 c" | "Gaia DR2 6125644402384918784" | missing | 186.574 | -51.3631 |
| ... | |||||
| 5876 | "Kepler-592 b" | "Gaia DR2 2131920370147385472" | 14.665 | 285.791 | 48.6832 |
row_to_study = 7
7
desig = replace_spaces_for_tap(df_ex1.gaia_id[row_to_study])
"Gaia+DR2+2129273849959814656"
url_ex2 = make_tap_query_url(gaia_query_base_url, "gaiadr2.gaia_source", where="designation='$(desig)'",select_cols="*",max_rows=5,format="csv")
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+5+*+from+gaiadr2.gaia_source+where+designation='Gaia+DR2+2129273849959814656'&format=csv"
df_ex2 = query_to_df(url_ex2)
| solution_id | designation | source_id | random_index | ref_epoch | ra | ra_error | dec | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1635721458409799680 | "Gaia DR2 2129273849959814656" | 2129273849959814656 | 257679430 | 2015.5 | 291.934 | 0.0218949 | 48.1417 |
Joins
Joining tables is a fundamental concept that can be applied either to DataFrames stored locally or as part of SQL/ADQL queries.
innerjoin&semijoin: Return rows for values of the key that exist in both tablesouterjoin: Return rows for values of the key that exist in either tableleftjoin: Return rows for values of the key that exist in first tablerightjoin: Return rows for values of the key that exist in second table
antijoin: Return rows Return rows for values of the key that exist in first table but not the second table
crossjoin: Return table with every row from first table as rows and every row from second table as columns
Examples
Join on Gaia ID
df_ex3_inner = innerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
| pl_name | gaia_id | sy_kepmag | ra | dec | solution_id | source_id | random_index | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | "Kepler-227 b" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 2 | "Kepler-227 c" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 |
names(df_ex3_inner)
100-element Vector{String}:
"pl_name"
"gaia_id"
"sy_kepmag"
"ra"
"dec"
"solution_id"
"source_id"
⋮
"radius_percentile_upper"
"lum_val"
"lum_percentile_lower"
"lum_percentile_upper"
"datalink_url"
"epoch_photometry_url"
Originally, both tables contained columns named ra and dec. The joined table contains columns ra and ra_1 (and dec and dec_1) because we set makeunique.
df_ex3_left = leftjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
| pl_name | gaia_id | sy_kepmag | ra | dec | solution_id | source_id | random_index | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | "Kepler-227 b" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 2 | "Kepler-227 c" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 3 | "OGLE-TR-10 b" | "Gaia DR2 4056443366649948160" | missing | 267.868 | -29.8765 | missing | missing | missing | |
| 4 | "BD-08 2823 c" | "Gaia DR2 3770419611540574080" | missing | 150.197 | -9.51657 | missing | missing | missing | |
| 5 | "HD 110014 b" | "Gaia DR2 3676091134604409728" | missing | 189.811 | -7.99567 | missing | missing | missing | |
| 6 | "GJ 667 C f" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 7 | "GJ 667 C g" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 8 | "GJ 667 C c" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 9 | "HATS-61 b" | "Gaia DR2 4890849134501995392" | missing | 61.6578 | -25.3497 | missing | missing | missing | |
| 10 | "HATS-63 b" | "Gaia DR2 4891362198412001408" | missing | 67.4186 | -28.1972 | missing | missing | missing | |
| ... | |||||||||
| 5876 | "Kepler-592 b" | "Gaia DR2 2131920370147385472" | 14.665 | 285.791 | 48.6832 | missing | missing | missing |
df_ex3_right = rightjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
| pl_name | gaia_id | sy_kepmag | ra | dec | solution_id | source_id | random_index | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | "Kepler-227 b" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 2 | "Kepler-227 c" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 |
try
df_ex3_outer_fails = outerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
catch ex
warning_box(ex.msg)
end
df_ex3_outer = outerjoin(filter(r->!ismissing(r.gaia_id),df_ex1),df_ex2, on=:gaia_id=>:designation, makeunique=true )
| pl_name | gaia_id | sy_kepmag | ra | dec | solution_id | source_id | random_index | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | "Kepler-227 b" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 2 | "Kepler-227 c" | "Gaia DR2 2129273849959814656" | 15.347 | 291.934 | 48.1417 | 1635721458409799680 | 2129273849959814656 | 257679430 | |
| 3 | "OGLE-TR-10 b" | "Gaia DR2 4056443366649948160" | missing | 267.868 | -29.8765 | missing | missing | missing | |
| 4 | "BD-08 2823 c" | "Gaia DR2 3770419611540574080" | missing | 150.197 | -9.51657 | missing | missing | missing | |
| 5 | "HD 110014 b" | "Gaia DR2 3676091134604409728" | missing | 189.811 | -7.99567 | missing | missing | missing | |
| 6 | "GJ 667 C f" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 7 | "GJ 667 C g" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 8 | "GJ 667 C c" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 | missing | missing | missing | |
| 9 | "HATS-61 b" | "Gaia DR2 4890849134501995392" | missing | 61.6578 | -25.3497 | missing | missing | missing | |
| 10 | "HATS-63 b" | "Gaia DR2 4891362198412001408" | missing | 67.4186 | -28.1972 | missing | missing | missing | |
| ... | |||||||||
| 5578 | "Kepler-592 b" | "Gaia DR2 2131920370147385472" | 14.665 | 285.791 | 48.6832 | missing | missing | missing |
df_ex3_anti = antijoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
| pl_name | gaia_id | sy_kepmag | ra | dec | |
|---|---|---|---|---|---|
| 1 | "OGLE-TR-10 b" | "Gaia DR2 4056443366649948160" | missing | 267.868 | -29.8765 |
| 2 | "BD-08 2823 c" | "Gaia DR2 3770419611540574080" | missing | 150.197 | -9.51657 |
| 3 | "HD 110014 b" | "Gaia DR2 3676091134604409728" | missing | 189.811 | -7.99567 |
| 4 | "GJ 667 C f" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 5 | "GJ 667 C g" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 6 | "GJ 667 C c" | "Gaia DR2 5975663354131618304" | missing | 259.751 | -34.9978 |
| 7 | "HATS-61 b" | "Gaia DR2 4890849134501995392" | missing | 61.6578 | -25.3497 |
| 8 | "HATS-63 b" | "Gaia DR2 4891362198412001408" | missing | 67.4186 | -28.1972 |
| 9 | "HD 108236 c" | "Gaia DR2 6125644402384918784" | missing | 186.574 | -51.3631 |
| 10 | "TOI-1224 b" | "Gaia DR2 4620009665047355520" | missing | 37.1978 | -80.8991 |
| ... | |||||
| 5874 | "Kepler-592 b" | "Gaia DR2 2131920370147385472" | 14.665 | 285.791 | 48.6832 |
ncol(df_ex1), ncol(df_ex2), ncol(df_ex3_inner), ncol(df_ex3_outer), ncol(df_ex3_left), ncol(df_ex3_anti)
(5, 96, 100, 100, 100, 5)
nrow(df_ex1), nrow(df_ex2)
(5876, 1)
nrow(df_ex3_inner), nrow(df_ex3_outer), nrow(filter(r->!ismissing(r.gaia_id),df_ex1))
(2, 5578, 5578)
nrow(df_ex3_left), nrow(df_ex3_right)
(5876, 2)
Find by position
targetpos = (; ra = df_ex1.ra[1], dec = df_ex1.dec[1] )
(ra = 267.8677483, dec = -29.8764758)
url_ex4 = make_tap_query_url(gaia_query_base_url, "gaiadr3.gaia_source", where="1=contains(POINT($(targetpos.ra),$(targetpos.dec)),CIRCLE(ra,dec,30./3600.))", select_cols="*,DISTANCE(POINT($(targetpos.ra),$(targetpos.dec)),POINT(ra,dec))+AS+ang_sep",order_by_cols="ang_sep",max_rows=100, format="csv")
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+100+*,DISTANCE(POINT(267.8677483,-29.8764758),POINT(ra,dec))+AS+ang_sep+from+gaiadr3.gaia_source+where+1=contains(POINT(267.8677483,-29.8764758),CIRCLE(ra,dec,30./3600.))+order+by+ang_sep&format=csv"
df_ex4 = query_to_df(url_ex4)
| solution_id | designation | source_id | random_index | ref_epoch | ra | ra_error | dec | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1636148068921376768 | "Gaia DR3 4056443366649948160" | 4056443366649948160 | 628955980 | 2016.0 | 267.868 | 0.0484435 | -29.8765 | |
| 2 | 1636148068921376768 | "Gaia DR3 4056443366695974528" | 4056443366695974528 | 1783248586 | 2016.0 | 267.867 | 0.301148 | -29.8763 | |
| 3 | 1636148068921376768 | "Gaia DR3 4056443366694853888" | 4056443366694853888 | 82166964 | 2016.0 | 267.868 | 0.597072 | -29.8772 | |
| 4 | 1636148068921376768 | "Gaia DR3 4056443366696002944" | 4056443366696002944 | 586843531 | 2016.0 | 267.869 | 0.238754 | -29.8765 | |
| 5 | 1636148068921376768 | "Gaia DR3 4056443370989465856" | 4056443370989465856 | 1370584052 | 2016.0 | 267.869 | 9.60017 | -29.8771 | |
| 6 | 1636148068921376768 | "Gaia DR3 4056443366695975424" | 4056443366695975424 | 1032450251 | 2016.0 | 267.869 | 0.207094 | -29.8754 | |
| 7 | 1636148068921376768 | "Gaia DR3 4056443366694961152" | 4056443366694961152 | 1119708763 | 2016.0 | 267.867 | 0.596232 | -29.8779 | |
| 8 | 1636148068921376768 | "Gaia DR3 4056443370989465728" | 4056443370989465728 | 589362746 | 2016.0 | 267.869 | 5.98055 | -29.8774 | |
| 9 | 1636148068921376768 | "Gaia DR3 4056443366694892160" | 4056443366694892160 | 781324219 | 2016.0 | 267.869 | 0.40115 | -29.8777 | |
| 10 | 1636148068921376768 | "Gaia DR3 4056443370989472384" | 4056443370989472384 | 1372304020 | 2016.0 | 267.866 | 0.953113 | -29.8758 | |
| ... | |||||||||
| 100 | 1636148068921376768 | "Gaia DR3 4056443401055771008" | 4056443401055771008 | 1195152038 | 2016.0 | 267.871 | 0.937059 | -29.8725 |
Which row is the best match?
sort(df_ex4[!,[:designation,:ang_sep,:phot_g_mean_mag] ], :ang_sep)
| designation | ang_sep | phot_g_mean_mag | |
|---|---|---|---|
| 1 | "Gaia DR3 4056443366649948160" | 7.49048e-7 | 15.669 |
| 2 | "Gaia DR3 4056443366695974528" | 0.000588803 | 18.363 |
| 3 | "Gaia DR3 4056443366694853888" | 0.000686453 | 18.3986 |
| 4 | "Gaia DR3 4056443366696002944" | 0.00116672 | 18.6382 |
| 5 | "Gaia DR3 4056443370989465856" | 0.00118059 | 20.2377 |
| 6 | "Gaia DR3 4056443366695975424" | 0.00138877 | 18.5433 |
| 7 | "Gaia DR3 4056443366694961152" | 0.00148488 | 19.7284 |
| 8 | "Gaia DR3 4056443370989465728" | 0.00152604 | 20.2813 |
| 9 | "Gaia DR3 4056443366694892160" | 0.00153313 | 19.1686 |
| 10 | "Gaia DR3 4056443370989472384" | 0.00170832 | 18.3691 |
| ... | |||
| 100 | "Gaia DR3 4056443401055771008" | 0.00495741 | 19.686 |
df_ex1.sy_kepmag[row_to_study]
15.347
Questions?
Setup & Helper Code
begin
nexsci_query_base_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query="
gaia_query_base_url =
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY="
end;
select_cols_for_tap(cols)
Returns a string of comma-separated columns names from a vector of columns names (as either strings or symbols), for using in a TAP query.
Built with Julia 1.11.5 and
CSV 0.10.15DataFrames 1.7.0
HTTP 1.10.15
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61
Query 1.0.0
To run this tutorial locally, download this file and open it with Pluto.jl.