Data Science Applications to Astronomy

Week 5: Data Wrangling:

Databases & Queries

Databases

Simplest form

  • Store data

  • Retrieve data

  • Commonly implemented as a set of tables

    • Columns contain different fields (e.g., ID, magnitude, RA, Dec)

    • Rows contain entries (e.g., 51 Pegasi, Kepler-10, HAT-P-13,... )

Value-added features

  • Return subset of data efficiently

  • Many strategies for how to filter data (e.g., order of operations)

  • Database server can use heuristics to pick good strategy

  • Allow for transactions to update database

Fundamental properties of databases

  • Atomicity: All part of a transaction succeed, or the database is rolled back to its previous state

  • Consistency: Data in database always satisfies its validation rules

  • Isolation: Even if multiple transactions are made concurrently, there is no interference between transactions

  • Durability: Once a transaction is committed, it will remain committed

SQL-based Database Servers

Open-source:

  • MySQL

  • PostgreSQL & Greenplum

Commercial:

  • Microsoft SQL Server:

  • IBM DB2

  • Oracle Database

  • ...

Continuing innovation in database systems

  • SciDB (array & disk based database)

  • MonetDB (column store)

  • JuliaDB (pure Julia, for persistent data)

When selecting a database for a project, consider:

  • How much data is to be stored?

  • How frequent/large will transactions be?

  • Are there specific hardware or OS requirements?

  • Does the team have someone dedicated to supporting database?

Database Clients

  • One database server many clients simultaneously

  • Different clients can use different interfaces

    • Command line

    • Webpage

    • URL-based

    • Custom Graphical user interface (GUI)

      • TopCat

Queries

Query: An expression that requests database to return a specific subset of data.

Query languages:

  • Structured Query Language (SQL): Dated, but by far the most common

  • Astronomical Data Query Language (ADQL): Astronomy-specific

  • Language Integrated Query (LINQ): Microsoft-supported

  • Many more

SQL essentials

  • Selecting (columns)

  • Filtering (for rows)

  • Joining (multiple tables)

  • Aggregating (rows within a table)

SQL programming

  • Variables

  • Functions

  • Procedures

  • Data management

  • Transactions

Virtual Observatory (VO)

Defines standards that help astronomers to collaborate effectively, emphasizing working with multiple data sources.

  • Astronomical Data Query Language (ADQL)

  • Table Access Protocol (TAP)

Astronomy-specific functions in ADQL

  • AREA

  • BOX

  • CENTROID

  • CIRCLE

  • CONTAINS

  • COORD1

  • COORD2

  • COORDSYS

  • DISTANCE

  • INTERSECTS

  • POINT

  • POLYGON

  • REGION

Example Bad SQL Query

Do not send

select * from SomeTable

Why?

Example SQL Queries

Take a quick peak two columns data for first few entries

select top 10 X, Y from SomeTable 

Find extreme values of X

select top 10 X, Y 
from SomeTable 
order by X

Filter which rows are returned using expression

select top 10 x,y
from SomeTable 
where x*x+y*y between 0 and 1
order by x 

Check how many rows are in a table

select COUNT(designation) as N from gaiadr2.gaia_source

Check how many rows satisfy a filter

select COUNT(designation) as n, AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave
from gaiadr2.gaia_source
where phot_g_mean_mag < 14

Grouping data to make a histogram

select COUNT(designation) as N, 
       AVG(astrometric_n_good_obs_al) as astrometric_n_good_obs_al_ave,  
       AVG(phot_g_mean_mag) as phot_g_mean_mag_ave,
       ROUND(phot_g_mean_mag,1) as bin
from gaiadr2.gaia_source
where phot_g_mean_mag < 14
group by bin 
order by bin

Table Access Protocol (TAP)

Start with ADQL

SELECT <column list> FROM <table> WHERE <constraints>

but transform it into a url, by

  1. Prepend a base service url

  2. convert spaces to +'s

  3. Deal with other special characters (e.g., +, quotes)

  4. Optionally, specify format for results

e.g.,

https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,pl_masse,ra,dec+from+ps
url_ex1 = make_tap_query_url(nexsci_query_base_url, "ps", select_cols="pl_name,gaia_id,sy_kepmag,ra,dec", where="default_flag=1")
"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name,gaia_id,sy_kepmag,ra,dec+from+ps+where+default_flag=1&format=tsv"
df_ex1 = query_to_df(url_ex1)
pl_namegaia_idsy_kepmagradec
1"OGLE-TR-10 b""Gaia DR2 4056443366649948160"missing267.868-29.8765
2"BD-08 2823 c""Gaia DR2 3770419611540574080"missing150.197-9.51657
3"HD 110014 b""Gaia DR2 3676091134604409728"missing189.811-7.99567
4"GJ 667 C f""Gaia DR2 5975663354131618304"missing259.751-34.9978
5"GJ 667 C g""Gaia DR2 5975663354131618304"missing259.751-34.9978
6"GJ 667 C c""Gaia DR2 5975663354131618304"missing259.751-34.9978
7"Kepler-227 b""Gaia DR2 2129273849959814656"15.347291.93448.1417
8"HATS-61 b""Gaia DR2 4890849134501995392"missing61.6578-25.3497
9"HATS-63 b""Gaia DR2 4891362198412001408"missing67.4186-28.1972
10"HD 108236 c""Gaia DR2 6125644402384918784"missing186.574-51.3631
...
5876"Kepler-592 b""Gaia DR2 2131920370147385472"14.665285.79148.6832
row_to_study = 7
7
desig = replace_spaces_for_tap(df_ex1.gaia_id[row_to_study])
"Gaia+DR2+2129273849959814656"
url_ex2 = make_tap_query_url(gaia_query_base_url, "gaiadr2.gaia_source", where="designation='$(desig)'",select_cols="*",max_rows=5,format="csv")
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+5+*+from+gaiadr2.gaia_source+where+designation='Gaia+DR2+2129273849959814656'&format=csv"
df_ex2 = query_to_df(url_ex2)
solution_iddesignationsource_idrandom_indexref_epochrara_errordec...
11635721458409799680"Gaia DR2 2129273849959814656"21292738499598146562576794302015.5291.9340.021894948.1417

Joins

Joining tables is a fundamental concept that can be applied either to DataFrames stored locally or as part of SQL/ADQL queries.

  • innerjoin & semijoin: Return rows for values of the key that exist in both tables

  • outerjoin: Return rows for values of the key that exist in either table

  • leftjoin: Return rows for values of the key that exist in first table

  • rightjoin: Return rows for values of the key that exist in second table

  • antijoin: Return rows Return rows for values of the key that exist in first table but not the second table

  • crossjoin: Return table with every row from first table as rows and every row from second table as columns

Examples

Join on Gaia ID

df_ex3_inner = innerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_namegaia_idsy_kepmagradecsolution_idsource_idrandom_index...
1"Kepler-227 b""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
2"Kepler-227 c""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
names(df_ex3_inner)
100-element Vector{String}:
 "pl_name"
 "gaia_id"
 "sy_kepmag"
 "ra"
 "dec"
 "solution_id"
 "source_id"
 ⋮
 "radius_percentile_upper"
 "lum_val"
 "lum_percentile_lower"
 "lum_percentile_upper"
 "datalink_url"
 "epoch_photometry_url"
Tip

Originally, both tables contained columns named ra and dec. The joined table contains columns ra and ra_1 (and dec and dec_1) because we set makeunique.

df_ex3_left = leftjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_namegaia_idsy_kepmagradecsolution_idsource_idrandom_index...
1"Kepler-227 b""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
2"Kepler-227 c""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
3"OGLE-TR-10 b""Gaia DR2 4056443366649948160"missing267.868-29.8765missingmissingmissing
4"BD-08 2823 c""Gaia DR2 3770419611540574080"missing150.197-9.51657missingmissingmissing
5"HD 110014 b""Gaia DR2 3676091134604409728"missing189.811-7.99567missingmissingmissing
6"GJ 667 C f""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
7"GJ 667 C g""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
8"GJ 667 C c""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
9"HATS-61 b""Gaia DR2 4890849134501995392"missing61.6578-25.3497missingmissingmissing
10"HATS-63 b""Gaia DR2 4891362198412001408"missing67.4186-28.1972missingmissingmissing
...
5876"Kepler-592 b""Gaia DR2 2131920370147385472"14.665285.79148.6832missingmissingmissing
df_ex3_right = rightjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_namegaia_idsy_kepmagradecsolution_idsource_idrandom_index...
1"Kepler-227 b""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
2"Kepler-227 c""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
try
    df_ex3_outer_fails = outerjoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
catch ex
    warning_box(ex.msg)
end
Warning:
"matchmissing == :notequal for `outerjoin` is not allowed"
df_ex3_outer = outerjoin(filter(r->!ismissing(r.gaia_id),df_ex1),df_ex2, on=:gaia_id=>:designation, makeunique=true )
pl_namegaia_idsy_kepmagradecsolution_idsource_idrandom_index...
1"Kepler-227 b""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
2"Kepler-227 c""Gaia DR2 2129273849959814656"15.347291.93448.141716357214584097996802129273849959814656257679430
3"OGLE-TR-10 b""Gaia DR2 4056443366649948160"missing267.868-29.8765missingmissingmissing
4"BD-08 2823 c""Gaia DR2 3770419611540574080"missing150.197-9.51657missingmissingmissing
5"HD 110014 b""Gaia DR2 3676091134604409728"missing189.811-7.99567missingmissingmissing
6"GJ 667 C f""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
7"GJ 667 C g""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
8"GJ 667 C c""Gaia DR2 5975663354131618304"missing259.751-34.9978missingmissingmissing
9"HATS-61 b""Gaia DR2 4890849134501995392"missing61.6578-25.3497missingmissingmissing
10"HATS-63 b""Gaia DR2 4891362198412001408"missing67.4186-28.1972missingmissingmissing
...
5578"Kepler-592 b""Gaia DR2 2131920370147385472"14.665285.79148.6832missingmissingmissing
df_ex3_anti = antijoin(df_ex1,df_ex2, on=:gaia_id=>:designation, matchmissing=:notequal, makeunique=true )
pl_namegaia_idsy_kepmagradec
1"OGLE-TR-10 b""Gaia DR2 4056443366649948160"missing267.868-29.8765
2"BD-08 2823 c""Gaia DR2 3770419611540574080"missing150.197-9.51657
3"HD 110014 b""Gaia DR2 3676091134604409728"missing189.811-7.99567
4"GJ 667 C f""Gaia DR2 5975663354131618304"missing259.751-34.9978
5"GJ 667 C g""Gaia DR2 5975663354131618304"missing259.751-34.9978
6"GJ 667 C c""Gaia DR2 5975663354131618304"missing259.751-34.9978
7"HATS-61 b""Gaia DR2 4890849134501995392"missing61.6578-25.3497
8"HATS-63 b""Gaia DR2 4891362198412001408"missing67.4186-28.1972
9"HD 108236 c""Gaia DR2 6125644402384918784"missing186.574-51.3631
10"TOI-1224 b""Gaia DR2 4620009665047355520"missing37.1978-80.8991
...
5874"Kepler-592 b""Gaia DR2 2131920370147385472"14.665285.79148.6832
ncol(df_ex1), ncol(df_ex2), ncol(df_ex3_inner), ncol(df_ex3_outer), ncol(df_ex3_left), ncol(df_ex3_anti)
(5, 96, 100, 100, 100, 5)
nrow(df_ex1), nrow(df_ex2)
(5876, 1)
nrow(df_ex3_inner), nrow(df_ex3_outer), nrow(filter(r->!ismissing(r.gaia_id),df_ex1))
(2, 5578, 5578)
nrow(df_ex3_left), nrow(df_ex3_right)
(5876, 2)

Find by position

targetpos = (; ra = df_ex1.ra[1], dec = df_ex1.dec[1] )
(ra = 267.8677483, dec = -29.8764758)
url_ex4 = make_tap_query_url(gaia_query_base_url, "gaiadr3.gaia_source", where="1=contains(POINT($(targetpos.ra),$(targetpos.dec)),CIRCLE(ra,dec,30./3600.))", select_cols="*,DISTANCE(POINT($(targetpos.ra),$(targetpos.dec)),POINT(ra,dec))+AS+ang_sep",order_by_cols="ang_sep",max_rows=100, format="csv")
"https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY=select+top+100+*,DISTANCE(POINT(267.8677483,-29.8764758),POINT(ra,dec))+AS+ang_sep+from+gaiadr3.gaia_source+where+1=contains(POINT(267.8677483,-29.8764758),CIRCLE(ra,dec,30./3600.))+order+by+ang_sep&format=csv"
df_ex4 = query_to_df(url_ex4)
solution_iddesignationsource_idrandom_indexref_epochrara_errordec...
11636148068921376768"Gaia DR3 4056443366649948160"40564433666499481606289559802016.0267.8680.0484435-29.8765
21636148068921376768"Gaia DR3 4056443366695974528"405644336669597452817832485862016.0267.8670.301148-29.8763
31636148068921376768"Gaia DR3 4056443366694853888"4056443366694853888821669642016.0267.8680.597072-29.8772
41636148068921376768"Gaia DR3 4056443366696002944"40564433666960029445868435312016.0267.8690.238754-29.8765
51636148068921376768"Gaia DR3 4056443370989465856"405644337098946585613705840522016.0267.8699.60017-29.8771
61636148068921376768"Gaia DR3 4056443366695975424"405644336669597542410324502512016.0267.8690.207094-29.8754
71636148068921376768"Gaia DR3 4056443366694961152"405644336669496115211197087632016.0267.8670.596232-29.8779
81636148068921376768"Gaia DR3 4056443370989465728"40564433709894657285893627462016.0267.8695.98055-29.8774
91636148068921376768"Gaia DR3 4056443366694892160"40564433666948921607813242192016.0267.8690.40115-29.8777
101636148068921376768"Gaia DR3 4056443370989472384"405644337098947238413723040202016.0267.8660.953113-29.8758
...
1001636148068921376768"Gaia DR3 4056443401055771008"405644340105577100811951520382016.0267.8710.937059-29.8725

Which row is the best match?

sort(df_ex4[!,[:designation,:ang_sep,:phot_g_mean_mag] ], :ang_sep)
designationang_sepphot_g_mean_mag
1"Gaia DR3 4056443366649948160"7.49048e-715.669
2"Gaia DR3 4056443366695974528"0.00058880318.363
3"Gaia DR3 4056443366694853888"0.00068645318.3986
4"Gaia DR3 4056443366696002944"0.0011667218.6382
5"Gaia DR3 4056443370989465856"0.0011805920.2377
6"Gaia DR3 4056443366695975424"0.0013887718.5433
7"Gaia DR3 4056443366694961152"0.0014848819.7284
8"Gaia DR3 4056443370989465728"0.0015260420.2813
9"Gaia DR3 4056443366694892160"0.0015331319.1686
10"Gaia DR3 4056443370989472384"0.0017083218.3691
...
100"Gaia DR3 4056443401055771008"0.0049574119.686
df_ex1.sy_kepmag[row_to_study]
15.347

Questions?

Setup & Helper Code

begin
    nexsci_query_base_url = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query="
    gaia_query_base_url = 
    "https://gea.esac.esa.int/tap-server/tap/sync?REQUEST=doQuery&LANG=ADQL&FORMAT=csv&QUERY="
end;

select_cols_for_tap(cols)

Returns a string of comma-separated columns names from a vector of columns names (as either strings or symbols), for using in a TAP query.

Built with Julia 1.11.5 and

CSV 0.10.15
DataFrames 1.7.0
HTTP 1.10.15
PlutoTeachingTools 0.3.1
PlutoUI 0.7.61
Query 1.0.0

To run this tutorial locally, download this file and open it with Pluto.jl.