LVRA:r0b¶
Note that this documentation is mostly aimed at developers who will have to maintain or further develop the code base.
Feature making logic¶
The code involved in making the features for r0b is split between the main pypeline/r0b_feature_maker.py module and the utils/features.py utility module.
pypeline/r0b_feature_maker.py contains operational logic and some feature engineering/cleaning:
* [Operational]: loading settings, connecting to the database, querying the database to know which files (stems) have yet to be processed, logging, etc.
* [Feature engineering/cleaning]: calculating new quantities (delta MJD between first detection and current observation), dropping unwanted columns
The utils/features.py module does all the dirty work of data cleaning/formatting, as well as the feature engineering on the full lightcurve of each alert.
To help visualise the journey of the data, I present below the tables of data at different stages of the feature making process, referring to them by the variable names used in the code. This page is mostly a helper for developers learning the code base, so they can visualise what the pandas dataframes look like at each stage.
The data from the JSON file¶
As mentioned in the Pipeline Overview, the data we get from the kafka stream has a partially nested structure.
There are a number of features we get directly from our Lasair filter (see query in Appendix A of the Pipeline Overview), such as the diaObjectId, latestR, ebv, etc…
There is also a field called alerts which contains the full alert packet from Rubin* (except image cut-outs).
If you are not familiar with the structure of these data I recommend checking my Lasair Kafka tutorial notebook (Section 6: Full Packet Data).
In the utility function lvra.utils.features.json2cleandf(), the json_df variable contains the FULL JSON file loaded into a dataframe,
and it looks something like this [NOTE - you can scroll sideways]:
| | diaObjectId | lastDiaSourceMjdTai | latestR | nDiaSources | ebv | ra | decl | tns_name | absMag | absMagMJD | firstDiaSourceMjdTai | separationArcsec | direct_distance | distance | z | photoZ | photoZErr | physical_separation_kpc | sherlock_classifications | raErr | decErr | ra_dec_Cov | UTC | alerts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 169760231711572988 | 61029.4 | 0.838478 | 2 | 0.103305 | 223.024 | -41.4592 | nan | nan | nan | 61029.3 | nan | nan | nan | nan | nan | nan | nan | ORPHAN | 2.14003e-06 | 2.82303e-06 | 4.63027e-13 | 2026-02-04 11:51:46 | NaN |
| 1 | 169355603258900842 | 61069.1 | 0.996714 | 33 | 0.00812477 | 9.22276 | -45.5966 | nan | nan | nan | 60937.3 | nan | nan | nan | nan | nan | nan | nan | ORPHAN | 1.37756e-05 | 2.42808e-05 | -8.48339e-11 | 2026-02-04 11:53:03 | {‘diaObject’: {‘diaObjectId’: 1697602317115729… |
Here I’ve given two example alerts: one where the alerts field (the Rubin alert packet) is empty, and one where it is populated with a nested dictionary (you can’t see the whole thing because it’s massive).
The alert packets should not be missing, but it has happened in the past during commissioning, so the code is built to be robust to this:
it lets the processing continue, ignores the affected event (a.k.a. diaObject), and records the diaObjectId of any missing packet in the logs.
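As a rough illustration of that behaviour (not the actual implementation), the sketch below separates rows with a missing packet from usable ones and logs the affected diaObjectId values. The function name split_missing_packets, the logger name and the toy dataframe are all hypothetical; only the alerts and diaObjectId column names come from the table above.

```python
import logging

import numpy as np
import pandas as pd

logger = logging.getLogger("r0b_example")  # hypothetical logger name


def split_missing_packets(json_df: pd.DataFrame):
    """Return (rows with a packet, diaObjectIds whose packet is missing)."""
    has_packet = json_df["alerts"].notna()
    missing_ids = json_df.loc[~has_packet, "diaObjectId"].tolist()
    for dia_object_id in missing_ids:
        # Record the skipped event so it can be followed up later.
        logger.warning("Missing alert packet for diaObjectId %s", dia_object_id)
    return json_df[has_packet], missing_ids


# Toy stand-in for json_df: one empty packet, one (heavily shortened) packet.
json_df = pd.DataFrame(
    {
        "diaObjectId": [169760231711572988, 169355603258900842],
        "alerts": [np.nan, {"diaObject": {}}],
    }
)
usable_df, missing_ids = split_missing_packets(json_df)
```

Only usable_df would then go on to the lightcurve feature extraction.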
The Lasair filter features¶
We can easily recover the columns that we selected in our Lasair filter query by saving all the columns except the last one (alerts) to a dataframe,
filterOutput_df = json_df.iloc[:,:-1], which we will later join with the other features from the lightcurve. I am not providing a table because it is the same as above, save for the last column being dropped.
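For anyone less familiar with positional slicing, here is a tiny self-contained sketch on a toy dataframe (the values are made up; only the column names come from the table above). It also shows that dropping the column by name gives the same result, which is an alternative I am suggesting here, not what the code does.

```python
import pandas as pd

# Toy stand-in for json_df, with the alerts column in last position.
json_df = pd.DataFrame(
    {
        "diaObjectId": [1, 2],
        "latestR": [0.84, 0.99],
        "alerts": [None, {"diaObject": {}}],
    }
)

# Positional slice: every row, every column except the last one.
filterOutput_df = json_df.iloc[:, :-1]

# Equivalent here, and independent of the column order.
by_name_df = json_df.drop(columns="alerts")
```

The positional version relies on alerts always being the last column of the JSON, which is worth keeping in mind if the Lasair filter query changes.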
The features associated with the latest lightcurve point¶
To access all of the lightcurve features provided by Rubin (e.g. SNR, PSF-fitting related fields, etc.),
we need to access the nested dictionaries in the alerts column, specifically the diaSourcesList table, which contains
the packet data for lightcurve points.
The first kind of feature we want to recover is all of the data about the most recent source.
Below I show a truncated example of the dataframe containing the lightcurve features.
If you want to see the full list of columns, go to the Rubin schema explorer; here is a link to the diaSource table schema.
In the json2cleandf() function, the dataframe with this format is called latestSourceIds_df because it is constructed such that each row
gives us the values for the last lightcurve point (diaSourceId) for a given astrophysical event (diaObjectId).
| | diaSourceId | visit | detector | diaObjectId | ssObjectId | parentDiaSourceId | midpointMjdTai | ra | raErr | decErr | ra_dec_Cov | x | xErr | y | yErr | centroid_flag | apFlux | apFluxErr | apFlux_flag | apFlux_flag_apertureTruncated | isNegative | snr | psfFlux | psfFluxErr | … | decl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 169562325818802242 | 2025110400302 | 183 | 169562325818802242 | nan | 0 | 60984.2 | 10.1346 | 3.25122e-05 | 3.94188e-05 | -6.12409e-10 | 2451.23 | 0.917773 | 2333.35 | 0.539348 | False | 1246.3 | 1128.77 | False | False | False | 6.38317 | 3336.28 | 526.842 | … | -42.4024 |
| 1 | 169584317122478660 | 2025110900310 | 178 | 169562325818802242 | nan | 0 | 60989.2 | 10.1347 | 1.41809e-05 | 1.91613e-05 | -3.45086e-13 | 3742.17 | 0.345846 | 2336.37 | 0.345686 | False | 2743.54 | 485.996 | False | False | False | 10.5507 | 2424.67 | 228.346 | … | -42.4024 |
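To make the "latest lightcurve point" idea concrete, here is a minimal sketch of picking the most recent source out of one alert packet. It assumes that diaSourcesList is a list of dictionaries (one per diaSource) sitting at the top level of the packet, and that "latest" means the largest midpointMjdTai; the helper name latest_source_row is hypothetical and the real json2cleandf() may do this differently.

```python
import pandas as pd


def latest_source_row(alert_packet: dict) -> pd.DataFrame:
    """Return a single-row dataframe for the most recent diaSource in a packet."""
    sources = pd.DataFrame(alert_packet["diaSourcesList"])
    # Double brackets keep the result as a one-row dataframe, not a Series.
    latest = sources.loc[[sources["midpointMjdTai"].idxmax()]]
    return latest.reset_index(drop=True)


# Heavily shortened toy packet, using a few values from the table above.
packet = {
    "diaSourcesList": [
        {"diaSourceId": 169562325818802242, "midpointMjdTai": 60984.2, "psfFlux": 3336.28},
        {"diaSourceId": 169584317122478660, "midpointMjdTai": 60989.2, "psfFlux": 2424.67},
    ]
}
row = latest_source_row(packet)
```

Here row holds only the 60989.2 source, since it has the larger midpointMjdTai.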
Note on dataframe concatenation
In the code, latestSourceIds_df is built by collecting single-row dataframes in a list (latestSourceId_dfList) inside a loop,
and then concatenating that list once after the loop. This is because concatenating onto an increasingly large dataframe inside a loop is
computationally inefficient and the loop gets increasingly slow (or at least that has been my experience; maybe by the time you read this
they’ll have fixed it). AFAIK it is because each concatenation makes a copy of the dataframe, so as the table grows
the copying gets more and more expensive.
By contrast, Python list elements just “point” (there are no actual pointers in Python) to where each element is stored in memory, so appending to the list is super cheap. The final concatenation is then just one operation after the for-loop completes.
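The pattern looks roughly like the sketch below. The variable names latestSourceId_dfList and latestSourceIds_df match the ones mentioned above, but the loop body and the toy columns are made up, and the use of ignore_index=True is my assumption rather than something taken from the code.

```python
import pandas as pd

latestSourceId_dfList = []

for i in range(3):
    # In the real code this would be the latest-source row of one alert;
    # here it is just a toy single-row dataframe.
    single_row_df = pd.DataFrame({"diaObjectId": [i], "snr": [5.0 + i]})
    latestSourceId_dfList.append(single_row_df)  # cheap: stores a reference

# One concatenation (and so one copy) after the loop has finished.
latestSourceIds_df = pd.concat(latestSourceId_dfList, ignore_index=True)
```

The version to avoid is calling pd.concat([latestSourceIds_df, single_row_df]) inside the loop, because every iteration then re-copies all the rows accumulated so far.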
The features associated with the lightcurve history¶
Although the current version of r0b only uses the Lasair filter columns and the lightcurve features associated with the latest point to calculate a score, the annotator is built to provide additional flags and information that are not otherwise available in the Lasair filters.
Namely:
* n_gtXX (e.g. n_gt22): The number of lightcurve points with magnitude BRIGHTER than 22nd, 21st, 20th, 19th or 18th magnitude.
* brighterXX (e.g. brighter22): A boolean flag indicating whether there is at least one point brighter than 22nd, 21st, 20th, 19th or 18th magnitude.
* firstXX (e.g. first22): A boolean flag indicating whether this is the first time the transient has crossed a given magnitude threshold.
Important
Note that the “brighter22” (or other threshold) flag tells you if an object has EVER been brighter than that threshold. It may be fainter now. The reason is that in your filter you can easily check whether the flux for the current alert is brighter than a chosen threshold, but you do not have an easy way to set a condition on the lightcurve history. You would have to download the lightcurve data through your kafka stream and process the data. Since I’m processing all that data anyway, I might as well provide that information as part of the annotator.
These are constructed by the utility function utils.features.flux_threshold_features(), which is called within utils.features.json2cleandf(), where
these flags are joined to the final clean dataframe alongside filterOutput_df and latestSourceIds_df.
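To show the logic behind these three kinds of flag, here is a minimal sketch that works on a list of magnitudes for one object, ordered in time with the latest point last. It is not the actual flux_threshold_features() code: the function name threshold_flags is hypothetical, it takes magnitudes rather than fluxes, and my reading of firstXX (the latest point is brighter than the threshold and no earlier point was) is an assumption.

```python
import numpy as np

THRESHOLDS = (22, 21, 20, 19, 18)


def threshold_flags(mags):
    """Compute n_gtXX, brighterXX and firstXX from time-ordered magnitudes."""
    mags = np.asarray(mags, dtype=float)
    flags = {}
    for thr in THRESHOLDS:
        # Smaller magnitude means brighter; NaN comparisons evaluate to False.
        is_brighter = mags < thr
        n_brighter = int(is_brighter.sum())
        flags[f"n_gt{thr}"] = n_brighter
        flags[f"brighter{thr}"] = n_brighter > 0
        # Assumed meaning: only the latest point is brighter than the threshold.
        flags[f"first{thr}"] = (
            mags.size > 0 and bool(is_brighter[-1]) and n_brighter == 1
        )
    return flags


flags = threshold_flags([22.5, 21.4, 20.6])
```

For this toy lightcurve two points are brighter than 22nd magnitude, so brighter22 is True but first22 is False, whereas only the latest point is brighter than 21st magnitude, so first21 is True.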
The clean data frame that comes out of the json2cleandf() function looks something like this. Note I have truncated it, showing
that the leftmost columns come from the latestSourceIds_df dataframe, the middle columns come from the filterOutput_df dataframe,
and the rightmost columns are the flags associated with the lightcurve history.
| | diaSourceId | visit | detector | ssObjectId | … | diaObjectId | lastDiaSourceMjdTai | latestR | nDiaSources | ebv | … | is_above_20 | first_time_20 | N_above_20 | … |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 169562325818802242 | 2025110400302 | 183 | nan | … | 169562325818802242 | 2025110400302 | 183 | 169562325818802242 | nan | … | True | True | 1 | … |
Delta MJD feature and dropped columns¶
Before creating the csv that will be used to calculate the r0b score, we also calculate the delta MJD ['deltaDiaSourceMjdTai'], and
then we drop the following columns:
COLUMNS_TO_REMOVE = ['visit',
'tns_name',
'ssObjectId',
'parentDiaSourceId',
'midpointMjdTai',
'timeProcessedMjdTai',
'timeWithdrawnMjdTai',
'firstDiaSourceMjdTai',
'ra_sourceId',
'raErr_sourceId',
'decErr_sourceId',
'ra_dec_Cov_sourceId',
'UTC',
]
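As a sketch of this last step on a toy one-row dataframe: I am assuming here that the delta MJD is midpointMjdTai (current observation) minus firstDiaSourceMjdTai (first detection), which matches the description at the top of this page but is not copied from the code. The column list is shortened for the example, and errors="ignore" is my addition so that columns absent from the toy data do not raise.

```python
import pandas as pd

# Shortened version of the list above, for illustration only.
COLUMNS_TO_REMOVE = ["visit", "tns_name", "midpointMjdTai", "firstDiaSourceMjdTai", "UTC"]

# Toy stand-in for the clean dataframe (illustrative values).
clean_df = pd.DataFrame(
    {
        "diaObjectId": [169355603258900842],
        "visit": [2025110900310],
        "midpointMjdTai": [61069.1],
        "firstDiaSourceMjdTai": [60937.3],
        "UTC": ["2026-02-04 11:53:03"],
    }
)

# Assumed definition: days between first detection and the current observation.
clean_df["deltaDiaSourceMjdTai"] = (
    clean_df["midpointMjdTai"] - clean_df["firstDiaSourceMjdTai"]
)

# Drop the columns that are not wanted in the csv; missing ones are skipped.
features_df = clean_df.drop(columns=COLUMNS_TO_REMOVE, errors="ignore")
```

Note that the delta has to be computed before the drop, since both MJD columns it depends on are in the list of columns to remove.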
Lasair Virtual Research Assistants