LVRA:r0b¶

Note that this documentation is mostly aimed at developers who will have to maintain or further develop the code base.

Feature making logic¶

The code involved in making the features for r0b is split between the main pypeline/r0b_feature_maker.py module and the utils/features.py utility module. pypeline/r0b_feature_maker.py contains operational logic and some feature engineering/cleaning:

* [Operational]: loading settings, connecting to the database, querying the database to find which files (stems) have yet to be processed, logging, etc.
* [Feature engineering/cleaning]: calculating new quantities (delta MJD between first detection and current observation), dropping unwanted columns.

The utils/features.py module does all the dirty work of data cleaning/formatting, as well as the feature engineering on the full lightcurve of each alert.

To help visualise the journey of the data, I present below the tables of data at different stages of the feature making process, referring to them by the variable name used in the code. This page is mostly a helper for developers learning the code base, so they can visualise what the pandas dataframes look like at each stage.

The data from the JSON file¶

As mentioned in the Pipeline Overview, the data we get from the Kafka stream has a partially nested structure. There are a number of features we get directly from our Lasair filter (see query in Appendix A of the Pipeline Overview), such as the diaObjectId, latestR, ebv, etc. There is also a field called alerts which contains the full alert packet from Rubin* (except image cut-outs).

If you are not familiar with the structure of these data I recommend checking my Lasair Kafka tutorial notebook (Section 6: Full Packet Data).

In the utility function lvra.utils.features.json2cleandf(), the json_df variable contains the FULL JSON file loaded into a dataframe, and it looks something like this [NOTE - you can scroll sideways]:

	diaObjectId	lastDiaSourceMjdTai	latestR	nDiaSources	ebv	ra	decl	tns_name	absMag	absMagMJD	firstDiaSourceMjdTai	separationArcsec	direct_distance	distance	z	photoZ	photoZErr	physical_separation_kpc	sherlock_classifications	raErr	decErr	ra_dec_Cov	UTC	alerts
0	169760231711572988	61029.4	0.838478	2	0.103305	223.024	-41.4592	nan	nan	nan	61029.3	nan	nan	nan	nan	nan	nan	nan	ORPHAN	2.14003e-06	2.82303e-06	4.63027e-13	2026-02-04 11:51:46	NaN
1	169355603258900842	61069.1	0.996714	33	0.00812477	9.22276	-45.5966	nan	nan	nan	60937.3	nan	nan	nan	nan	nan	nan	nan	ORPHAN	1.37756e-05	2.42808e-05	-8.48339e-11	2026-02-04 11:53:03	{‘diaObject’: {‘diaObjectId’: 1697602317115729…

Here I’ve given two example alerts: one where the alerts field (Rubin alert packet) is empty, and one where it is populated with a nested dictionary (truncated here because the full packet is massive).

The alert packets should not be missing, but it has happened in the past during commissioning. The code is therefore built to be robust to this: processing continues, the affected event (a.k.a. diaObject) is ignored, and its diaObjectId is written to the logs.
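To make the skip-and-log behaviour concrete, here is a minimal sketch on a toy dataframe. The logger name, the usable_df variable and the toy packet contents are my own placeholders, not names taken from the code base; the real logic lives in json2cleandf() and may differ.

```python
import logging

import pandas as pd

# Hypothetical logger name, for illustration only.
logger = logging.getLogger("r0b_feature_maker_sketch")

# Toy stand-in for json_df: one row without a packet, one with a tiny packet.
json_df = pd.DataFrame(
    {
        "diaObjectId": [169760231711572988, 169355603258900842],
        "nDiaSources": [2, 33],
        "alerts": [None, {"diaObject": {"diaObjectId": 169355603258900842}}],
    }
)

# A populated packet is a dict; anything else (None/NaN) counts as missing.
has_packet = json_df["alerts"].apply(lambda packet: isinstance(packet, dict))

# Log the diaObjectId of every event that has to be skipped.
missing_ids = json_df.loc[~has_packet, "diaObjectId"].tolist()
for dia_object_id in missing_ids:
    logger.warning("Missing alert packet for diaObjectId %s; skipping", dia_object_id)

# Carry on with the rows that do have a packet.
usable_df = json_df.loc[has_packet].reset_index(drop=True)
```

Testing for a dict rather than for NaN means that any unexpected non-dict value is also skipped instead of crashing later, when the nested fields are accessed.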

The Lasair filter features¶

We can easily recover the columns that we selected in our Lasair filter query by saving all the columns except the last one (alerts) to a dataframe:

filterOutput_df = json_df.iloc[:,:-1]

We will later join this dataframe with the other features from the lightcurve.

I am not providing a table because it is the same as above, save for the last column being dropped.
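As a quick illustration of that positional slice, here is a sketch on a toy json_df with only a few of the real columns. The guard on the last column name is my own suggestion and not something I am claiming is in the code.

```python
import pandas as pd

# Toy json_df: a few Lasair filter columns followed by the alerts column.
json_df = pd.DataFrame(
    {
        "diaObjectId": [169355603258900842],
        "latestR": [0.996714],
        "ebv": [0.00812477],
        "alerts": [{"diaObject": {}}],
    }
)

# Positional slicing silently relies on "alerts" being the last column,
# so a cheap guard makes that assumption explicit.
assert json_df.columns[-1] == "alerts"

# Everything except the last column came from the Lasair filter query.
filterOutput_df = json_df.iloc[:, :-1]
```

An equivalent that does not depend on column order would be json_df.drop(columns="alerts").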

The features associated with the latest lightcurve point¶

To access all of the lightcurve features provided by Rubin (e.g. SNR, PSF-fitting related fields, etc.) we need to access the nested dictionaries in the alerts column, specifically the diaSourcesList table which contains the packet data for lightcurve points.

The first kind of feature we want to recover is all of the data about the most recent source. Below I show a truncated example of the dataframe containing the lightcurve features. If you want to see the full list of columns, go to the Rubin schema explorer; here is a link to the diaSource table schema. In the json2cleandf() function, the dataframe with this format is called latestSourceIds_df because it is constructed such that each row gives us the values for the last lightcurve point (diaSourceId) for a given astrophysical event (diaObjectId).

	diaSourceId	visit	detector	diaObjectId	ssObjectId	parentDiaSourceId	midpointMjdTai	ra	raErr	decErr	ra_dec_Cov	x	xErr	y	yErr	centroid_flag	apFlux	apFluxErr	apFlux_flag	apFlux_flag_apertureTruncated	isNegative	snr	psfFlux	psfFluxErr	…	decl
0	169562325818802242	2025110400302	183	169562325818802242	nan	0	60984.2	10.1346	3.25122e-05	3.94188e-05	-6.12409e-10	2451.23	0.917773	2333.35	0.539348	False	1246.3	1128.77	False	False	False	6.38317	3336.28	526.842	…	-42.4024
1	169584317122478660	2025110900310	178	169562325818802242	nan	0	60989.2	10.1347	1.41809e-05	1.91613e-05	-3.45086e-13	3742.17	0.345846	2336.37	0.345686	False	2743.54	485.996	False	False	False	10.5507	2424.67	228.346	…	-42.4024
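For orientation, picking the latest lightcurve point out of one packet could look like the sketch below. It assumes that diaSourcesList is a list of dictionaries (one per lightcurve point) and that "latest" means the largest midpointMjdTai; the toy packet keeps only a handful of fields, and sources_df and latest_row_df are illustrative names rather than names from the code.

```python
import pandas as pd

# Toy packet with two lightcurve points; real packets carry many more fields.
packet = {
    "diaSourcesList": [
        {"diaSourceId": 169562325818802242, "diaObjectId": 169562325818802242,
         "midpointMjdTai": 60984.2, "psfFlux": 3336.28},
        {"diaSourceId": 169584317122478660, "diaObjectId": 169562325818802242,
         "midpointMjdTai": 60989.2, "psfFlux": 2424.67},
    ]
}

# One row per lightcurve point.
sources_df = pd.DataFrame(packet["diaSourcesList"])

# idxmax() gives the index label of the most recent observation; the double
# brackets keep the result as a single-row DataFrame rather than a Series.
latest_row_df = sources_df.loc[[sources_df["midpointMjdTai"].idxmax()]]
```

Keeping the result as a single-row dataframe is what lets it be appended to a list and concatenated later.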

Note on dataframe concatenation

In the code, latestSourceIds_df is constructed by concatenating a list of single-row dataframes, which are created in a loop and stored in a list (latestSourceId_dfList) before the final concatenation. This is because concatenating onto an increasingly large dataframe inside a loop is computationally inefficient and the loop gets increasingly slow (or at least that has been my experience; maybe by the time you read this they’ll have fixed it). AFAIK it is because the concatenation operation makes a copy of the dataframe, so if you are dealing with an increasingly large table the copying gets more and more expensive.

By contrast, Python list elements just “point” (there are no actual pointers in Python) to where each element is stored in memory, so adding to the list is super cheap. The final concatenation is then a single operation after the for-loop completes.
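The pattern described in this note can be sketched as follows. The toy latest_points records are placeholders; only latestSourceId_dfList and latestSourceIds_df are names taken from the code.

```python
import pandas as pd

# Toy stand-ins for the latest lightcurve point of three different objects.
latest_points = [
    {"diaSourceId": 1, "diaObjectId": 10, "snr": 6.4},
    {"diaSourceId": 2, "diaObjectId": 20, "snr": 10.6},
    {"diaSourceId": 3, "diaObjectId": 30, "snr": 8.1},
]

latestSourceId_dfList = []
for point in latest_points:
    single_row_df = pd.DataFrame([point])
    # Appending to a list only stores a reference, so this step stays cheap.
    latestSourceId_dfList.append(single_row_df)

# One concatenation after the loop, instead of one (with a copy) per iteration.
latestSourceIds_df = pd.concat(latestSourceId_dfList, ignore_index=True)
```

With ignore_index=True the final dataframe gets a clean 0..N-1 index instead of N rows that are all labelled 0.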

The features associated with the lightcurve history¶

Although the current version of r0b only uses the Lasair filter columns and the lightcurve features associated with the latest point to calculate a score, the annotator is built to provide additional flags and information that are not otherwise available in the Lasair filters.

Namely:

n_gtXX (e.g. n_gt22): The number of lightcurve points with magnitude BRIGHTER than 22nd, 21st, 20th, 19th or 18th magnitude.
brighterXX (e.g. brighter22): A boolean flag indicating whether there is at least one point brighter than 22nd, 21st, 20th, 19th or 18th magnitude.
firstXX (e.g. first22): A boolean flag indicating whether this is the first time the transient has crossed a given magnitude threshold.

Important

Note that the “brighter22” (or other threshold) flag tells you if an object has EVER been brighter than that threshold. It may be fainter now. The reason is that in your filter you can easily check whether the flux for the current alert is brighter than a chosen threshold, but you do not have an easy way to set a condition on the lightcurve history. You would have to download the lightcurve data through your Kafka stream and process the data. Since I’m processing all that data anyway, I might as well provide that information as part of the annotator.

These are constructed by the utility function utils.features.flux_threshold_features(), which is called within utils.features.json2cleandf(), where these flags are joined to the final clean dataframe alongside filterOutput_df and latestSourceIds_df.
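To show the kind of computation involved, here is a rough sketch of how such flags could be derived from one lightcurve. It is not the implementation of flux_threshold_features(): the function name threshold_flags_sketch is made up, I assume the fluxes are in nJy (so that the AB magnitude is 31.4 - 2.5 log10(flux)), I assume they are ordered from oldest to newest, and I interpret firstXX as "the latest point is brighter than the threshold and no earlier point was".

```python
import numpy as np

THRESHOLDS = (22, 21, 20, 19, 18)


def threshold_flags_sketch(psf_flux_njy, thresholds=THRESHOLDS):
    """Illustrative only. psf_flux_njy: fluxes in nJy, oldest to newest (assumed)."""
    flux = np.asarray(psf_flux_njy, dtype=float)

    # Non-positive fluxes have no magnitude; treat them as infinitely faint.
    mag = np.full(flux.shape, np.inf)
    positive = flux > 0
    mag[positive] = 31.4 - 2.5 * np.log10(flux[positive])

    flags = {}
    for threshold in thresholds:
        brighter = mag < threshold  # brighter means a smaller magnitude
        flags[f"n_gt{threshold}"] = int(brighter.sum())
        flags[f"brighter{threshold}"] = bool(brighter.any())
        # First crossing: the latest point is brighter, no earlier point was.
        flags[f"first{threshold}"] = bool(
            brighter.size and brighter[-1] and not brighter[:-1].any()
        )
    return flags


# Toy lightcurve: magnitudes of roughly 23.9, 21.4 and 19.9.
flags = threshold_flags_sketch([1000.0, 10000.0, 40000.0])
```

In this toy lightcurve two points are brighter than 22nd magnitude, so brighter22 is set but first22 is not, whereas only the latest point is brighter than 21st magnitude, so first21 is set.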

The clean data frame that comes out of the json2cleandf() function looks something like this. Note I have truncated it, showing that the leftmost columns come from the latestSourceIds_df dataframe, the middle columns come from the filterOutput_df dataframe, and the rightmost columns are the flags associated with the lightcurve history.

	diaSourceId	visit	detector	ssObjectId	…	diaObjectId	lastDiaSourceMjdTai	latestR	nDiaSources	ebv	…	is_above_20	first_time_20	N_above_20	…
0	169562325818802242	2025110400302	183	nan	…	169562325818802242	2025110400302	183	169562325818802242	nan	…	True	True	1	…
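The join itself might look something like the sketch below. This is a guess at the mechanics, not a copy of json2cleandf(): I assume all three tables share a diaObjectId key, and I assume that columns present in both the source table and the filter output (such as ra) receive a _sourceId suffix on the source side, since names like ra_sourceId appear in the list of columns that are dropped later. The flags_df name and all toy values are made up.

```python
import pandas as pd

# Toy versions of the three tables, keyed on diaObjectId (assumption).
latestSourceIds_df = pd.DataFrame(
    {"diaObjectId": [10, 20], "diaSourceId": [101, 202], "ra": [10.1346, 9.2228]}
)
filterOutput_df = pd.DataFrame(
    {"diaObjectId": [10, 20], "latestR": [0.84, 0.99], "ra": [10.1347, 9.2227]}
)
flags_df = pd.DataFrame(  # hypothetical name for the lightcurve-history flags
    {"diaObjectId": [10, 20], "brighter22": [True, False]}
)

clean_df = (
    latestSourceIds_df
    # Overlapping columns from the source table get the "_sourceId" suffix;
    # the Lasair filter columns keep their original names.
    .merge(filterOutput_df, on="diaObjectId", how="inner", suffixes=("_sourceId", ""))
    .merge(flags_df, on="diaObjectId", how="inner")
)
```

The resulting column order (source columns, then filter columns, then flags) matches the left-to-right layout described above.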

Delta MJD feature and dropped columns¶

Before creating the csv that will be used to calculate the r0b score, we also calculate the delta MJD ['deltaDiaSourceMjdTai'], and then we drop the following columns:

COLUMNS_TO_REMOVE = ['visit',
                 'tns_name',
                 'ssObjectId',
                 'parentDiaSourceId',
                 'midpointMjdTai',
                 'timeProcessedMjdTai',
                 'timeWithdrawnMjdTai',
                 'firstDiaSourceMjdTai',
                 'ra_sourceId',
                 'raErr_sourceId',
                 'decErr_sourceId',
                 'ra_dec_Cov_sourceId',
                 'UTC',
                ]
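A minimal sketch of this last step is shown below. I assume here that the delta MJD is lastDiaSourceMjdTai minus firstDiaSourceMjdTai; the actual definition in pypeline/r0b_feature_maker.py may use a different "current observation" column. The list of columns is shortened and features_df is an illustrative name.

```python
import pandas as pd

# Shortened version of the list above, for illustration.
COLUMNS_TO_REMOVE = ["visit", "tns_name", "firstDiaSourceMjdTai", "UTC"]

# Toy clean dataframe with a handful of the real columns.
clean_df = pd.DataFrame(
    {
        "diaObjectId": [169355603258900842],
        "visit": [2025110900310],
        "lastDiaSourceMjdTai": [61069.1],
        "firstDiaSourceMjdTai": [60937.3],
        "UTC": ["2026-02-04 11:53:03"],
    }
)

# Days between first detection and the latest observation (assumed definition).
clean_df["deltaDiaSourceMjdTai"] = (
    clean_df["lastDiaSourceMjdTai"] - clean_df["firstDiaSourceMjdTai"]
)

# errors="ignore" skips listed columns that are absent (here: tns_name).
features_df = clean_df.drop(columns=COLUMNS_TO_REMOVE, errors="ignore")
```

The delta has to be computed before the drop, because firstDiaSourceMjdTai is itself one of the removed columns.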