Infrastructure

Architecture Overview

Data

The data live under the LVRA_DATA_ROOT directory. On the Oxford Lasair production servers that is /home/lasair/data/lvra (the variable is already defined in the .bashrc).

The directory structure follows the logic TYPE > YEAR > DATE.

Here TYPE refers to the file types:

  • JSON: contains the raw JSON alert data from Lasair LSST. These files are created by kafka_consumer.py, which ingests a broad Lasair filter called lvra_fodder.

  • csv: contains the feature CSV files created from the JSON alert data by each feature-making pipeline. For example, the r0b VRA has an r0b_feature_maker.py script.

  • logs: contains the log text files.

  • db: contains the SQLite database files. NOTE: this is a flat directory; there are no timestamped sub-directories here.

Code

The code is under LVRA_CODE_ROOT; it is the lvra python package. There are two kinds of scripts used in production: the python pipelines and their bash wrappers. The bash wrappers are there to set the environment so that everything works as expected when the code is run from cron. They also redirect stderr to stdout and write it to an error log file so that cron jobs do not fail silently and we can track what is going on.
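The wrapper pattern can be sketched as follows. This is a self-contained demo, not one of the real production wrappers: the pipeline here is a throwaway stand-in script in a temp directory, but the redirection idiom on the marked line is the one the wrappers use.

```shell
#!/usr/bin/env bash
# Demo of the cron-wrapper redirection pattern (throwaway paths, not the
# real production wrappers).
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/logs"

# Stand-in pipeline that writes to both stdout and stderr.
pipeline="$demo_root/demo_pipeline.sh"
printf '#!/usr/bin/env bash\necho "ok"\necho "error!" >&2\n' > "$pipeline"
chmod u+x "$pipeline"

# The key line: stderr folded into stdout, both appended to the error log,
# so nothing the pipeline prints is ever lost when run from cron.
"$pipeline" >> "$demo_root/logs/demo_pipeline.log" 2>&1

log_contents="$(cat "$demo_root/logs/demo_pipeline.log")"
echo "$log_contents"
```

Note the order of the redirections: `>> logfile 2>&1` sends stdout to the log first, then points stderr at the same place.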

The bash scripts are under LVRA_CODE_ROOT/bash and the python scripts are under LVRA_CODE_ROOT/lvra/pypeline (not a typo, a pun between python and pipeline. ha. ha.).

A lot of the code needs config files, which are stored under LVRA_CODE_ROOT/data. Any secret information such as tokens is stored directly in environment variables on the server; nothing secret goes in the files.

Useful Definitions

  • Status Codes: These are integer codes used in the database tables.

Status | Description
------ | -------------------------------
0      | Initialised
1      | Successfully Processed
21     | File Not Found (INPUT)
22     | File Not Found (OUTPUT)
23     | Not Expected Input Data Type
30     | Key Error (missing in data)
31     | Missing Columns (INPUT)
40     | Lasair Annotation Issue
41     | Failure to create Lasair client
99     | Generic Error

The 2X errors refer to a problem with the inputs or outputs such that they cannot be loaded.

The 3X errors correspond to issues with the data structure or content: the file _was_ loaded, but its contents cause problems.

Code 30 likely means the files you are trying to use don’t have the structure you expect. This is most likely due to a change in the alert or clean data format. Causes may vary: changes in LSST data, changes in Lasair, changes in your code.

The 4X errors are specific to Lasair.

  • Stems: These are the core names of our files and take the format YYYYMMDD_HHMMSS. Each path is constructed as TYPE/YEAR/DATE/stem.extension. The stem is also used as the primary key for the status tables (see below).
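Because the stem starts with YYYYMMDD, the YEAR and DATE path components can be sliced straight out of it. A minimal sketch (the stem value is just an example; LVRA_DATA_ROOT falls back to the documented prod location when unset):

```shell
# Derive the on-disk path of a JSON alert file from its stem
# (YYYYMMDD_HHMMSS), following the TYPE/YEAR/DATE/stem.extension layout.
LVRA_DATA_ROOT="${LVRA_DATA_ROOT:-/home/lasair/data/lvra}"

stem="20260127_105636"
year="${stem:0:4}"   # YYYY     -> 2026
day="${stem:0:8}"    # YYYYMMDD -> 20260127

path="$LVRA_DATA_ROOT/JSON/$year/$day/$stem.json"
echo "$path"
```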

Log files and SQLite database

Log files are written to LVRA_DATA_ROOT/logs/YEAR/DATE/[logname].log.

There is also a SQLite database to keep track of the status of various processes and the history of the predictions of various VRAs. It is located under LVRA_DATA_ROOT/db/log.db.

There are three kinds of tables:

  • Status tables: primary key is the stem and the columns are named after VRAs. Each cell contains a status code (see table above).

  • Mapping table: the primary key is the LSST diaObjectId, and the table contains the mapping between the diaObjectId and the stem name. For now there is only one mapping table, but I may need mappings between other ids in the future… Note that here the stem is not technically a foreign key because I have not enforced that the stem exists in the status tables.

  • Provenance table: to keep track of the history of our model inferences. [NOT IMPLEMENTED YET]
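How a pipeline records a status code in a status table can be sketched as below. This uses a throwaway database and a simplified schema (the real tables also have a timestamp column, see the schema further down); the stem is the example value from the tables that follow.

```shell
# Record a status code for a stem: initialise at 0, then mark success (1).
db="$(mktemp -d)/log.db"
sqlite3 "$db" "CREATE TABLE feature_making (stem TEXT PRIMARY KEY, r0b INTEGER);"

# Initialise the stem (status 0 = Initialised).
sqlite3 "$db" "INSERT INTO feature_making (stem, r0b) VALUES ('20260127_105636', 0);"

# Once processing finishes, flip it to 1 (Successfully Processed).
sqlite3 "$db" "UPDATE feature_making SET r0b = 1 WHERE stem = '20260127_105636';"

status_row="$(sqlite3 "$db" "SELECT stem, r0b FROM feature_making;")"
echo "$status_row"
```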

Table List

  • feature_making [Status table]: Records which alert files have been successfully processed. New columns can be added for each LVRA.

stem (str)      | r0b (int)
--------------- | ---------
20260127_105636 | 1
20260127_111728 | 0

  • annotating [Status table]: Records which alert files have been successfully annotated. New columns can be added for each LVRA.

stem (str)      | r0b (int)
--------------- | ---------
20260127_105636 | 0
20260127_111728 | 0

  • diaobjid_stems [Mapping table]: Records the mapping between LSST diaObjectId and the alert stem name.

diaObjectId (int)  | stem (str)
------------------ | ---------------
169755827469549632 | 20260128_154837
169843765851193449 | 20260128_154837
169843765880029260 | 20260128_154837
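A typical use of the mapping table is to go from a diaObjectId to its stem and on to a status. A sketch against a throwaway database seeded with rows from the examples above (schema simplified, timestamps omitted); a LEFT JOIN is used because, as noted, the stem is not an enforced foreign key, so a mapping row may exist without a matching status row.

```shell
# Look up the stem for a diaObjectId and join it to its r0b status.
db="$(mktemp -d)/log.db"
sqlite3 "$db" <<'SQL'
CREATE TABLE feature_making (stem TEXT PRIMARY KEY, r0b INTEGER);
CREATE TABLE diaobjid_stems (diaObjectId INTEGER PRIMARY KEY, stem TEXT NOT NULL);
INSERT INTO feature_making VALUES ('20260128_154837', 1);
INSERT INTO diaobjid_stems VALUES (169755827469549632, '20260128_154837');
SQL

lookup="$(sqlite3 "$db" "
SELECT m.diaObjectId, m.stem, f.r0b
FROM diaobjid_stems m
LEFT JOIN feature_making f ON f.stem = m.stem
WHERE m.diaObjectId = 169755827469549632;")"
echo "$lookup"
```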

Log files

[list log files and explain]

Infra Set-up Instructions

Directories

Here is a bash script that can be run in the LVRA_DATA_ROOT of choice to create the full directory sub-structure.

STEP 1: Save this as make_dirs.sh under the LVRA_DATA_ROOT directory.

#!/usr/bin/env bash

# Create the top-level TYPE directories.
mkdir -p JSON csv logs db

years_arr=(2026 2027 2028 2029 2030 2031 2032 2033 2034 2035)
work_dir=$(pwd)

today=$(date +"%Y%m%d")
today_year=$(date +"%Y")

# Create the YEAR sub-directories for each TYPE (db stays flat),
# plus today's DATE directory so the pipelines can write immediately.
for dir in JSON csv logs; do
    for year in "${years_arr[@]}"; do
        mkdir -p "$work_dir/$dir/$year"
    done
    mkdir -p "$work_dir/$dir/$today_year/$today"
done

STEP 2: Run the following commands

chmod u+x make_dirs.sh
./make_dirs.sh

SQLite database

Now go to the database subdirectory. From LVRA_DATA_ROOT:

cd db

Then copy-paste this code into a new file called log_schema.sql.

CREATE TABLE IF NOT EXISTS feature_making (
    stem TEXT PRIMARY KEY,
    timestamp TEXT NOT NULL DEFAULT current_timestamp,
    r0b INTEGER
);

CREATE TABLE IF NOT EXISTS annotating (
    stem TEXT PRIMARY KEY,
    timestamp TEXT NOT NULL DEFAULT current_timestamp,
    r0b INTEGER
);

CREATE TABLE IF NOT EXISTS diaobjid_stems (
    diaObjectId INTEGER PRIMARY KEY,
    stem TEXT NOT NULL,
    timestamp TEXT NOT NULL DEFAULT current_timestamp
);

CREATE TABLE IF NOT EXISTS provenance (
    ID INTEGER PRIMARY KEY,
    diaObjectId INTEGER,
    diaSourceId INTEGER,
    stem TEXT,
    score REAL,
    model_name TEXT,
    model_version TEXT,
    timestamp TEXT NOT NULL DEFAULT current_timestamp
);

CREATE TABLE IF NOT EXISTS threshold_flags_provenance(
    ID INTEGER PRIMARY KEY,
    diaObjectId INTEGER,
    diaSourceId INTEGER,
    stem TEXT,
    n_gt22 INTEGER,
    n_gt21 INTEGER,
    n_gt20 INTEGER,
    n_gt19 INTEGER,
    n_gt18 INTEGER,
    brighter22 INTEGER,
    brighter21 INTEGER,
    brighter20 INTEGER,
    brighter19 INTEGER,
    brighter18 INTEGER,
    first22 INTEGER,
    first21 INTEGER,
    first20 INTEGER,
    first19 INTEGER,
    first18 INTEGER,
    timestamp TEXT NOT NULL DEFAULT current_timestamp
);

Then run:

sqlite3 log.db < log_schema.sql
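It is worth sanity-checking that the schema actually loaded. A sketch using a throwaway database built from a minimal one-table schema file (interactively, the `.tables` dot-command does the same job):

```shell
# Load a schema file into a fresh database, then list the created tables.
tmp="$(mktemp -d)"
printf 'CREATE TABLE IF NOT EXISTS feature_making (stem TEXT PRIMARY KEY, r0b INTEGER);\n' \
    > "$tmp/log_schema.sql"
sqlite3 "$tmp/log.db" < "$tmp/log_schema.sql"

# Query sqlite_master for a deterministic, script-friendly table listing.
tables="$(sqlite3 "$tmp/log.db" "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")"
echo "$tables"
```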

Tip

To ensure tables are nicely formatted when you run sqlite3 from the command line, add a .sqliterc file in your home directory containing .headers on and .mode column (on two separate lines).
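The tip above amounts to:

```shell
# Create the ~/.sqliterc described above. A temporary HOME is used here so
# the example has no side effects; drop that line to write the real file.
HOME="$(mktemp -d)"
printf '.headers on\n.mode column\n' > "$HOME/.sqliterc"

sqliterc_contents="$(cat "$HOME/.sqliterc")"
echo "$sqliterc_contents"
```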

Local dev env

I have not set up Docker (and will not for now), but I have local directories that mimic what is on the remote.

Most importantly, the local and remote environments need to be the same.

conda

For the python environment I exported a yaml file of the remote conda environment:

conda env export --no-builds > lvra_env.yml

Then I copied it into my local package:

scp lasair@oxdb1:code/lvra_env.yml ./software/lvra

Then I created the environment with:

conda env create -f software/lvra/lvra_env.yml -n lvra

env variables

export LVRA_SETTINGS='/home/stevance/software/lvra/data/public_settings_local.yaml'
export LVRA_TRAINING_ROOTDIR='/home/stevance/Science/lvra-training/'
export LASAIR_LSST_TOKEN=[see my .bashrc]
export LVRA_TNS_API_KEY=[see my .bashrc]