Infrastructure
==================================

Architecture Overview
------------------------

Data
~~~~~~

The data live under the ``LVRA_DATA_ROOT`` directory. On the Oxford Lasair
production servers that is ``/home/lasair/data/lvra`` (it is already defined
in the ``.bashrc``). The directory structure logic is as follows:

TYPE > YEAR > DATE

Here the type refers to the file types:

* ``JSON``: contains the raw JSON **alert data** from Lasair LSST. These
  files are created by ``kafka_consumer.py``, which ingests a broad Lasair
  filter called ``lvra_fodder``.
* ``csv``: contains the **feature csv files** created from the JSON alert
  data by each feature-making pipeline. For example, for the ``r0b`` VRA
  there is an ``r0b_feature_maker.py`` script.
* ``logs``: contains log text files.
* ``db``: contains the SQLite database files. **NOTE**: this is a flat
  directory, with no timestamped sub-directories.

Code
~~~~~~

The code is under ``LVRA_CODE_ROOT``; it is the ``lvra`` python package.
There are two kinds of scripts used in production: the python pipelines and
their bash wrappers. The bash wrappers are there to set the environment so
that when the code is run from cron everything works as expected. They also
redirect stderr to stdout and write it to an error log file, so that cron
jobs do not fail silently and we can track what is going on.

The bash scripts are under ``LVRA_CODE_ROOT/bash`` and the python scripts are
under ``LVRA_CODE_ROOT/lvra/pypeline`` (not a typo, a pun between python and
pipeline. ha. ha.).

A lot of the code needs config files, which are stored under
``LVRA_CODE_ROOT/data``. Any secret information such as tokens is stored
directly in environment variables on the server; nothing goes in the files.
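The TYPE > YEAR > DATE layout above can be sketched as a small path helper.
This is a hypothetical illustration (``build_path`` is not a function in the
``lvra`` package); it only shows how a file path is derived from a stem:

.. code-block:: python

   import os
   from pathlib import Path


   def build_path(file_type: str, stem: str, extension: str) -> Path:
       """Build TYPE/YEAR/DATE/stem.extension under LVRA_DATA_ROOT.

       The stem follows the YYYYMMDD_HHMMSS convention, so the YEAR and
       DATE path components can be read straight off it. Applies to the
       timestamped types (JSON, csv); the db/ directory is flat.
       """
       root = Path(os.environ["LVRA_DATA_ROOT"])
       date = stem.split("_")[0]   # e.g. "20260127"
       year = date[:4]             # e.g. "2026"
       return root / file_type / year / date / f"{stem}.{extension}"

For example, ``build_path("csv", "20260127_105636", "csv")`` resolves to
``LVRA_DATA_ROOT/csv/2026/20260127/20260127_105636.csv``.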
Useful Definitions
------------------------

* **Status Codes**: These are integers used in the database tables.

  +--------+---------------------------------+
  | Status | Description                     |
  +========+=================================+
  | 0      | Initialised                     |
  +--------+---------------------------------+
  | 1      | Successfully Processed          |
  +--------+---------------------------------+
  | 21     | File Not Found (INPUT)          |
  +--------+---------------------------------+
  | 22     | File Not Found (OUTPUT)         |
  +--------+---------------------------------+
  | 23     | Not Expected Input Data Type    |
  +--------+---------------------------------+
  | 30     | Key Error (missing in data)     |
  +--------+---------------------------------+
  | 31     | Missing Columns (INPUT)         |
  +--------+---------------------------------+
  | 40     | Lasair Annotation Issue         |
  +--------+---------------------------------+
  | 41     | Failure to create Lasair client |
  +--------+---------------------------------+
  | 99     | Generic Error                   |
  +--------+---------------------------------+

  The ``2X`` errors refer to a problem with the inputs or outputs such that
  they cannot be loaded. The ``3X`` errors correspond to issues with the data
  structure or content: the file *was* loaded, but its contents cause
  problems. Code ``30`` likely means the files you are trying to use do not
  have the structure you expect, most likely due to a change in the alert or
  clean data format. Causes may vary: changes in LSST data, changes in
  Lasair, changes in your code. The ``4X`` errors are specific to Lasair.

* **Stems**: These are the core names of our files and take the format
  ``YYYYMMDD_HHMMSS``. Each path name is constructed with the format
  ``TYPE/YEAR/DATE/stem.extension``. The stem is also used as the primary key
  for the status tables (see below).

Log files and SQLite database
------------------------------

Log files are written to ``LVRA_DATA_ROOT/logs/YEAR/DAY/[logname].log``.
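The two definitions above can be captured in a few lines of Python. This is a
hypothetical sketch (``STATUS`` and ``is_valid_stem`` are illustrative names,
not part of the ``lvra`` package):

.. code-block:: python

   from datetime import datetime

   # Status codes used in the log database (see the table above).
   STATUS = {
       0: "Initialised",
       1: "Successfully Processed",
       21: "File Not Found (INPUT)",
       22: "File Not Found (OUTPUT)",
       23: "Not Expected Input Data Type",
       30: "Key Error (missing in data)",
       31: "Missing Columns (INPUT)",
       40: "Lasair Annotation Issue",
       41: "Failure to create Lasair client",
       99: "Generic Error",
   }


   def is_valid_stem(stem: str) -> bool:
       """Check that a stem follows the YYYYMMDD_HHMMSS convention."""
       try:
           datetime.strptime(stem, "%Y%m%d_%H%M%S")
           return True
       except ValueError:
           return False

Using ``strptime`` rather than a regex also rejects stems that look right but
encode impossible dates (e.g. month 13).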
There is also a SQLite database to keep track of the status of various
processes and of the history of the predictions of various VRAs. It is
located at ``LVRA_DATA_ROOT/db/log.db``. There are three kinds of tables:

* Status tables: the primary key is the stem and the columns are named after
  VRAs. Each cell contains a status code (see table above).
* Mapping table: the primary key is the LSST ``diaObjectId`` and the table
  contains the mapping between the ``diaObjectId`` and the stem name. For now
  there is only one mapping table but I may need mappings between other ids
  in the future... Note that here the stem is not technically a foreign key
  because I have not enforced that the stem exists in the status tables.
* Provenance table: to keep track of the history of our model inferences.
  [NOT IMPLEMENTED YET]

Table List
~~~~~~~~~~~~~~

* ``feature_making`` [Status table]: Records which alert files have been
  successfully processed. New columns can be added for each LVRA.

  +-----------------+----------+
  | stem (str)      | r0b (int)|
  +=================+==========+
  | 20260127_105636 | 1        |
  +-----------------+----------+
  | 20260127_111728 | 0        |
  +-----------------+----------+
  | ............... | ........ |
  +-----------------+----------+

* ``annotating`` [Status table]: Records which alert files have been
  successfully annotated. New columns can be added for each LVRA.

  +-----------------+----------+
  | stem (str)      | r0b (int)|
  +=================+==========+
  | 20260127_105636 | 0        |
  +-----------------+----------+
  | 20260127_111728 | 0        |
  +-----------------+----------+
  | ............... | ........ |
  +-----------------+----------+

* ``diaobjid_stems`` [Mapping table]: Records the mapping between the LSST
  ``diaObjectId`` and the alert stem name.
  +--------------------+-----------------+
  | diaObjectId (int)  | stem (str)      |
  +====================+=================+
  | 169755827469549632 | 20260128_154837 |
  +--------------------+-----------------+
  | 169843765851193449 | 20260128_154837 |
  +--------------------+-----------------+
  | 169843765880029260 | 20260128_154837 |
  +--------------------+-----------------+

Log files
~~~~~~~~~~~~~~~~~

[list log files and explain]

Infra Set-up Instructions
-------------------------------

Directories
~~~~~~~~~~~~~~

Here is a bash script that can be run in the ``LVRA_DATA_ROOT`` of choice to
create the full directory sub-structure.

**STEP 1**: Save this as ``make_dirs.sh`` under the ``LVRA_DATA_ROOT``
directory.

.. code-block:: bash

   #!/usr/bin/env bash

   # Top-level type directories
   mkdir -p JSON
   mkdir -p csv
   mkdir -p logs
   mkdir -p db

   years_arr=(2026 2027 2028 2029 2030 2031 2032 2033 2034 2035)

   work_dir=$(pwd)
   today=$(date +"%Y%m%d")
   today_year=$(date +"%Y")

   # Year sub-directories for each timestamped type, plus today's date dir.
   # db/ is deliberately left flat.
   for dir in JSON csv logs; do
       for year in "${years_arr[@]}"; do
           mkdir -p "$work_dir/$dir/$year"
       done
       mkdir -p "$work_dir/$dir/$today_year/$today"
   done

**STEP 2**: Run the following commands

.. code-block:: bash

   chmod u+x make_dirs.sh
   ./make_dirs.sh

SQLite database
~~~~~~~~~~~~~~~~~

Now go to the database subdirectory. From ``LVRA_DATA_ROOT``:

.. code-block:: bash

   cd db

Then copy paste this code into a new file called ``log_schema.sql``.

.. code-block:: sql

   CREATE TABLE IF NOT EXISTS feature_making (
       stem TEXT PRIMARY KEY,
       timestamp TEXT NOT NULL DEFAULT current_timestamp,
       r0b INTEGER
   );

   CREATE TABLE IF NOT EXISTS annotating (
       stem TEXT PRIMARY KEY,
       timestamp TEXT NOT NULL DEFAULT current_timestamp,
       r0b INTEGER
   );

   CREATE TABLE IF NOT EXISTS diaobjid_stems (
       diaObjectId INTEGER PRIMARY KEY,
       stem TEXT NOT NULL,
       timestamp TEXT NOT NULL DEFAULT current_timestamp
   );

   CREATE TABLE IF NOT EXISTS provenance (
       ID INTEGER PRIMARY KEY,
       diaObjectId INTEGER,
       diaSourceId INTEGER,
       stem TEXT,
       score REAL,
       model_name TEXT,
       model_version TEXT,
       timestamp TEXT NOT NULL DEFAULT current_timestamp
   );

   CREATE TABLE IF NOT EXISTS threshold_flags_provenance (
       ID INTEGER PRIMARY KEY,
       diaObjectId INTEGER,
       diaSourceId INTEGER,
       stem TEXT,
       n_gt22 INTEGER,
       n_gt21 INTEGER,
       n_gt20 INTEGER,
       n_gt19 INTEGER,
       n_gt18 INTEGER,
       brighter22 INTEGER,
       brighter21 INTEGER,
       brighter20 INTEGER,
       brighter19 INTEGER,
       brighter18 INTEGER,
       first22 INTEGER,
       first21 INTEGER,
       first20 INTEGER,
       first19 INTEGER,
       first18 INTEGER,
       timestamp TEXT NOT NULL DEFAULT current_timestamp
   );

Then run:

.. code-block:: bash

   sqlite3 log.db < log_schema.sql

.. tip::

   To ensure tables are nicely formatted when you run sqlite3 from the
   command line, add a ``.sqliterc`` file in your home directory with the
   following content: ``.headers on`` and ``.mode column`` (on two separate
   lines).

Local dev env
----------------

I have not set up a docker (and will not for now) but I have local
directories that mimic what is on remote. Most importantly, I have
environments that need to be the same.

conda
~~~~~~~~~~~~~~~~~

For the python environment I exported a ``yaml`` file of the remote conda
environment:

.. code-block:: bash

   conda env export --no-builds > lvra_env.yml

Then I copied it to my local package:

.. code-block:: bash

   scp lasair@oxdb1:code/lvra_env.yml ./software/lvra

Then I created the environment with:

.. code-block:: bash

   conda env create -f software/lvra/lvra_env.yml -n lvra

env variables
~~~~~~~~~~~~~~~~~

.. code-block:: bash

   export LVRA_SETTINGS='/home/stevance/software/lvra/data/public_settings_local.yaml'
   export LVRA_TRAINING_ROOTDIR='/home/stevance/Science/lvra-training/'
   export LASAIR_LSST_TOKEN=[see my .bashrc]
   export LVRA_TNS_API_KEY=[see my .bashrc]
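Since the pipelines depend on these variables being set both locally and on
the server, a small start-up check can catch a missing one before a cron job
fails silently. This is a hypothetical sketch (``check_env`` is not part of
the ``lvra`` package):

.. code-block:: python

   import os

   # Variables the pipelines rely on: LVRA_DATA_ROOT and LVRA_CODE_ROOT come
   # from the Architecture Overview, the rest from this section.
   REQUIRED_ENV_VARS = [
       "LVRA_DATA_ROOT",
       "LVRA_CODE_ROOT",
       "LVRA_SETTINGS",
       "LASAIR_LSST_TOKEN",
       "LVRA_TNS_API_KEY",
   ]


   def check_env(required=REQUIRED_ENV_VARS):
       """Return the names of required variables missing from the environment."""
       return [name for name in required if not os.environ.get(name)]


   # e.g. at the top of a pipeline entry point:
   # missing = check_env()
   # if missing:
   #     raise RuntimeError(f"missing env vars: {', '.join(missing)}")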