classified_to_seabass module
This module contains a class to convert CNN-classified and/or validated datasets to a SeaBASS-compatible format.
Imports:
- pandas
- numpy
- os
- utopia_pipeline_tools as upt
- list_files_in_blob from upt.azure_blob_tools
- retrieve_filepaths_from_local from upt.ifcb_data_tools
MakeSeaBASS(metadata_filepath, class_filepath, experiment, cruise, location='blob', container=None, folder_filepath=None, investigator_info=upt.default_investigators, stations=True, flags=False, doc_list=None, data_status='final', trigger_mode='both', notes=None, sample_filepath=None, filepaths=None, config_info=upt.config_info, cal_ratio=upt.calibration_ratio)
When initialized, this class loads the classification and metadata csv files and sets up the header values that are consistent across all samples. It also retrieves the filenames of all sample csv files from the sample subfolders within the ‘ml’ folder.
\_\_init\_\_
Parameters:
- metadata_filepath (str): Filepath to the metadata file used to convert IFCB data from raw to processed. This file must be a csv and have columns containing lat/long, temperature, salinity, and sample volume, concentration, and flag information.
- class_filepath (str): Filepath to the .csv file generated by the code that applies the CNN to the ml folder of IFCB images. This data must include ‘filepath’, ‘pred_label’, and probability columns labelled ‘0’ to ‘9’.
- experiment (str): The name of the overall experiment. This must match the experiment name in SeaBASS records.
- cruise (str): The name of the specific cruise where data was collected. This needs to match the name of the cruise in SeaBASS records.
- location (str, kwarg): Indicates where the data is stored. Can be ‘blob’ or ‘local’. Default is ‘blob’.
- folder_filepath (str, conditional): If location is set to be ‘local’, use this input to specify the location of the ‘ml’ folder that contains the IFCB data on your local machine. Default is None.
- container (str, conditional kwarg): If location is set to be ‘blob’, this input indicates which blob container the data is stored in.
- investigator_info (dict): Investigator information saved in a dictionary. The dictionary must be of the form: {Firstname_Lastname: [Org, email], Firstname_Lastname: [Org, email], …}. Include no spaces in the investigator or organization names.
- stations (bool, optional): Indicates whether the metadata includes station information. Input should be True or False. Default True.
- flags (bool, optional): Indicates whether the metadata includes flag information. Default False.
- doc_list (str): A comma-separated string of documents associated with the SeaBASS submission. Should include the names of the protocol, checklist, and taxonomic ID documents at minimum. Include no spaces.
- data_status (str, kwarg): Indicates whether the data is preliminary or final. This input should be either ‘preliminary’ or ‘final’. Not case sensitive. Default ‘final’.
- trigger_mode (str, kwarg): Indicates whether the data was collected on the ‘chlorophyll’, ‘scattering’, or ‘both’ setting. Default ‘both’.
- notes (str, optional): A location to insert any additional notes in the notes section of the header.
- sample_filepath (str, optional): Filepath to the sample-specific metadata file. This is an optional input if you want to use the tool for just one SeaBASS file.
- filepaths (DataFrame, optional): Filepaths to the sample-specific metadata files. An option if you already have these filepaths and don’t want to spend time retrieving them from the blob/folder. Must have column name ‘filepath’.
- config_info (dict): Dictionary of configuration information. Template in the __init__ file of this package.
- cal_ratio (float, optional): The pixels:micrometer ratio of IFCB images. Default is 2.7488.
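The constraints on the string and dictionary inputs above can be checked before the class is initialized. The following is a minimal sketch, not the module's own validation: the helper names `normalize_choice` and `check_investigator_info` are hypothetical, and applying case-insensitive matching to `location` and `trigger_mode` (the documentation only states it for `data_status`) is an assumption.

```python
def normalize_choice(value, allowed, name):
    """Lower-case a string argument and check it against the allowed values."""
    cleaned = str(value).strip().lower()
    if cleaned not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
    return cleaned


def check_investigator_info(info):
    """Check the {Firstname_Lastname: [Org, email]} shape and the no-spaces rule."""
    for name, details in info.items():
        if len(details) != 2:
            raise ValueError(f"{name}: expected [Org, email], got {details!r}")
        org, _email = details
        if " " in name or " " in org:
            raise ValueError(f"{name}: names and organizations must not contain spaces")
    return True


# Example inputs (placeholder values, not real investigators).
location = normalize_choice("Blob", {"blob", "local"}, "location")
data_status = normalize_choice("FINAL", {"preliminary", "final"}, "data_status")
trigger_mode = normalize_choice("both", {"chlorophyll", "scattering", "both"}, "trigger_mode")
investigators_ok = check_investigator_info(
    {"Jane_Doe": ["Example_Org", "jane.doe@example.org"]}
)
```

Running checks like these up front gives a clear error message before any files are loaded from the blob or local folder.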
\_\_init\_\_
Saved Values:
- self.blob (bool): True if images are stored in the Azure blob.
- self.cal_ratio (float): Pixel/micrometer ratio.
- self.classification_df (DataFrame): Loaded dataframe containing image filepaths, classification probabilities, and a numerical predicted label corresponding to one of the groups in upt’s label_dict.
- self.flags_bool (bool): True if flags are included in the metadata file.
- self.header_values (dict): Dictionary of values that are consistent over the entire dataset. Used to populate the SeaBASS file header. Values include investigators, affiliations, emails, experiment, cruise, documents, calibration_file, data_type, data_status, water_depth, pixel_per_um, blob_location, associated_archives, associated_archive_types, length_representation_instrument_varname, width_representation_instrument_varname, missing, delimiter, and ifcb_trigger_mode.
- self.metadata_df (DataFrame): Loaded dataframe of dataset metadata.
- self.notes (str, conditional): If notes were added during the initialization of the class, this is where they are stored.
- self.notes_bool (bool): True if notes were added.
- self.sample_df (DataFrame, conditional): Only present if the class was initialized with sample_filepath != None.
- self.sample_filenames (DataFrame): Dataframe of sample metadata csv file names.
- self.stations_bool (bool): True if station metadata is present.
Functions:
preview_seabass(self, n=0):
Generates the string of a single SeaBASS file. Uses the first csv in the full dataset or the single sample csv depending on how the class was initialized.
Parameters:
- n (int, optional): The integer index of the sample you want to preview. Defaults to 0, the first file in the filename list.
Returns:
- sb_string (str): A SeaBASS-formatted string with all header information and image data.
make_seabass_files(self):
Loops over all samples in the dataset, generating a SeaBASS file for each sample and saving it to a local folder called: {cruise}SeaBASS{data_status}.
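The save step of such a loop can be sketched as follows. This is an illustration only, not the module's implementation: `save_seabass_strings` is a hypothetical helper, the output folder name is passed in rather than derived from the cruise and data status, and the `.sb` file extension is an assumption.

```python
import tempfile
from pathlib import Path


def save_seabass_strings(strings_by_name, output_dir):
    """Write each SeaBASS-formatted string to its own file in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the local folder if needed
    written = []
    for file_name, sb_string in strings_by_name.items():
        path = out / file_name
        path.write_text(sb_string, encoding="utf-8")
        written.append(path)
    return written


# Example with a placeholder string, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    files = save_seabass_strings(
        {"sample_a.sb": "/begin_header\n/end_header\n1,2\n"},
        Path(tmp) / "example_output",
    )
    n_written = len(files)
    first_content = files[0].read_text(encoding="utf-8")
```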
compile_header(self, sample_filename, sample_df):
Calls the values stored in the header_values dictionary, generates some additional sample-specific values, and puts them in the header format required for SeaBASS files.
Parameters:
- sample_filename (str): Filename of the sample metadata csv.
- sample_df (DataFrame): Dataframe that stores sample- and image-specific data.
Saved Values:
- self.filename (str): Name of the SeaBASS file being generated.
Returns:
- header (str): The header section of the SeaBASS file.
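As a rough illustration of the header layout, the sketch below turns a dictionary of values into `/key=value` lines between `/begin_header` and `/end_header`. The function name `build_header`, the example values, and the field and unit names are hypothetical; the real method also generates sample-specific values that are not shown here.

```python
def build_header(header_values, fields, units):
    """Assemble a SeaBASS-style header string from key/value pairs."""
    lines = ["/begin_header"]
    for key, value in header_values.items():
        lines.append(f"/{key}={value}")
    # The /fields and /units lines describe the comma-separated data columns.
    lines.append("/fields=" + ",".join(fields))
    lines.append("/units=" + ",".join(units))
    lines.append("/end_header")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
header = build_header(
    {
        "investigators": "Jane_Doe",
        "experiment": "Example_Experiment",
        "cruise": "Example_Cruise",
        "data_status": "final",
        "missing": "-9999",
        "delimiter": "comma",
    },
    fields=["example_field_a", "example_field_b"],
    units=["none", "none"],
)
```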
extract_sample_ID(self, filepath):
Extracts the sample ID (of the form DYYYYMMDDTHHMMSS_IFCB###). This is the name of the sample folder and is included in the filenames of all sample images. Assumes IFCB number has three digits.
Parameters:
- filepath (str): Filepath to the sample folder or to the sample csv.
Returns:
- sampleID (str): A unique sample identifier: a string of letters and numbers encoding the date and time the sample was taken and the number of the IFCB instrument used to collect it.
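One way to pull an ID of the form DYYYYMMDDTHHMMSS_IFCB### out of a path is a regular expression. This is a sketch under the same three-digit assumption, not necessarily how the method is implemented; `find_sample_id` and the example path are hypothetical.

```python
import re

# D + 8-digit date + T + 6-digit time + _IFCB + 3-digit instrument number.
SAMPLE_ID_PATTERN = re.compile(r"D\d{8}T\d{6}_IFCB\d{3}")


def find_sample_id(filepath):
    """Return the first sample ID found in a filepath, or raise if none is present."""
    match = SAMPLE_ID_PATTERN.search(filepath)
    if match is None:
        raise ValueError(f"No sample ID found in {filepath!r}")
    return match.group(0)


example_id = find_sample_id(
    "ml/D20230615T123456_IFCB123/D20230615T123456_IFCB123.csv"
)
# example_id == "D20230615T123456_IFCB123"
```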
extract_sample_info(self, filepath):
Uses the filepath to extract information about the sample.
Parameters:
- filepath (str): Filepath to the sample folder or to the sample csv.
Returns:
- date (str): Date in the form YYYYMMDD.
- time (str): Time of collection in the 24-hr form HH:MM:SS[GMT].
- ifcb_number (str): IFCB number. Assumes the instrument number has three digits; if it has more or fewer, adjust the n_digits value in the code.
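The three return values can be derived from a sample ID by slicing, as in this sketch. The function name `split_sample_id` is hypothetical and it takes a sample ID rather than a full filepath; the `n_digits` parameter mirrors the value mentioned above.

```python
def split_sample_id(sample_id, n_digits=3):
    """Split DYYYYMMDDTHHMMSS_IFCB### into date, formatted time, and IFCB number."""
    date_time, instrument = sample_id.split("_IFCB")
    date = date_time[1:9]        # skip the leading 'D'
    raw_time = date_time[10:16]  # skip the 'T' separator
    time = f"{raw_time[0:2]}:{raw_time[2:4]}:{raw_time[4:6]}[GMT]"
    ifcb_number = instrument[:n_digits]
    return date, time, ifcb_number


date, time, ifcb_number = split_sample_id("D20230615T123456_IFCB123")
# date == "20230615", time == "12:34:56[GMT]", ifcb_number == "123"
```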
extract_investigator_info(self, dictionary=upt.default_investigators):
Retrieves investigator information from the config dictionary.
Parameters:
- dictionary (dict): Dictionary containing investigator names, affiliations, and emails.
Returns:
- investigators (str): Comma-separated string of investigators with no spaces.
- affiliations_list (str): Comma-separated string of affiliated organizations with no spaces.
- emails_list (str): Comma-separated string of the investigators’ emails with no spaces.
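Given a dictionary of the form {Firstname_Lastname: [Org, email]}, the three comma-separated strings can be built as sketched below. `join_investigator_info` is a hypothetical name and the entries are placeholders.

```python
def join_investigator_info(info):
    """Build comma-separated investigator, affiliation, and email strings."""
    investigators = ",".join(info.keys())
    affiliations = ",".join(org for org, _email in info.values())
    emails = ",".join(email for _org, email in info.values())
    return investigators, affiliations, emails


investigators, affiliations_list, emails_list = join_investigator_info(
    {
        "Jane_Doe": ["Example_Org", "jane.doe@example.org"],
        "John_Roe": ["Other_Org", "john.roe@example.org"],
    }
)
# investigators == "Jane_Doe,John_Roe"
```

Because dictionaries preserve insertion order, the three strings stay aligned entry by entry.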
extract_metadata_for_header(self, sample_ID, sample_df, stations_bool, flags_bool):
Extracts sample-specific metadata values from the sample csv and the general metadata csv.
Parameters:
- sample_ID (str): Unique identifier of a sample.
- sample_df (DataFrame): The sample dataframe saved in the sample folder within the ml folder. Should contain ‘Latitude’, ‘Longitude’, ‘Temperature’, ‘Salinity’, ‘Depth’, and ‘Concentration’ columns.
- stations_bool (bool): Indicates whether or not to look for a ‘Station’ column in the metadata file.
- flags_bool (bool): Indicates whether or not to look for a ‘Flag’ column in the metadata file.
Returns:
- data_type (str): (DEPRECATED: Now using ‘taxonomy’ for all samples) Describes how the data was taken, i.e. in-line.
- lat (float): Latitude reading at time of sample.
- long (float): Longitude reading at time of sample.
- temp (float): Temperature measurement.
- salinity (float): Salinity measurement.
- depth (float): Depth at which the sample was retrieved.
- vol_sampled (float/int): Volume of the sample taken (mL).
- vol_imaged (float): Volume of water from the sample imaged (mL).
- flag (int/str): Integer representing a flag describing the sample conditions. For instance, if there was bad alignment on the IFCB for a specific sample, that would be flagged.
- station (int/str): Indicates which station the sample was taken at.
- concentration (float/int): The sample’s concentration.
run_sample_checks(self, sample_filename, sample_df):
Checks that the number of rows in each associated data file is as expected. The sample and classification dataframes should have the same number of rows pertaining to the given sampleID, and the metadata dataframe should have exactly one.
Parameters:
- sample_filename (str): Name of the sample csv file.
- sample_df (DataFrame): The sample-specific dataframe stored in the sample folder.
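The row-count logic can be sketched without reference to the actual dataframes. Here the inputs are plain iterables of strings (for example, a ‘filepath’ column); matching rows by whether the path contains the sample ID, and matching metadata rows by an exact ID, are assumptions, and both function names are hypothetical.

```python
def count_rows_for_sample(filepaths, sample_id):
    """Count entries whose path contains the sample ID."""
    return sum(1 for path in filepaths if sample_id in str(path))


def check_row_counts(sample_paths, classification_paths, metadata_ids, sample_id):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    n_sample = count_rows_for_sample(sample_paths, sample_id)
    n_classified = count_rows_for_sample(classification_paths, sample_id)
    n_metadata = sum(1 for value in metadata_ids if str(value) == sample_id)
    if n_sample != n_classified:
        problems.append(
            f"{sample_id}: {n_sample} sample rows but {n_classified} classification rows"
        )
    if n_metadata != 1:
        problems.append(f"{sample_id}: expected 1 metadata row, found {n_metadata}")
    return problems


sid = "D20230615T123456_IFCB123"
images = [f"ml/{sid}/{sid}_00001.png", f"ml/{sid}/{sid}_00002.png"]
ok = check_row_counts(images, images, [sid], sid)            # consistent inputs
bad = check_row_counts(images, images[:1], [sid, sid], sid)  # two mismatches
```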
structure_data(self, sample_filename, sample_df):
Compiles and converts sample metadata into the SeaBASS format. Each line of data represents a single image, with comma-separated attributes indicated by the /fields value in the header.
Parameters:
- sample_filename (str): Name of the sample csv file.
- sample_df (DataFrame): The sample-specific dataframe stored in the sample folder.
Returns:
- data_string (str): The string of SeaBASS-formatted data extracted from the sample_df.
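The one-line-per-image layout can be illustrated as follows. This sketch uses a list of dictionaries in place of the sample dataframe; the function name, the field names, and the `-9999` placeholder for missing values are all assumptions made for the example.

```python
def rows_to_seabass_lines(rows, fields, missing="-9999"):
    """Format each row as one comma-separated line, in the order given by fields."""
    lines = []
    for row in rows:
        values = []
        for field in fields:
            value = row.get(field)
            # Substitute the missing-value placeholder for absent entries.
            values.append(missing if value is None else str(value))
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


# Hypothetical field names and values.
data_string = rows_to_seabass_lines(
    [
        {"image_id": "img_00001", "pred_label": 3, "length": 12.5},
        {"image_id": "img_00002", "pred_label": 7, "length": None},
    ],
    fields=["image_id", "pred_label", "length"],
)
# data_string == "img_00001,3,12.5\nimg_00002,7,-9999\n"
```

Keeping the column order tied to a single `fields` list means the same list can be reused for the /fields line of the header, so the two cannot drift apart.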
write_seabass(self, sample_filename, sample_df):
Combines the header and data into a single, correctly formatted string.
Parameters:
- sample_filename (str): Name of the sample csv file.
- sample_df (DataFrame): The sample-specific dataframe stored in the sample folder.
Returns:
- full_string (str): The full SeaBASS-structured string containing header and formatted data sections.