GitLab system status is available here and here

Commit facfbc21 authored by Patrick Christopher Daniel's avatar Patrick Christopher Daniel
Browse files

Added new notebook for biovolume

parent a1face11
*.mat
data/class/*
data/total_biovol/
.DS_Store
# Created by https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
# Edit at https://www.toptal.com/developers/gitignore?templates=python,jupyternotebooks
......
......@@ -4,12 +4,26 @@ Written by P. Daniel on 2022/03/29
## Requirements ##
Pandas, scipy, and glob are all needed to find and read the annual classified data. The script outputs a daily mean, daily median, and daily standard deviation of abundance for each class.
## Running ##
## Generating Daily Abundance, Biovolume and Total C ##
Using the syringe level data, a dataframe is made from the class .mat files for each syringe with the class, probability, and raw class (if the probability is below the class and changed to unclassified).
Next, biovolume is extracted from the features files for each syringe and added to the dataframe.
## DEPRECIATED: Generating Daily Abundance from Summary Data ##
We are not able to calculate biovolume using the summary data, so instead, we need to use the class and features files for each syringe. The features data are pretty large (~100+gb for all data since 2015).
__ALSO__, the classified data ignores multi-blob images. The features are extracted, but the autoclassifier doesn't seem to be running them
### Running ###
Make sure the .mat files are in the `./data/v1_27Aug2019/` directory. Then the script can be run using:
`python main.py`
`python calc_daily_summary.py.py`
## About Code ##
### About Code ###
Annual data output from the autoclassifier are stored in .mat (binary matlab) files. These can be accessed using the `scipy.io.load` function.
[x] Load .mat data
......
import pandas as pd
import scipy.io as sio
import glob
def calc_datetime(matlab_timestamp):
"""
Calculate datetime from the matlab timestamp. matlab serial date is represented as the number of days since Jan 1 0000
To get to a unix epoc, subtract 719529 (ie the number of days between Jan 1 0000 and Jan 1, 1970).
https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python
"""
return pd.to_datetime(matlab_timestamp.flatten() - 719529, unit='D')
def load_autoclass_data(fname):
"""
Load the summary .mat data and format into a pandas dataframe.
"""
data = sio.loadmat(fname)
timestamps = calc_datetime(data['mdateTB']) # https://stackoverflow.com/questions/13965740/converting-matlabs-datenum-format-to-python
classes = [c[0][0] for c in data['class2useTB']]
df = pd.DataFrame(data=data['classcountTB_above_optthresh'], columns=classes)
df['dateTime'] = timestamps
df['vol_analyzed'] = data['ml_analyzedTB']
df['fnames'] = data['filelistTB']
return df
def make_dataframe():
"""
Create a dataframe and appened it for each .mat file (year)
"""
fnames = sorted(glob.glob('./data/v1_27Aug2019/*.mat'))
df = pd.DataFrame()
for fname in fnames:
print(f"Opening {fname}")
df = df.append(load_autoclass_data(fname))
df.index = df['dateTime']
classes = ['Akashiwo', 'Alexandrium_singlet', 'Amy_Gony_Protoc',
'Asterionellopsis', 'Centric', 'Ceratium', 'Chaetoceros',
'Cochlodinium', 'Cryptophyte,NanoP_less10,small_misc', 'Cyl_Nitz',
'Det_Cer_Lau', 'Dictyocha', 'Dinophysis', 'Eucampia', 'Guin_Dact',
'Gymnodinium,Peridinium', 'Lingulodinium', 'Pennate', 'Prorocentrum',
'Pseudo-nitzschia', 'Scrip_Het', 'Skeletonema', 'Thalassionema',
'Thalassiosira', 'unclassified']
df_norm = df[classes].divide(df['vol_analyzed'], axis=0)
return df_norm
def save_daily_values(df_norm):
"""Save daily mean, median, and standard deviation.
Args:
df_norm (pd.Datafrane): Pandas dataframe with cell abundance for each of the classes.
"""
daily_mean = df_norm.resample('1D').mean()
daily_mean.to_csv("./data/daily_values/ifcb104-daily-mean.csv")
print("Saving Daily Mean")
daily_med = df_norm.resample('1D').median()
daily_med.to_csv("./data/daily_values/ifcb104-daily-median.csv")
print("Saving Daily Median")
daily_std = df_norm.resample('1D').std()
daily_std.to_csv("./data/daily_values/ifcb104-daily-std.csv")
print(f"Saving Daily SD")
if __name__ == "__main__":
df = make_dataframe()
save_daily_values(df)
%% Cell type:code id:2887f277-076e-4864-b1a9-6c5e5fe226f3 tags:
``` python
import pandas as pd
import glob
import scipy.io as sio
import os.path
import numpy as np
from tqdm import tqdm
```
%% Cell type:code id:ba59eec4-876c-4f58-97af-5f533e4074b9 tags:
``` python
files = sorted(glob.glob("../data/class/class2016_v1/*.mat"))
```
%% Cell type:code id:a190fc33-dd3f-4a1b-84c5-2a2a322d14a5 tags:
``` python
def same_class(df):
if df['class'] == df['class_raw']:
return "same"
else:
return df['class_raw']
def build_df(fname: str) -> pd.DataFrame:
data = sio.loadmat(fname)
df = pd.DataFrame()
df['roi'] = data['roinum'].flatten()
t = data['TBclass_above_threshold'].flatten()
df['class'] = [ts[0] for ts in t]
t = data['TBclass'].flatten()
df['class_raw'] = [ts[0] for ts in t]
df['class_raw'] = df.apply(same_class,axis=1)
df['top_prob'] = data['TBscores'].max(axis=1).round(2)
df['fname'] = os.path.basename(fname).split(".")[0].split("_")[0]
df['biovolume'] = np.nan
return df
df = pd.DataFrame()
for i, file in tqdm(enumerate(files), total=len(files)):
df = df.append(build_df(file),ignore_index=True)
```
%% Output
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4376/4376 [03:42<00:00, 19.68it/s]
%% Cell type:markdown id:32b7c5a0-c773-4b26-9323-aee17d39a6ac tags:
__Load and Map Feature Data__
%% Cell type:code id:baf91f1e-13df-41ae-ac4c-5ae5b94978cb tags:
``` python
ffiles = sorted(glob.glob("/Volumes/Extreme SSD/2016/*.csv"))
```
%% Cell type:code id:6c3e7e6d-46d9-4053-95ff-d30de9f17fe4 tags:
``` python
for i, file in tqdm(enumerate(ffiles), total=len(ffiles)):
feat_df = pd.read_csv(file,usecols=[0,2],names=['roi',"biovolume"],skiprows=[0])
feat_df.index = feat_df['roi']
syringe = os.path.basename(file).split("_")[0]
dff = df[df['fname'].str.contains(syringe)]
dff['biovolume'] = dff['roi'].map(feat_df['biovolume'].round(2))
df[df['fname'].str.contains(syringe)] = dff
df.to_csv("../data/total_biovol/IFCB104-2016-biovolume.csv",index=False)
```
%% Output
0%| | 0/4376 [00:00<?, ?it/s]/var/folders/wg/5wjdpvd518dft0bjkj1t1s1h0000gn/T/ipykernel_3994/1055847683.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
dff['biovolume'] = dff['roi'].map(feat_df['biovolume'].round(2))
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 4376/4376 [1:31:56<00:00, 1.26s/it]
%% Cell type:code id:56fdfd0c-6be7-4316-892e-a300f7f4d4a2 tags:
``` python
df
```
%% Output
roi class class_raw top_prob \
0 2 Dictyocha same 0.50
1 3 Dictyocha same 0.46
2 4 Dictyocha same 0.37
3 5 Chaetoceros same 0.66
4 6 Centric same 0.89
... ... ... ... ...
1684544 797 Cryptophyte,NanoP_less10,small_misc same 0.99
1684545 798 Cryptophyte,NanoP_less10,small_misc same 0.99
1684546 799 Cryptophyte,NanoP_less10,small_misc same 1.00
1684547 802 Cryptophyte,NanoP_less10,small_misc same 0.97
1684548 803 Cryptophyte,NanoP_less10,small_misc same 0.81
fname biovolume
0 D20160107T032342 171007.10
1 D20160107T032342 94817.19
2 D20160107T032342 261001.95
3 D20160107T032342 35541.92
4 D20160107T032342 31590.28
... ... ...
1684544 D20161202T181247 4977.85
1684545 D20161202T181247 4973.14
1684546 D20161202T181247 5238.61
1684547 D20161202T181247 8428.89
1684548 D20161202T181247 0.00
[1684549 rows x 6 columns]
%% Cell type:code id:6ac1ac7e-5235-4206-ba4d-191fedb8f38d tags:
``` python
```
%% Cell type:code id:e5ce8ce4-70c5-4fa0-b206-12d7cf80a711 tags:
``` python
```
%% Cell type:code id:cf849469-6206-4504-8d74-ad089f2317b7 tags:
``` python
df.shape
```
%% Output
(1684549, 6)
%% Cell type:code id:6f6a3d66-b079-41fe-a5cd-1ce0289efd0e tags:
``` python
```
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment