socube.data package

Submodules

socube.data.loading module

class socube.data.loading.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')

Bases: torch.utils.data.dataset.Dataset

Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.

Parameters
  • labels (pd.DataFrame) – Dataframe containing labels for each sample.

  • shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • seed (int, default None) – Random seed for shuffling.

  • k (int, default 5) – Number of folds for k-fold cross-validation.

  • task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

property kFold

Get a generator over the k-fold cross-validation splits.

Returns

kFold – A generator over the k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects, one for training and one for validation.

Return type

generator
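The splitting behind kFold can be pictured with a minimal sketch. It is not SoCube's implementation: the function name stratified_k_fold and the round-robin fold assignment are assumptions made for illustration. The sketch shuffles each class's samples once (when shuffle is True), deals them into k folds, and yields one (train, valid) pair of index lists per fold.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[int],
    k: int = 5,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, valid_indices) for each of the k folds.

    Hypothetical helper: samples are grouped by class, optionally shuffled
    within each class, then dealt round-robin into k folds so that every
    fold keeps roughly the same class balance.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        if shuffle:
            rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for fold_id in range(k):
        valid = folds[fold_id]
        train = [i for j, fold in enumerate(folds) if j != fold_id for i in fold]
        yield train, valid


labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
splits = list(stratified_k_fold(labels, k=2, shuffle=True, seed=0))
for train, valid in splits:
    print(sorted(train), sorted(valid))
```

In a PyTorch setting, each index list would typically be wrapped as torch.utils.data.Subset(dataset, indices) before being handed to a sampler.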

abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler

Abstract method for sampling a subset of this dataset.

Parameters

subset (Subset) – A subset of this dataset.

Returns

sampler – A sampler for the subset.

Return type

Sampler

class socube.data.loading.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)

Bases: socube.data.loading.DatasetBase

Basic dataset designed for CNNs.

Parameters
  • data_dir (str) – Path to the directory containing dataset.

  • labels (pd.DataFrame) – Dataframe containing labels for each sample.

  • transform (torch.nn.Module, default None) – Transform to apply to each sample.

  • shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.

  • seed (int, default None) – Random seed for shuffling.

  • k (int, default 5) – Number of folds for k-fold cross-validation.

  • task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

  • use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.
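The effect of use_index on file lookup can be sketched as follows. The helper name sample_path and the idea of passing sample names as a plain list are illustrative assumptions, not taken from SoCube's source.

```python
import os
from typing import Sequence


def sample_path(data_dir: str, names: Sequence[str], i: int, use_index: bool = True) -> str:
    """Return the .npy path for the i-th sample (hypothetical helper)."""
    stem = str(i) if use_index else names[i]
    return os.path.join(data_dir, f"{stem}.npy")


names = ["cell_a", "cell_b"]
print(sample_path("data", names, 1))                   # numeric file name
print(sample_path("data", names, 1, use_index=False))  # name-based file name
```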

socube.data.preprocess module

socube.data.preprocess.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame

Data summary for each column or row.

Parameters
  • data (dataframe) – A dataframe with rows and columns.

  • axis (int, default 1) – 0 for a per-column summary, 1 for a per-row summary.

Returns

A dataframe with a summary for each column or row.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)

socube.data.preprocess.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame

Remove genes and cells with low variation according to the given proportions, remove genes whose average expression is less than mini_expr, and remove cells whose library size is less than mini_library_size.

Parameters
  • data (dataframe) – A dataframe in which each row is a gene and each column is a cell.

  • filtered_gene_prop (float, default 0.05) – Proportion of genes with low variation to remove.

  • filtered_cell_prop (float, default 0.05) – Proportion of cells with low variation to remove.

  • mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr.

  • mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size.

Returns

A dataframe with filtered genes and cells.

Return type

pandas.core.frame.DataFrame

socube.data.preprocess.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame

Perform min-max normalization.

Parameters
  • data (dataframe) – A dataframe in which each row is a sample and each column is a feature.

  • range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data. Data is normalized to 0~1 by default.

  • flag (int, default 0) – 0 for min-max by column, greater than 0 for min-max by row, less than 0 for global min-max.

  • dtype (str, default "float32") – The data type of the normalized data

Returns

A dataframe with normalized data.

Return type

pandas.core.frame.DataFrame
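How flag selects the normalization axis can be illustrated with a small NumPy sketch. It is a reimplementation for illustration under stated assumptions, not SoCube's code: the name minmax_sketch is hypothetical, range is taken to be (min, max), and constant rows or columns are mapped to the lower bound to avoid division by zero.

```python
import numpy as np


def minmax_sketch(values: np.ndarray, value_range=(0.0, 1.0), flag: int = 0) -> np.ndarray:
    """Rescale values into value_range by column, by row, or globally."""
    low, high = value_range
    if flag == 0:    # per column
        axis, keepdims = 0, True
    elif flag > 0:   # per row
        axis, keepdims = 1, True
    else:            # global
        axis, keepdims = None, False
    vmin = values.min(axis=axis, keepdims=keepdims)
    vmax = values.max(axis=axis, keepdims=keepdims)
    span = np.where(vmax > vmin, vmax - vmin, 1.0)  # guard against a zero span
    return ((values - vmin) / span * (high - low) + low).astype("float32")


data = np.array([[1.0, 10.0], [3.0, 30.0]])
by_column = minmax_sketch(data, flag=0)
by_row = minmax_sketch(data, flag=1)
overall = minmax_sketch(data, flag=-1)
print(by_column)
print(by_row)
print(overall)
```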

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)

socube.data.preprocess.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame

Standardization of data

Parameters
  • data (dataframe) – A dataframe in which each row is a sample and each column is a feature.

  • horizontal (bool, default False) – If True, perform standardization horizontally.

  • dtype (str, default "float32") – The data type of the standardized data.

  • global_minmax (bool, default False) – If True, perform global standardization; otherwise standardize by row or column.

Returns

A dataframe with standardized data.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)

socube.data.preprocess.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor

Calculate the cosine distance matrix between the two sets of samples.

Parameters
  • x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)

  • x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1

  • device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns

A tensor holding the cosine distance matrix, with shape (n1, n2).

Return type

torch.Tensor
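For reference, a cosine distance matrix can be computed as one minus the cosine similarity of every pair of rows. The NumPy sketch below assumes that definition, since the source does not state which convention SoCube uses. The function name and the eps guard are illustrative.

```python
import numpy as np


def cosine_distance_sketch(x1: np.ndarray, x2: np.ndarray = None, eps: float = 1e-12) -> np.ndarray:
    """Return the (n1, n2) matrix of 1 - cosine similarity between rows."""
    if x2 is None:
        x2 = x1
    # Normalize each row to unit length; eps avoids division by zero.
    u1 = x1 / np.maximum(np.linalg.norm(x1, axis=1, keepdims=True), eps)
    u2 = x2 / np.maximum(np.linalg.norm(x2, axis=1, keepdims=True), eps)
    return 1.0 - u1 @ u2.T


x1 = np.array([[1.0, 0.0], [0.0, 2.0]])
x2 = np.array([[3.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])
dist = cosine_distance_sketch(x1, x2)
print(dist.shape)
print(np.round(dist, 3))
```

Under this definition, parallel vectors have distance 0, orthogonal vectors have distance 1, and opposite vectors have distance 2.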

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)

socube.data.preprocess.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor

Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm.

Parameters
  • scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)

  • transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). Greater than 0 means that the corresponding coordinates do not change direction, less than 0 means that the corresponding coordinates are reversed. For scatter and grid plots, the y-axis directions are often opposite, and for visual consistency, the y-axis needs to be transformed

  • device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns

A tensor of grid coordinates, with shape (n, 2).

Return type

torch.Tensor
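The idea of assigning scattered points to grid cells by linear assignment can be sketched with SciPy, whose linear_sum_assignment solver is documented as a modified Jonker-Volgenant algorithm. Everything else here is an assumption made for illustration rather than SoCube's implementation: the function name, the square grid of side ceil(sqrt(n)), rescaling both point sets to the unit square, the squared Euclidean cost, and the way transform flips an axis.

```python
import math

import numpy as np
from scipy.optimize import linear_sum_assignment


def scatter_to_grid_sketch(scatters2d: np.ndarray, transform=(1, -1)) -> np.ndarray:
    """Assign each 2D point to a distinct integer grid cell (x, y)."""
    n = scatters2d.shape[0]
    side = math.ceil(math.sqrt(n))

    # Flip axes whose transform entry is negative, then rescale to [0, 1].
    points = scatters2d * np.sign(np.asarray(transform, dtype=float))
    span = np.ptp(points, axis=0)
    points = (points - points.min(axis=0)) / np.where(span > 0, span, 1.0)

    # Candidate cells of a side x side grid, also rescaled to [0, 1].
    cells = np.array([(x, y) for y in range(side) for x in range(side)])
    scaled_cells = cells / max(side - 1, 1)

    # Cost of putting point i into cell j: squared Euclidean distance.
    cost = ((points[:, None, :] - scaled_cells[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)  # rectangular costs are allowed
    grid = np.empty((n, 2), dtype=int)
    grid[rows] = cells[cols]
    return grid


rng = np.random.default_rng(0)
grid = scatter_to_grid_sketch(rng.random((10, 2)))
print(grid)
```

Because the assignment is one-to-one, no two points share a cell, and the total displacement between scatter and grid positions is minimized under the chosen cost.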

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)

socube.data.preprocess.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame

Reducing high-dimensional data to 2D using UMAP

Parameters
  • data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data

  • metric (str, default 'correlation') – The metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.

  • neighbors (int, default 5) – the number of neighbors used for UMAP.

  • seed (int, default None) – the random seed used for UMAP.

Returns

A dataframe of two-dimensional data, with shape (n, 2).

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)

socube.data.preprocess.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame

Reducing high-dimensional data to 2D using t-SNE

Parameters
  • data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data

  • metric (str, default 'correlation') – The metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.

  • seed (int, default None) – the random seed used for t-SNE.

Returns

A dataframe of two-dimensional data, with shape (n, 2).

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)

socube.data.preprocess.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame

Converts a one-dimensional vector to a two-dimensional grid

Parameters
  • vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples

  • shuffle (bool, default False) – whether to shuffle the vector

  • seed (int, default None) – the random seed used for shuffling the vector

Returns

A dataframe of two-dimensional data, with shape (n, 2). Each row represents a grid point with horizontal (x) and vertical (y) coordinates.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)

socube.data.preprocess.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray

Convert a 1D multi-label vector (each element is a sample’s label) to a one-hot matrix. Each label should be an integer.

Parameters
  • label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples

  • class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.

Returns

A one-hot matrix with shape (n, class_nums).

Return type

numpy.ndarray
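The encoding itself amounts to indexing an identity matrix. The sketch below is a hypothetical NumPy equivalent, not SoCube's source. In particular, it assumes that an omitted class_nums is inferred as the largest label plus one, which agrees with the five-column result documented for labels [1, 2, 4].

```python
import numpy as np


def onehot_sketch(label: np.ndarray, class_nums: int = None) -> np.ndarray:
    """Return an (n, class_nums) one-hot matrix for integer labels."""
    if class_nums is None:
        class_nums = int(label.max()) + 1  # assumed inference rule
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(class_nums, dtype=np.int32)[label]


encoded = onehot_sketch(np.array([1, 2, 4]), 6)
print(encoded)
```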

Examples

>>> import numpy as np
>>> from socube.data.preprocess import onehot
>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]], dtype=int32)
>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)

socube.data.preprocess.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Convert a dataframe to a dataframe with three columns: row, col and val.

Parameters

data (pd.DataFrame) – The dataframe to convert.

Returns

A dataframe with three columns: row, col and val.

Return type

pandas.core.frame.DataFrame
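The wide-to-long conversion described here can be written as an explicit loop over cells. The pandas sketch below is an approximation under assumptions: the function name is hypothetical, and the row-major ordering of the output is a choice made here, not something the source specifies.

```python
import pandas as pd


def items_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Return one (row, col, val) record per cell of data."""
    records = [
        (row, col, data.at[row, col])
        for row in data.index
        for col in data.columns
    ]
    return pd.DataFrame(records, columns=["row", "col", "val"])


wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["g1", "g2"], columns=["c1", "c2"])
long = items_sketch(wide)
print(long)
```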

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)

socube.data.visualize module

socube.data.visualize.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str

Calculation of heat map color values based on intensity values

Parameters
  • intensity (float value) – color intensity values between 0 and 1

  • topColor (hexadecimal color string) – Color value when intensity value is 1

  • bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white

Returns

The hexadecimal color string corresponding to the intensity.

Return type

str
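The color calculation is consistent with linear interpolation between bottomColor and topColor in RGB space, which reproduces the documented result '#ff7f7f' for an intensity of 0.5. The pure-Python sketch below assumes that interpolation and truncation to integers. The helper names are hypothetical and the code is not taken from SoCube.

```python
from typing import Tuple


def hex_to_rgb(hex_color: str) -> Tuple[int, int, int]:
    """'#ff0000' -> (255, 0, 0)."""
    value = hex_color.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(color: Tuple[int, int, int]) -> str:
    """(255, 0, 0) -> '#ff0000'."""
    return "#" + "".join(f"{channel:02x}" for channel in color)


def heat_color_sketch(intensity: float, top: str, bottom: str = "#ffffff") -> str:
    """Linearly blend bottom (intensity 0) into top (intensity 1)."""
    blended = tuple(
        int(b + (t - b) * intensity)
        for t, b in zip(hex_to_rgb(top), hex_to_rgb(bottom))
    )
    return rgb_to_hex(blended)


print(heat_color_sketch(0.5, "#ff0000"))  # '#ff7f7f'
```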

Examples

>>> from socube.data.visualize import getHeatColor
>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'

socube.data.visualize.convertHexToRGB(hex_color: str) → Tuple[int]

Convert hexadecimal color strings to RGB tri-color integer tuples

Parameters

hex_color (str) – Hexadecimal color string, such as '#ff0000'.

Returns

RGB tri-color integer tuple, such as (255, 0, 0).

Return type

Tuple[int]

Examples

>>> from socube.data.visualize import convertHexToRGB
>>> convertHexToRGB('#ff0000')
(255, 0, 0)

socube.data.visualize.convertRGBToHex(color: Tuple[int]) → str

Convert RGB tricolor integer tuple to hexadecimal color string

Parameters

color (Tuple[int]) – RGB tri-color integer tuple, such as (255, 0, 0).

Returns

Hexadecimal color string, such as ‘#ff0000’.

Return type

str

Examples

>>> from socube.data.visualize import convertRGBToHex
>>> convertRGBToHex((255, 0, 0))
'#ff0000'

socube.data.visualize.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)

Draw a SoCube scatter plot.

Parameters
  • data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value and the color is calculated from that intensity.

  • colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.

  • title (str) – The title of the plot.

  • subtitle (str) – The subtitle of the plot.

  • filename (str) – The filename of the plot. If None, the plot will not be saved. The output format is HTML; the file extension is added automatically, so do not include it.

  • width (int) – The width of the plot, in pixels.

  • height (int) – The height of the plot, in pixels.

  • radius (int) – The radius of each scatter point, in pixels.

Returns

The plot object.

Examples

>>> import pandas as pd
>>> from socube.data.visualize import plotScatter
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')

socube.data.visualize.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart

Draw a SoCube grid plot.

Parameters
  • data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value and the color is calculated from that intensity.

  • colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.

  • shape (Tuple[int]) – The shape of the grid, (row, col).

  • title (str) – The title of the plot.

  • subtitle (str) – The subtitle of the plot.

  • filename (str) – The filename of the plot. If None, the plot will not be saved. The output format is HTML; the file extension is added automatically, so do not include it.

  • width (int) – The width of the plot, in pixels.

  • height (int) – The height of the plot, in pixels.

Returns

The plot object.

Return type

highcharts.highcharts.highcharts.Highchart

Examples

>>> import pandas as pd
>>> from socube.data.visualize import plotGrid
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')

socube.data.visualize.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)

Plot an AUC curve.

Parameters
  • data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with the name of each curve as key and its (x, y) data as value.

  • title (str) – The title of the plot.

  • xlabel (str) – The x-axis label of the plot.

  • ylabel (str) – The y-axis label of the plot.

  • file (str) – The filename of the plot. If None, the plot will not be saved.

  • slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.

Examples

>>> from socube.data.visualize import plotAUC
>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')

Module contents

socube.data.summary(data: pandas.core.frame.DataFrame, axis: int = 1) → pandas.core.frame.DataFrame

Data summary for each column or row.

Parameters
  • data (dataframe) – A dataframe with rows and columns.

  • axis (int, default 1) – 0 for a per-column summary, 1 for a per-row summary.

Returns

A dataframe with a summary for each column or row.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.summary(data)

socube.data.filterData(data: pandas.core.frame.DataFrame, filtered_gene_prop: float = 0.05, filtered_cell_prop: float = 0.05, mini_expr: float = 0.05, mini_library_size: int = 1000) → pandas.core.frame.DataFrame

Remove genes and cells with low variation according to the given proportions, remove genes whose average expression is less than mini_expr, and remove cells whose library size is less than mini_library_size.

Parameters
  • data (dataframe) – A dataframe in which each row is a gene and each column is a cell.

  • filtered_gene_prop (float, default 0.05) – Proportion of genes with low variation to remove.

  • filtered_cell_prop (float, default 0.05) – Proportion of cells with low variation to remove.

  • mini_expr (float, default 0.05) – Remove genes whose average expression is less than mini_expr.

  • mini_library_size (int, default 1000) – Remove cells whose library size is less than mini_library_size.

Returns

A dataframe with filtered genes and cells.

Return type

pandas.core.frame.DataFrame

socube.data.minmax(data: pandas.core.frame.DataFrame, range: Tuple[int] = (0, 1), flag: int = 0, dtype: str = 'float32') → pandas.core.frame.DataFrame

Perform min-max normalization.

Parameters
  • data (dataframe) – A dataframe in which each row is a sample and each column is a feature.

  • range (tuple, default (0, 1)) – The minimum and maximum values of the normalized data. Data is normalized to 0~1 by default.

  • flag (int, default 0) – 0 for min-max by column, greater than 0 for min-max by row, less than 0 for global min-max.

  • dtype (str, default "float32") – The data type of the normalized data

Returns

A dataframe with normalized data.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.minmax(data)

socube.data.std(data: pandas.core.frame.DataFrame, horizontal: bool = False, dtype: str = 'float32', global_minmax: bool = False) → pandas.core.frame.DataFrame

Standardization of data

Parameters
  • data (dataframe) – A dataframe in which each row is a sample and each column is a feature.

  • horizontal (bool, default False) – If True, perform standardization horizontally.

  • dtype (str, default "float32") – The data type of the standardized data.

  • global_minmax (bool, default False) – If True, perform global standardization; otherwise standardize by row or column.

Returns

A dataframe with standardized data.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.std(data)

socube.data.cosineDistanceMatrix(x1: torch.Tensor, x2: torch.Tensor = None, device_name: str = 'cpu') → torch.Tensor

Calculate the cosine distance matrix between the two sets of samples.

Parameters
  • x1 (torch.Tensor) – a tensor of samples, with shape (n1, d)

  • x2 (torch.Tensor, default None) – a tensor of samples, with shape (n2, d), if None, x2 = x1

  • device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns

A tensor holding the cosine distance matrix, with shape (n1, n2).

Return type

torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> x1 = torch.rand(10, 10)
>>> x2 = torch.rand(10, 10)
>>> pre.cosineDistanceMatrix(x1, x2)

socube.data.scatterToGrid(scatters2d: torch.Tensor, transform: Tuple[int] = (1, -1), device_name: str = 'cpu') → torch.Tensor

Map scattered coordinates to grid coordinates using the Jonker-Volgenant (J-V) linear assignment algorithm.

Parameters
  • scatters2d (torch.Tensor) – a tensor of scattered coordinates, with shape (n, 2)

  • transform (tuple, default (1, -1)) – the transformation of the coordinates, such as (1, -1). Greater than 0 means that the corresponding coordinates do not change direction, less than 0 means that the corresponding coordinates are reversed. For scatter and grid plots, the y-axis directions are often opposite, and for visual consistency, the y-axis needs to be transformed

  • device_name (str, default "cpu") – the device used for accelerating the calculation such as “cpu”, “cuda:0”, “cuda:1”, etc.

Returns

A tensor of grid coordinates, with shape (n, 2).

Return type

torch.Tensor

Examples

>>> import torch
>>> import socube.data.preprocess as pre
>>> scatters2d = torch.rand(10, 2)
>>> pre.scatterToGrid(scatters2d)

socube.data.umap2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', neighbors: int = 5, seed: int = None) → pandas.core.frame.DataFrame

Reducing high-dimensional data to 2D using UMAP

Parameters
  • data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data

  • metric (str, default 'correlation') – The metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.

  • neighbors (int, default 5) – the number of neighbors used for UMAP.

  • seed (int, default None) – the random seed used for UMAP.

Returns

A dataframe of two-dimensional data, with shape (n, 2).

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.umap2D(data)

socube.data.tsne2D(data: pandas.core.frame.DataFrame, metric: str = 'correlation', seed: int = None) → pandas.core.frame.DataFrame

Reducing high-dimensional data to 2D using t-SNE

Parameters
  • data (pd.DataFrame) – a dataframe of high-dimensional data, with shape (n, d). n is the number of samples, d is the dimension of the data

  • metric (str, default 'correlation') – The metric used to calculate the distance between samples, such as ‘correlation’, ‘euclidean’, ‘manhattan’, etc.

  • seed (int, default None) – the random seed used for t-SNE.

Returns

A dataframe of two-dimensional data, with shape (n, 2).

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.tsne2D(data)

socube.data.vec2Grid(vector: numpy.ndarray, shuffle: bool = False, seed: int = None) → pandas.core.frame.DataFrame

Converts a one-dimensional vector to a two-dimensional grid

Parameters
  • vector (np.ndarray) – a one-dimensional vector, with shape (n,). n is the number of samples

  • shuffle (bool, default False) – whether to shuffle the vector

  • seed (int, default None) – the random seed used for shuffling the vector

Returns

A dataframe of two-dimensional data, with shape (n, 2). Each row represents a grid point with horizontal (x) and vertical (y) coordinates.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import socube.data.preprocess as pre
>>> vector = np.random.rand(10)
>>> pre.vec2Grid(vector)

socube.data.onehot(label: numpy.ndarray, class_nums: int = None) → numpy.ndarray

Convert a 1D multi-label vector (each element is a sample’s label) to a one-hot matrix. Each label should be an integer.

Parameters
  • label (np.ndarray) – a one-dimensional integer vector, with shape (n,). n is the number of samples

  • class_nums (int, default None) – the number of classes. If None, the number of classes is automatically determined.

Returns

A one-hot matrix with shape (n, class_nums).

Return type

numpy.ndarray

Examples

>>> import numpy as np
>>> from socube.data import onehot
>>> onehot(np.array([1,2,4]))
array([[0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 1]], dtype=int32)
>>> onehot(np.array([1,2,4]), 6)
array([[0, 1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0]], dtype=int32)

socube.data.items(data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame

Convert a dataframe to a dataframe with three columns: row, col and val.

Parameters

data (pd.DataFrame) – The dataframe to convert.

Returns

A dataframe with three columns: row, col and val.

Return type

pandas.core.frame.DataFrame

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import socube.data.preprocess as pre
>>> data = pd.DataFrame(np.random.rand(10, 10))
>>> pre.items(data)

class socube.data.DatasetBase(labels: pandas.core.frame.DataFrame, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify')

Bases: torch.utils.data.dataset.Dataset

Abstract base class for datasets. All SoCube extended datasets must inherit and implement its abstract interface.

Parameters
  • labels (pd.DataFrame) – Dataframe containing labels for each sample.

  • shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches. Note that the samples within each split will not be shuffled.

  • seed (int, default None) – Random seed for shuffling.

  • k (int, default 5) – Number of folds for k-fold cross-validation.

  • task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

property kFold

Get a generator over the k-fold cross-validation splits.

Returns

kFold – A generator over the k-fold cross-validation splits. Each iteration yields a tuple of two Subset objects, one for training and one for validation.

Return type

generator

abstract sampler(subset: torch.utils.data.dataset.Subset) → torch.utils.data.sampler.Sampler

Abstract method for sampling a subset of this dataset.

Parameters

subset (Subset) – A subset of this dataset.

Returns

sampler – A sampler for the subset.

Return type

Sampler

class socube.data.ConvDatasetBase(data_dir: str, labels: pandas.core.frame.DataFrame, transform: torch.nn.modules.module.Module = None, shuffle: bool = False, seed: int = None, k: int = 5, task_type: str = 'classify', use_index: bool = True)

Bases: socube.data.loading.DatasetBase

Basic dataset designed for CNNs.

Parameters
  • data_dir (str) – Path to the directory containing dataset.

  • labels (pd.DataFrame) – Dataframe containing labels for each sample.

  • transform (torch.nn.Module, default None) – Transform to apply to each sample.

  • shuffle (bool, default False) – Whether to shuffle each class’s samples before splitting into batches.

  • seed (int, default None) – Random seed for shuffling.

  • k (int, default 5) – Number of folds for k-fold cross-validation.

  • task_type (str, default "classify") – Type of task. Must be one of “classify”, “regress”.

  • use_index (bool, default True) – Whether to use the numeric index as the sample file name, such as “0.npy”. If False, the sample name in the labels is used as the file name instead, such as “sample_name.npy”.

socube.data.getHeatColor(intensity: float, topColor: str, bottomColor: str = '#ffffff') → str

Calculation of heat map color values based on intensity values

Parameters
  • intensity (float value) – color intensity values between 0 and 1

  • topColor (hexadecimal color string) – Color value when intensity value is 1

  • bottomColor (hexadecimal color string) – Color value when intensity value is 0, default is white

Returns

The hexadecimal color string corresponding to the intensity.

Return type

str

Examples

>>> from socube.data import getHeatColor
>>> getHeatColor(0.5, "#ff0000")
'#ff7f7f'

socube.data.convertHexToRGB(hex_color: str) → Tuple[int]

Convert hexadecimal color strings to RGB tri-color integer tuples

Parameters

hex_color (str) – Hexadecimal color string, such as '#ff0000'.

Returns

RGB tri-color integer tuple, such as (255, 0, 0).

Return type

Tuple[int]

Examples

>>> from socube.data import convertHexToRGB
>>> convertHexToRGB('#ff0000')
(255, 0, 0)

socube.data.convertRGBToHex(color: Tuple[int]) → str

Convert RGB tricolor integer tuple to hexadecimal color string

Parameters

color (Tuple[int]) – RGB tri-color integer tuple, such as (255, 0, 0).

Returns

Hexadecimal color string, such as ‘#ff0000’.

Return type

str

Examples

>>> from socube.data import convertRGBToHex
>>> convertRGBToHex((255, 0, 0))
'#ff0000'

socube.data.plotScatter(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], title: str, subtitle: str, filename: str = None, scatter_symbol: str = 'circle', width: int = 1000, height: int = 850, radius: int = 3, x_min: Optional[int] = None, y_min: Optional[int] = None, x_max: Optional[int] = None, y_max: Optional[int] = None, x_title: Optional[str] = None, y_title: Optional[str] = None)

Draw a SoCube scatter plot.

Parameters
  • data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value and the color is calculated from that intensity.

  • colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.

  • title (str) – The title of the plot.

  • subtitle (str) – The subtitle of the plot.

  • filename (str) – The filename of the plot. If None, the plot will not be saved. The output format is HTML; the file extension is added automatically, so do not include it.

  • width (int) – The width of the plot, in pixels.

  • height (int) – The height of the plot, in pixels.

  • radius (int) – The radius of each scatter point, in pixels.

Returns

The plot object.

Examples

>>> import pandas as pd
>>> from socube.data import plotScatter
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotScatter(data2d, colormap, 'title', 'subtitle', 'test')

socube.data.plotGrid(data2d: pandas.core.frame.DataFrame, colormap: Dict[str, str], shape: Tuple[int], title: str, subtitle: str, filename: str, width: int = 1000, height: int = 850) → highcharts.highcharts.highcharts.Highchart

Draw a SoCube grid plot.

Parameters
  • data2d (pandas.DataFrame) – The data to be plotted, with columns x, y, label and subtype. If subtype is a float, it is regarded as an intensity value and the color is calculated from that intensity.

  • colormap (Dict[str, str]) – The color map for the subtype, with the subtype name as key and a hexadecimal color string as value. If the subtype is a float, the colormap’s keys should be ‘0’ and ‘1’: ‘0’ for the low-intensity color and ‘1’ for the high-intensity color.

  • shape (Tuple[int]) – The shape of the grid, (row, col).

  • title (str) – The title of the plot.

  • subtitle (str) – The subtitle of the plot.

  • filename (str) – The filename of the plot. If None, the plot will not be saved. The output format is HTML; the file extension is added automatically, so do not include it.

  • width (int) – The width of the plot, in pixels.

  • height (int) – The height of the plot, in pixels.

Returns

The plot object.

Return type

highcharts.highcharts.highcharts.Highchart

Examples

>>> import pandas as pd
>>> from socube.data import plotGrid
>>> data2d = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'label': ['a', 'b', 'c'], 'subtype': [0.5, 0.7, 0.9]})
>>> colormap = {'0': '#ff0000', '1': '#00ff00'}
>>> plotGrid(data2d, colormap, (6, 3), 'title', 'subtitle', 'test')

socube.data.plotAUC(data: Dict[str, Tuple[Sequence]], title: str, xlabel: str, ylabel: str, file: Optional[str] = None, slash: int = 0)

Plot an AUC curve.

Parameters
  • data (Dict[str, Tuple[Sequence]]) – The data to be plotted, with the name of each curve as key and its (x, y) data as value.

  • title (str) – The title of the plot.

  • xlabel (str) – The x-axis label of the plot.

  • ylabel (str) – The y-axis label of the plot.

  • file (str) – The filename of the plot. If None, the plot will not be saved.

  • slash (int) – If positive, a forward slash is plotted; if negative, a backslash is plotted.

Examples

>>> from socube.data import plotAUC
>>> data = {'AUC': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]), 'AUC2': ([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])}
>>> plotAUC(data, 'AUC', 'xlabel', 'ylabel', 'auc.png')