{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset preparation\n",
    "\n",
    "This tutorial offers a step-by-step guide on how to prepare your single-cell datasets for training and for subsequent analysis using the UNAGI tool. The UNAGI tool lies in the assumption that the dataset is a time-series single-cell data and time information of each cell is known. Thus, to use the UNAGI tool, it's mandatory to annotate the time-point information for each cell.\n",
    "\n",
    "This example shows how to append time point attributes to the annData object. These time points should be sequentially organized as [0, 1, 2, ..., n]. (0->n, from early time points to late time points) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define the project name and load the data\n",
    "import warnings\n",
    "import scanpy as sc\n",
    "import os\n",
    "warnings.filterwarnings('ignore')\n",
    "PATH_TO_YOUR_DATA = 'your_data.h5ad'\n",
    "adata = sc.read(PATH_TO_YOUR_DATA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following code will assign the stage key to each batch according to their time points. (e.g. Assuming the time-series dataset has 3 batches, each comes from an individual time point.) \n",
    "\n",
    "**Note:** UNAGI tool requires the time point $\\geq$ 3 time points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using 'stage' as the key for the stage information in the adata.obs\n",
    "stage_key = 'stage' # change this to whatever you want\n",
    "adata.obs[stage_key] = None\n",
    "\n",
    "#Assume the batch information is in adata.obs['batch'], and the batch names are batch1, batch2, batch3....\n",
    "# Change the following code according to your data\n",
    "adata.obs.loc[adata.obs['batch'] == 'batch1', stage_key] = '0'\n",
    "adata.obs.loc[adata.obs['batch'] == 'batch2', stage_key] = '1'\n",
    "adata.obs.loc[adata.obs['batch'] == 'batch3', stage_key] = '2'\n",
    "#...."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After appending the time-points information, you can either write the whole dataset into the disk or divided it into individual stages and then write to the disk.   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Option 1: Save the data in the disk\n",
    "adata.write(f'{PATH_TO_YOUR_DATA}', compression='gzip', compression_opts=9)\n",
    "\n",
    "# Option 2: Seperate the data into different stages and save them\n",
    "import os\n",
    "dir_name = os.path.dirname(PATH_TO_YOUR_DATA)\n",
    "\n",
    "for each in list(adata.obs[stage_key].unique()):\n",
    "    stage_adata = adata[adata.obs[stage_key] == each]\n",
    "    stage_adata.write(f'{dir_name}/{each}.h5ad', compression='gzip', compression_opts=9)"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}