{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Préambule : nos biais inconscients"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nous vous proposons, si vous le souhaitez, de prendre une dizaine de minutes pour tester vos biais inconscients:\n",
"\n",
"https://implicit.harvard.edu/implicit/canadafr/takeatest.html\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# TD 2: Manipulation des données"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a best practice, please use a virtual environment following the step explained here\n",
"[python virtual env](https://ecampus.paris-saclay.fr/pluginfile.php/3606005/mod_folder/content/0/installation.md?forcedownload=1)\n",
"\n",
"```\n",
"numpy==1.25\n",
"fairlearn==0.9.0\n",
"plotly==5.24.1\n",
"nbformat==5.10.4\n",
"ipykernel==6.29.5\n",
"aif360['inFairness']==0.6.1\n",
"causal-learn==0.1.4.0\n",
"```\n",
"\n",
"\n",
"You with also need to have R installed \n",
"`sudo apt install r-base-core`"
]
},
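{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check that your virtual environment matches the pinned versions above, the next cell is a small sketch using `importlib.metadata`; the package names are simply the ones from the list above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check (sketch): print the installed version of each pinned package\n",
"from importlib.metadata import version, PackageNotFoundError\n",
"\n",
"for pkg in [\"numpy\", \"fairlearn\", \"plotly\", \"nbformat\", \"ipykernel\", \"aif360\", \"causal-learn\"]:\n",
"    try:\n",
"        print(f\"{pkg}: {version(pkg)}\")\n",
"    except PackageNotFoundError:\n",
"        print(f\"{pkg}: not installed\")"
]
},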
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this TD we will use data from the [Medical Expenditure Panel Survey](https://meps.ahrq.gov/mepsweb/). The TD is inspired from [AIF360 tutorial](https://github.com/Trusted-AI/AIF360/blob/main/examples/tutorial_medical_expenditure.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download the dataset\n",
"\n",
"Use the command lines below, if you encounter a problem do not hesitate to call for help\n",
"\n",
"``` \n",
"python_bin_path=\"$(which python)\"; \\\n",
"meps_path=\"$(dirname $python_bin_path})\"; \\\n",
"cd $meps_path; \\\n",
"cd ../lib/python3.10/site-packages/aif360/data/raw/meps; \\\n",
"Rscript generate_data.R\n",
"```\n",
"\n",
"It will ask to read the rules and restrictions to download and use this dataset.\n",
"This is because the dataset is a medical dataset witl real person information.\n",
"\n",
"The download can take a bit of time"
]
},
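{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the hard-coded `../lib/python3.10/...` path does not match your installation (for instance a different Python version), one possible alternative, sketched below, is to locate the `aif360` MEPS raw-data folder directly from Python and run `Rscript generate_data.R` there."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: locate the aif360 MEPS raw-data folder from Python\n",
"# (alternative to the hard-coded ../lib/python3.10/... path above)\n",
"import os\n",
"import aif360\n",
"\n",
"meps_dir = os.path.join(os.path.dirname(aif360.__file__), \"data\", \"raw\", \"meps\")\n",
"print(meps_dir)  # run `Rscript generate_data.R` inside this folder"
]
},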
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"import numpy as np\n",
"import pandas as pd\n",
"import plotly.express as px\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"warnings.simplefilter(action='ignore', append=True, category=UserWarning)\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Datasets\n",
"from aif360.datasets import MEPSDataset19\n",
"from aif360.datasets import MEPSDataset20\n",
"from aif360.datasets import MEPSDataset21\n",
"\n",
"MEPSDataset19_data = MEPSDataset19()"
]
},
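{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going further, it can help to look at the metadata exposed by the AIF360 wrapper (label name, protected attribute, favorable label, number of features). The next cell is a quick sketch of that, using attributes of the `MEPSDataset19_data` object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick look at the metadata exposed by the AIF360 dataset wrapper\n",
"print(\"label:\", MEPSDataset19_data.label_names)\n",
"print(\"protected attributes:\", MEPSDataset19_data.protected_attribute_names)\n",
"print(\"favorable label:\", MEPSDataset19_data.favorable_label)\n",
"print(\"number of features:\", len(MEPSDataset19_data.feature_names))\n",
"print(\"number of instances:\", MEPSDataset19_data.features.shape[0])"
]
},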
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"(dataset_orig_panel19_train,\n",
" dataset_orig_panel19_val,\n",
" dataset_orig_panel19_test) = MEPSDataset19().split([0.5, 0.8], shuffle=True)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7915, 4749, 3166)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(dataset_orig_panel19_train.instance_weights), len(dataset_orig_panel19_val.instance_weights), len(dataset_orig_panel19_test.instance_weights)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nous vous conseillons d'aller voir les pages [MEPSDataset19](https://aif360.readthedocs.io/en/latest/modules/generated/aif360.datasets.MEPSDataset19.html) et [AIF360 tutorial](https://github.com/Trusted-AI/AIF360/blob/main/examples/tutorial_medical_expenditure.ipynb) pour mieux comprendre le dataset.\n",
"\n",
"Ce qu'il faut avoir lu:\n",
"- **The sensitive attribute is 'RACE' :1 is privileged, 0 is unprivileged** ; It is constructed as follows: 'Whites' (privileged class) defined by the features RACEV2X = 1 (White) and HISPANX = 2 (non Hispanic); 'Non-Whites' that included everyone else.\n",
"(The features 'RACEV2X', 'HISPANX' etc are removed, and replaced by the 'RACE')\n",
"- **'UTILIZATION' is the outcome (the label to predict for a ML model) 0 is positive 1 is negative**. It is a binary composite feature, created to measure the total number of trips requiring some sort of medical care, it sum up the following features (that are removed from the data):\n",
" * OBTOTV15(16), the number of office based visits\n",
" * OPTOTV15(16), the number of outpatient visits\n",
" * ERTOT15(16), the number of ER visits\n",
" * IPNGTD15(16), the number of inpatient nights\n",
" * HHTOTD16, the number of home health visits\n",
"UTILISATION is set to 1 when te sum is above or equal to 10, else it is set to 0\n",
"- **The dataset is weighted** The dataset come with an 'instance_weights' attribute that corresponds to the feature perwt15f these weights are supposed to generate estimates that are representative of the United State (US) population in 2015.\n",
"\n",
"\n",
"Ce qu'il faut avoir retenu:\n",
"- **The sensitive attribute is 'RACE' :1 is privileged, 0 is unprivileged**\n",
"- **'UTILIZATION' is the outcome (the label to predict for a ML model) 0 is positive 1 is negative**\n",
"- **The dataset is weighted**\n",
"\n"
]
},
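{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the construction of UTILIZATION concrete, the next cell sketches how such a composite label could be built from the raw visit counts. The toy dataframe and its values are purely illustrative, since the raw columns are removed from the AIF360 version of the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration only: rebuild the UTILIZATION logic on a toy dataframe\n",
"# (the real raw columns are removed from the AIF360 dataset)\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    \"OBTOTV15\": [2, 8, 0],   # office based visits\n",
"    \"OPTOTV15\": [0, 1, 0],   # outpatient visits\n",
"    \"ERTOT15\":  [1, 2, 0],   # ER visits\n",
"    \"IPNGTD15\": [0, 3, 0],   # inpatient nights\n",
"    \"HHTOTD16\": [0, 0, 0],   # home health visits\n",
"})\n",
"total_trips = toy.sum(axis=1)\n",
"toy[\"UTILIZATION\"] = (total_trips >= 10).astype(int)  # 1 when the sum is >= 10\n",
"toy"
]
},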
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([21854.981705, 18169.604822, 17191.832515, ..., 3896.116219,\n",
" 4883.851005, 6630.588948])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"instance_weights = MEPSDataset19_data.instance_weights\n",
"instance_weights\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Taille du dataset 15830, poids total du dataset 141367240.546316.'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f\"Taille du dataset {len(instance_weights)}, poids total du dataset {instance_weights.sum()}.\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Premier appercu du dataset\n",
"\n",
"La librairie AIF360 fournie une surcouche au dataset, cela le rend un peu moins intuitif d'utilisation (par exemple pour étudier/visualiser les attributs un à un), mais elle permet de calculer les métrique des fairness en une ligne de commande."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:\n",
"pip install 'aif360[AdversarialDebiasing]'\n",
"WARNING:root:No module named 'tensorflow': AdversarialDebiasing will be unavailable. To install, run:\n",
"pip install 'aif360[AdversarialDebiasing]'\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.49826823461176517\n"
]
}
],
"source": [
"from aif360.metrics import BinaryLabelDatasetMetric\n",
"from aif360.metrics import ClassificationMetric\n",
"\n",
"metric_orig_panel19_train = BinaryLabelDatasetMetric(\n",
" MEPSDataset19_data,\n",
" unprivileged_groups=[{'RACE': 0}],\n",
" privileged_groups=[{'RACE': 1}])\n",
"\n",
"print(metric_orig_panel19_train.disparate_impact())"
]
},
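{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what this one-liner computes, the next cell re-derives the disparate impact by hand as a sketch: the weighted rate of the favorable outcome in the unprivileged group (RACE = 0) divided by the same rate in the privileged group (RACE = 1), using the arrays exposed by `MEPSDataset19_data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: disparate impact by hand = weighted favorable-outcome rate of the\n",
"# unprivileged group divided by that of the privileged group\n",
"labels = MEPSDataset19_data.labels.ravel()\n",
"race = MEPSDataset19_data.protected_attributes[\n",
"    :, MEPSDataset19_data.protected_attribute_names.index('RACE')]\n",
"weights = MEPSDataset19_data.instance_weights\n",
"favorable = (labels == MEPSDataset19_data.favorable_label)\n",
"\n",
"def weighted_rate(mask):\n",
"    # weighted proportion of favorable outcomes within the selected group\n",
"    return np.average(favorable[mask], weights=weights[mask])\n",
"\n",
"weighted_rate(race == 0) / weighted_rate(race == 1)  # should match disparate_impact() above"
]
},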
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cependant le but de ce TD étant encore de manipuler les données et de les analyser nous allons revenir aux données sous forme d'un dataframe.\n",
"\n",
"Note pour calculer les métriques de fairness sans avoir à les réimplémenter dans le cas pondéré (instances weights) vous pouvez utiliser les méthodes implémenter dans AIF360 là [Implémentation Métriques de Fairness](https://aif360.readthedocs.io/en/latest/modules/sklearn.html#module-aif360.sklearn.metrics)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Conversion en un dataframe\n",
"\n",
"Nous avons vu que la somme des poids est conséquente, pres de 115millions nous ne pouvons donc pas raisonneblement dupliqué chaque ligne autant de fois que son poids.\n",
"\n",
"Nous allons stocker la pondération et la prendre en compte ensuite dans notre analyse"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_df(MepsDataset):\n",
" data = MepsDataset.convert_to_dataframe()\n",
" # data_train est un tuple, avec le data_frame et un dictionnaire avec toutes les infos (poids, attributs sensibles etc)\n",
" df = data[0]\n",
" df['WEIGHT'] = data[1]['instance_weights']\n",
" return df\n",
"\n",
"df = get_df(MEPSDataset19_data)\n",
"\n"
]
},
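{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a first use of the stored weights, the next cell sketches a per-group summary: for each value of RACE, the unweighted and weighted share of the positive outcome (UTILIZATION = 0, following the convention above), plus the total weight of the group."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: unweighted vs weighted share of the positive outcome (UTILIZATION == 0) per RACE group\n",
"positive = (df[\"UTILIZATION\"] == 0).astype(float)\n",
"summary = pd.DataFrame({\n",
"    \"unweighted_rate\": positive.groupby(df[\"RACE\"]).mean(),\n",
"    \"weighted_rate\": (positive * df[\"WEIGHT\"]).groupby(df[\"RACE\"]).sum()\n",
"                     / df[\"WEIGHT\"].groupby(df[\"RACE\"]).sum(),\n",
"    \"total_weight\": df[\"WEIGHT\"].groupby(df[\"RACE\"]).sum(),\n",
"})\n",
"summary"
]
},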
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1.1848351529675123, 0.7849286063696154)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from aif360.sklearn.metrics import disparate_impact_ratio, base_rate\n",
"dir = disparate_impact_ratio(\n",
" y_true=df.UTILIZATION, \n",
" prot_attr=df.RACE, \n",
" pos_label=0,\n",
" sample_weight=df.WEIGHT)\n",
"br =base_rate(\n",
" y_true=df.UTILIZATION, \n",
" pos_label=0,\n",
" sample_weight=df.WEIGHT)\n",
"dir,br"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(1.1746792888264614, 0.8283006948831333)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dir = disparate_impact_ratio(\n",
" y_true=df.UTILIZATION, \n",
" prot_attr=df.RACE, \n",
" pos_label=0)\n",
"br =base_rate(\n",
" y_true=df.UTILIZATION, \n",
" pos_label=0)\n",
"dir,br"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 1 - Apprendre un modèle pour prédire le fait d'être réadmis\n",
"### 1.1 - Faire le pre-processing des données\n",
"\n",
"Ici ce pre-processing a déjà été fait par AIF, nous avons simplement converti le dataset en dataframe pour pouvoir le manipuler librement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1.2 - Creer les échantillons d'apprentissage, de validation et de test\n",
"\n",
"Pour créer le df_X il faut enlever l'outcome (\"UTILIZATION\") et la pondération (\"WEIGHT\")\n",
"\n",
"La colonne \"UTILIZATION\" sera le label (noté y)\n",
"\n",
"La colonne \"WEIGHT\" sera la pondération (notée w)\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"df_X = df.drop(columns=[\"UTILIZATION\", \"WEIGHT\"])\n",
"splits_trainval_test = train_test_split(\n",
" df_X, df[\"UTILIZATION\"], df[\"WEIGHT\"],\n",
" train_size=0.8, \n",
" random_state=42)\n",
"\n",
"X_trainval, X_test, y_trainval, y_test, w_trainval, w_test = splits_trainval_test\n",
"splits_train_val = train_test_split(\n",
" X_trainval, y_trainval, w_trainval,\n",
" train_size=0.625, \n",
" random_state=42)\n",
"X_train, X_val, y_train, y_val, w_train, w_val = splits_train_val"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"((7915, 138),\n",
" (7915,),\n",
" (7915,),\n",
" (4749, 138),\n",
" (4749,),\n",
" (4749,),\n",
" (3166, 138),\n",
" (3166,),\n",
" (3166,))"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape, y_train.shape, w_train.shape, X_val.shape, y_val.shape, w_val.shape, X_test.shape, y_test.shape, w_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Question 1.3 - Apprendre une regression logistique dont le but est de prédire UTILIZATION"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.8429447781148914"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.pipeline import make_pipeline\n",
"\n",
"\n",
"model = make_pipeline(StandardScaler(),LogisticRegression(random_state=42))\n",
"\n",
"model = model.fit(X_train, y_train, **{'logisticregression__sample_weight':w_train})\n",
"\n",
"preds = model.predict(X_val)\n",
"\n",
"model.score(X_val, y_val, sample_weight=w_val)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Quesiton 1.4 Performance du modèle (afficher la matrice de confusion)"
]
},
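{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach, sketched in the next cell: compute the confusion matrix on the validation set with `sklearn.metrics.confusion_matrix` (optionally weighted with `w_val`) and display it with `plotly.express.imshow`, reusing `preds`, `y_val` and `w_val` from the cells above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: confusion matrix on the validation set, displayed as a plotly heatmap\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"cm = confusion_matrix(y_val, preds)  # pass sample_weight=w_val for the weighted version\n",
"print(cm)\n",
"fig = px.imshow(\n",
"    cm,\n",
"    text_auto=True,\n",
"    labels={\"x\": \"Pred\", \"y\": \"Truth\", \"color\": \"count\"},\n",
")\n",
"fig.show()"
]
},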
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3773 176 456 344\n"
]
},
{
"data": {
"text/html": [
" \n",
" "
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.plotly.v1+json": {
"config": {
"plotlyServerURL": "https://plot.ly"
},
"data": [
{
"coloraxis": "coloraxis",
"hovertemplate": "Pred: %{x}
Truth: %{y}
color: %{z}