{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "QZJSs4w9BCK-" }, "source": [ "# TD 1: Fairness notion examples\n", "\n", "In this first TD, we explore a dataset and observe how the different fairness metrics behave on it." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "qRJyTtCtBCLA" }, "source": [ "Environment: Python 3.10" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 41006, "status": "ok", "timestamp": 1707201882799, "user": { "displayName": "Alice H", "userId": "13901604971984976961" }, "user_tz": -60 }, "id": "Ozn0aZUVBKKg", "outputId": "beda3d58-c106-4b6f-d632-b414af965070" }, "outputs": [], "source": [ "!python --version\n", "!pip install --upgrade pip\n", "!pip install numpy==1.25\n", "!pip install fairlearn==0.9.0\n", "!pip install plotly==5.24.1\n", "!pip install nbformat==5.10.4\n", "!pip install ipykernel==6.29.5" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "Oam0vn35n7jK" }, "source": [ "# !!! Warning !!! After running the cell above, restart the session (\"Runtime\" tab) so that the installed environment is loaded" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "MCKXApFuBCLC" }, "source": [ "\n", "## Objectives\n", "\n", "\n", " 1. Study the data, the distribution of each feature and its relation to the target.\n", "\n", " 2. Highlight some of the biases present in the data\n", "\n", " 3. Train a basic machine learning model using logistic regression\n", "\n", " 4. 
Compute the confusion matrix and different fairness metrics\n", "\n", "## Dataset: Diabetes 130-Hospitals\n", "\n", "\n", "https://fairlearn.org/main/api_reference/generated/fairlearn.datasets.fetch_diabetes_hospital.html\n", "\n", "This dataset contains 101,766 rows, each describing a patient hospitalized for diabetes for a stay of 1 to 14 days. The data were collected over 10 years across 130 different hospitals. Each record has 25 features covering medical as well as demographic information. Finally, the 'readmitted' column indicates whether the patient was readmitted and, if so, whether it happened within 30 days or later. This column is binarized into two others: 'readmit_30_days' (True if readmitted within 30 days, False otherwise) and 'readmitted' (True if readmitted, False otherwise).\n", "\n", "We will use the 'readmit_30_days' column as the label (ground truth).\n", "\n", "To keep things simple, we consider only a subset of 14 of the provided features:\n", "age, gender, race, time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_diagnoses, max_glu_serum, A1Cresult, insulin, had_emergency, had_inpatient_days, had_outpatient_days\n", "\n", "\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "id": "v6YE9bSd3qgG" }, "source": [ "## Download and simplify the dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 19, "status": "ok", "timestamp": 1707201882800, "user": { "displayName": "Alice H", "userId": "13901604971984976961" }, "user_tz": -60 }, "id": "XbEeXIwwKLqa", "outputId": "d2c14362-19a1-40df-de7c-de63275d5c5d" }, "outputs": [ { "data": { "text/plain": [ "('1.25.0', '0.9.0')" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import fairlearn\n", "np.__version__, fairlearn.__version__" ] }, { "cell_type": "code", 
"execution_count": 3, "metadata": { "executionInfo": { "elapsed": 19762, "status": "ok", "timestamp": 1707201902546, "user": { "displayName": "Alice H", "userId": "13901604971984976961" }, "user_tz": -60 }, "id": "n3HDPkgdBCLD" }, "outputs": [], "source": [ "from fairlearn.datasets import fetch_diabetes_hospital\n", "dataset = fetch_diabetes_hospital()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 669 }, "executionInfo": { "elapsed": 11, "status": "ok", "timestamp": 1707201902547, "user": { "displayName": "Alice H", "userId": "13901604971984976961" }, "user_tz": -60 }, "id": "eS5ap_RGBCLG", "outputId": "c51a8989-1135-4177-8c61-1196c30e428f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/tmp/ipykernel_1912488/3640427316.py:20: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n", " df.had_emergency = df.had_emergency.replace({\"True\":1, \"False\":0})\n", "/tmp/ipykernel_1912488/3640427316.py:20: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.\n", " df.had_emergency = df.had_emergency.replace({\"True\":1, \"False\":0})\n", "/tmp/ipykernel_1912488/3640427316.py:21: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. 
To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n", " df.had_inpatient_days = df.had_inpatient_days.replace({\"True\":1, \"False\":0})\n", "/tmp/ipykernel_1912488/3640427316.py:21: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.\n", " df.had_inpatient_days = df.had_inpatient_days.replace({\"True\":1, \"False\":0})\n", "/tmp/ipykernel_1912488/3640427316.py:22: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n", " df.had_outpatient_days = df.had_outpatient_days.replace({\"True\":1, \"False\":0})\n", "/tmp/ipykernel_1912488/3640427316.py:22: FutureWarning: The behavior of Series.replace (and DataFrame.replace) with CategoricalDtype is deprecated. In a future version, replace will only be used for cases that preserve the categories. To change the categories, use ser.cat.rename_categories instead.\n", " df.had_outpatient_days = df.had_outpatient_days.replace({\"True\":1, \"False\":0})\n" ] }, { "data": { "text/html": [ "
|  | age | gender | race | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_diagnoses | max_glu_serum | A1Cresult | insulin | had_emergency | had_inpatient_days | had_outpatient_days | readmit_30_days |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | '30 years or younger' | Female | Caucasian | 1 | 41 | 0 | 1 | 1 | None | None | No | 0 | 0 | 0 | 0 |
| 1 | '30 years or younger' | Female | Caucasian | 3 | 59 | 0 | 18 | 9 | None | None | Up | 0 | 0 | 0 | 0 |
| 2 | '30 years or younger' | Female | AfricanAmerican | 2 | 11 | 5 | 13 | 6 | None | None | No | 0 | 1 | 1 | 0 |
| 3 | '30-60 years' | Male | Caucasian | 2 | 44 | 1 | 16 | 7 | None | None | Up | 0 | 0 | 0 | 0 |
| 4 | '30-60 years' | Male | Caucasian | 1 | 51 | 0 | 8 | 5 | None | None | Steady | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | 'Over 60 years' | Male | AfricanAmerican | 3 | 51 | 0 | 16 | 9 | None | >8 | Down | 0 | 0 | 0 | 0 |
| 101762 | 'Over 60 years' | Female | AfricanAmerican | 5 | 33 | 3 | 18 | 9 | None | None | Steady | 0 | 1 | 0 | 0 |
| 101763 | 'Over 60 years' | Male | Caucasian | 1 | 53 | 0 | 9 | 13 | None | None | Down | 0 | 0 | 1 | 0 |
| 101764 | 'Over 60 years' | Female | Caucasian | 10 | 45 | 2 | 21 | 9 | None | None | Up | 0 | 1 | 0 | 0 |
| 101765 | 'Over 60 years' | Male | Caucasian | 6 | 13 | 3 | 3 | 9 | None | None | No | 0 | 0 | 0 | 0 |

101766 rows × 15 columns
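
The `FutureWarning` messages in the cell output above are triggered by calling `replace` on categorical columns to turn the strings `"True"`/`"False"` into 1/0. As a hedged alternative, the sketch below does the same conversion by casting to strings and using `map`. The toy DataFrame and the assumption that the three flag columns hold the string categories `"True"` and `"False"` are illustrative; they are inferred from the warning text, not from the full notebook.

```python
import pandas as pd

# Toy stand-in for the three flag columns (assumed to be categorical
# columns holding the strings "True" / "False", as the warnings suggest).
flag_columns = ["had_emergency", "had_inpatient_days", "had_outpatient_days"]
toy = pd.DataFrame(
    {col: pd.Categorical(["True", "False", "False"]) for col in flag_columns}
)

mapping = {"True": 1, "False": 0}
for col in flag_columns:
    # Cast to plain strings first, then map each string to its integer code.
    toy[col] = toy[col].astype(str).map(mapping).astype("int64")

print(toy.dtypes)
print(toy)
```

Casting to `str` before mapping sidesteps the categorical-specific behaviour that the warnings describe, at the cost of a temporary object column.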
|  | time_in_hospital | num_lab_procedures | num_procedures | num_medications | number_diagnoses | race_AfricanAmerican | race_Asian | race_Caucasian | race_Hispanic | race_Other | ... | A1Cresult_Norm | insulin_Down | insulin_No | insulin_Steady | insulin_Up | had_outpatient_days_0 | had_outpatient_days_1 | gender_Female | gender_Male | gender_Unknown/Invalid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 41 | 0 | 1 | 1 | False | False | True | False | False | ... | False | False | True | False | False | True | False | True | False | False |
| 1 | 3 | 59 | 0 | 18 | 9 | False | False | True | False | False | ... | False | False | False | False | True | True | False | True | False | False |
| 2 | 2 | 11 | 5 | 13 | 6 | True | False | False | False | False | ... | False | False | True | False | False | False | True | True | False | False |
| 3 | 2 | 44 | 1 | 16 | 7 | False | False | True | False | False | ... | False | False | False | False | True | True | False | False | True | False |
| 4 | 1 | 51 | 0 | 8 | 5 | False | False | True | False | False | ... | False | False | False | True | False | True | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 101761 | 3 | 51 | 0 | 16 | 9 | True | False | False | False | False | ... | False | True | False | False | False | True | False | False | True | False |
| 101762 | 5 | 33 | 3 | 18 | 9 | True | False | False | False | False | ... | False | False | False | True | False | True | False | True | False | False |
| 101763 | 1 | 53 | 0 | 9 | 13 | False | False | True | False | False | ... | False | True | False | False | False | False | True | False | True | False |
| 101764 | 10 | 45 | 2 | 21 | 9 | False | False | True | False | False | ... | False | False | False | False | True | True | False | True | False | False |
| 101765 | 6 | 13 | 3 | 3 | 9 | False | False | True | False | False | ... | False | False | True | False | False | True | False | False | True | False |

101766 rows × 35 columns
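
The 35-column table above has the shape one would expect from one-hot encoding the categorical features, with names such as `race_Caucasian` and `gender_Female`. The cell that produced it is not shown here, so the following is only a minimal sketch, assuming `pandas.get_dummies` (or something equivalent) was used; the toy DataFrame is invented for illustration.

```python
import pandas as pd

# Toy data mimicking a few rows of the simplified dataset (illustrative only).
toy = pd.DataFrame(
    {
        "time_in_hospital": [1, 3, 2],
        "race": ["Caucasian", "Caucasian", "AfricanAmerican"],
        "gender": ["Female", "Female", "Male"],
    }
)

# One indicator column per category, named "<column>_<category>".
# Numeric columns that are not listed in `columns` are left unchanged.
encoded = pd.get_dummies(toy, columns=["race", "gender"])

print(encoded.columns.tolist())
print(encoded)
```

Each original categorical column is replaced by as many indicator columns as it has distinct values, which is why the column count grows from 15 to 35 in the full dataset.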
\n", "