Composé Seaborn avec axe des y normalisé par groupe

Question

Je me demandais s'il était possible de créer un diagramme de comptage Seaborn, mais au lieu des comptages réels sur l'axe des ordonnées, affichez la fréquence relative (pourcentage) dans son groupe (comme spécifié avec le paramètre hue).

J'ai en quelque sorte résolu cela avec l'approche suivante, mais je ne peux pas imaginer que c'est l'approche la plus simple:

# Plot percentage of occupation per income class grouped = df.groupby(['income'], sort=False) occupation_counts = grouped['occupation'].value_counts(normalize=True, sort=False) occupation_data = [ {'occupation': occupation, 'income': income, 'percentage': percentage*100} for (income, occupation), percentage in dict(occupation_counts).items() ] df_occupation = pd.DataFrame(occupation_data) p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation) _ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels

Résultat:

J'utilise le jeu de données bien connu pour adultes du référentiel d'apprentissage automatique de l'UCI . Le dataframe des pandas est créé comme ceci:

# Read the adult dataset df = pd.read_csv( "data/adult.data", engine='c', lineterminator='
', names=['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income'], header=None, skipinitialspace=True, na_values="?" )

Cette question est en quelque sorte liée, mais n'utilise pas le paramètre hue. Et dans mon cas, je ne peux pas simplement changer les étiquettes sur l'axe des y, car la hauteur de la barre doit dépendre du groupe.

Pietro Battiston · Accepted Answer

Je pourrais être confus. La différence entre votre sortie et la sortie de

occupation_counts = (df.groupby(['income'])['occupation'] .value_counts(normalize=True) .rename('percentage') .mul(100) .reset_index() .sort_values('occupation')) p = sns.barplot(x="occupation", y="percentage", hue="income", data=occupation_counts) _ = plt.setp(p.get_xticklabels(), rotation=90) # Rotate labels

n’est, me semble-t-il, que l’ordre des colonnes.

Et vous semblez vous en soucier, puisque vous passez sort=False. Mais ensuite, dans votre code, l'ordre est déterminé uniquement par hasard (et l'ordre dans lequel le dictionnaire est itéré change même d'une exécution à l'autre avec Python 3.5).

BirdLaw · Answer

Je me suis dit que Seaborn ne fournissait rien de tel en dehors de la boîte.

Néanmoins, il était assez facile de modifier le code source pour obtenir ce que vous vouliez . Le code suivant, avec la fonction "percentplot (x, hue, data)" fonctionne exactement comme sns.countplot, mais les normes sont définies par barre. (ie divise la valeur de chaque barre verte par la somme de toutes les barres vertes)

En effet, cela transforme ceci (difficile à interpréter car N différents d'Apple par rapport à Android): sns.countplot (Normalement afin que les barres reflètent la proportion du total pour Apple par rapport à Android) : Pourcentageplot

J'espère que cela t'aides!!

from seaborn.categorical import _CategoricalPlotter, remove_na import matplotlib as mpl class _CategoricalStatPlotter(_CategoricalPlotter): @property def nested_width(self): """A float with the width of plot elements when hue nesting is used.""" return self.width / len(self.hue_names) def estimate_statistic(self, estimator, ci, n_boot): if self.hue_names is None: statistic = [] confint = [] else: statistic = [[] for _ in self.plot_data] confint = [[] for _ in self.plot_data] for i, group_data in enumerate(self.plot_data): # Option 1: we have a single layer of grouping # -------------------------------------------- if self.plot_hues is None: if self.plot_units is None: stat_data = remove_na(group_data) unit_data = None else: unit_data = self.plot_units[i] have = pd.notnull(np.c_[group_data, unit_data]).all(axis=1) stat_data = group_data[have] unit_data = unit_data[have] # Estimate a statistic from the vector of data if not stat_data.size: statistic.append(np.nan) else: statistic.append(estimator(stat_data, len(np.concatenate(self.plot_data)))) # Get a confidence interval for this estimate if ci is not None: if stat_data.size < 2: confint.append([np.nan, np.nan]) continue boots = bootstrap(stat_data, func=estimator, n_boot=n_boot, units=unit_data) confint.append(utils.ci(boots, ci)) # Option 2: we are grouping by a hue layer # ---------------------------------------- else: for j, hue_level in enumerate(self.hue_names): if not self.plot_hues[i].size: statistic[i].append(np.nan) if ci is not None: confint[i].append((np.nan, np.nan)) continue hue_mask = self.plot_hues[i] == hue_level group_total_n = (np.concatenate(self.plot_hues) == hue_level).sum() if self.plot_units is None: stat_data = remove_na(group_data[hue_mask]) unit_data = None else: group_units = self.plot_units[i] have = pd.notnull( np.c_[group_data, group_units] ).all(axis=1) stat_data = group_data[hue_mask & have] unit_data = group_units[hue_mask & have] # Estimate a statistic from the vector of data if not stat_data.size: statistic[i].append(np.nan) else: statistic[i].append(estimator(stat_data, group_total_n)) # Get a confidence interval for this estimate if ci is not None: if stat_data.size < 2: confint[i].append([np.nan, np.nan]) continue boots = bootstrap(stat_data, func=estimator, n_boot=n_boot, units=unit_data) confint[i].append(utils.ci(boots, ci)) # Save the resulting values for plotting self.statistic = np.array(statistic) self.confint = np.array(confint) # Rename the value label to reflect the estimation if self.value_label is not None: self.value_label = "{}({})".format(estimator.__name__, self.value_label) def draw_confints(self, ax, at_group, confint, colors, errwidth=None, capsize=None, **kws): if errwidth is not None: kws.setdefault("lw", errwidth) else: kws.setdefault("lw", mpl.rcParams["lines.linewidth"] * 1.8) for at, (ci_low, ci_high), color in Zip(at_group, confint, colors): if self.orient == "v": ax.plot([at, at], [ci_low, ci_high], color=color, **kws) if capsize is not None: ax.plot([at - capsize / 2, at + capsize / 2], [ci_low, ci_low], color=color, **kws) ax.plot([at - capsize / 2, at + capsize / 2], [ci_high, ci_high], color=color, **kws) else: ax.plot([ci_low, ci_high], [at, at], color=color, **kws) if capsize is not None: ax.plot([ci_low, ci_low], [at - capsize / 2, at + capsize / 2], color=color, **kws) ax.plot([ci_high, ci_high], [at - capsize / 2, at + capsize / 2], color=color, **kws) class _BarPlotter(_CategoricalStatPlotter): """Show point estimates and confidence intervals with bars.""" def __init__(self, x, y, hue, data, order, hue_order, estimator, ci, n_boot, units, orient, color, palette, saturation, errcolor, errwidth=None, capsize=None): """Initialize the plotter.""" self.establish_variables(x, y, hue, data, orient, order, hue_order, units) self.establish_colors(color, palette, saturation) self.estimate_statistic(estimator, ci, n_boot) self.errcolor = errcolor self.errwidth = errwidth self.capsize = capsize def draw_bars(self, ax, kws): """Draw the bars onto `ax`.""" # Get the right matplotlib function depending on the orientation barfunc = ax.bar if self.orient == "v" else ax.barh barpos = np.arange(len(self.statistic)) if self.plot_hues is None: # Draw the bars barfunc(barpos, self.statistic, self.width, color=self.colors, align="center", **kws) # Draw the confidence intervals errcolors = [self.errcolor] * len(barpos) self.draw_confints(ax, barpos, self.confint, errcolors, self.errwidth, self.capsize) else: for j, hue_level in enumerate(self.hue_names): # Draw the bars offpos = barpos + self.hue_offsets[j] barfunc(offpos, self.statistic[:, j], self.nested_width, color=self.colors[j], align="center", label=hue_level, **kws) # Draw the confidence intervals if self.confint.size: confint = self.confint[:, j] errcolors = [self.errcolor] * len(offpos) self.draw_confints(ax, offpos, confint, errcolors, self.errwidth, self.capsize) def plot(self, ax, bar_kws): """Make the plot.""" self.draw_bars(ax, bar_kws) self.annotate_axes(ax) if self.orient == "h": ax.invert_yaxis() def percentageplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, orient=None, color=None, palette=None, saturation=.75, ax=None, **kwargs): # Estimator calculates required statistic (proportion) estimator = lambda x, y: (float(len(x))/y)*100 ci = None n_boot = 0 units = None errcolor = None if x is None and y is not None: orient = "h" x = y Elif y is None and x is not None: orient = "v" y = x Elif x is not None and y is not None: raise TypeError("Cannot pass values for both `x` and `y`") else: raise TypeError("Must pass values for either `x` or `y`") plotter = _BarPlotter(x, y, hue, data, order, hue_order, estimator, ci, n_boot, units, orient, color, palette, saturation, errcolor) plotter.value_label = "Percentage" if ax is None: ax = plt.gca() plotter.plot(ax, kwargs) return ax

Ted Petrou · Answer

Vous pouvez utiliser la bibliothèque Dexplot pour compter et normaliser toute variable afin d’obtenir des fréquences relatives.

Passez la variable aggplot une chaîne/catégorie au paramètre agg et il produira automatiquement un graphique à barres des comptes de toutes les valeurs uniques. Utilisez hue pour subdiviser les nombres en une autre variable. Notez que Dexplot encapsule automatiquement les étiquettes x-tick.

dxp.aggplot(agg='occupation', data=df, hue='income')

Utilisez le paramètre normalize pour normaliser les comptes par rapport à toute variable (ou combinaison de variables avec un tuple). Vous pouvez également utiliser "all" pour normaliser le total général des comptes.

dxp.aggplot('occupation', data=df, hue='income', normalize='income')