Sorting by a custom list in pandas

Question

After reading: http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.DataFrame.sort.html

I still can't figure out how to sort a column by a custom list. Obviously, the default sort is alphabetical. Let me give an example. Here is my (very abridged) DataFrame:

                    Player  Year  Age   Tm   G
    2967     Cedric Hunter  1991   27  CHH   6
    5335     Maurice Baker  2004   25  VAN   7
    13950      Ratko Varda  2001   22  TOT  60
    6141        Ryan Bowen  2009   34  OKC  52
    6169   Adrian Caldwell  1997   31  DAL  81

I want to be able to sort by Player, Year and then Tm. The default sort for Player and Year, in normal order, is fine for me. However, I do not want the team sorted alphabetically, because I want TOT always at the top.

Here is the list I created:

sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN', 'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL', 'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI', 'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN', 'WAS', 'WSB']

After reading the link above, I thought this would work, but it didn't:

df.sort(['Player', 'Year', 'Tm'], ascending = [True, True, sorter])

It still has ATL at the top, meaning that it was sorted alphabetically and not according to my custom list. Any help would be greatly appreciated; I just can't figure this out.

Guillaume Jacquenot · Accepted Answer

Here is an example that performs a lexicographic sort on a DataFrame. The idea is to create a numerical index based on the custom order, then sort numerically on that index. To do so, a column is added to the DataFrame and removed afterwards.

    import pandas as pd

    # Create DataFrame
    df = pd.DataFrame({
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]})

    # Define the sorter
    sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
              'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
              'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
              'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
              'WAS', 'WSB']

    # Create the dictionary that defines the order for sorting
    sorterIndex = dict(zip(sorter, range(len(sorter))))

    # Generate a rank column that will be used to sort
    # the dataframe numerically
    df['Tm_Rank'] = df['Tm'].map(sorterIndex)

    # Here is the result asked with the lexicographic sort
    # Result may be hard to analyze, so a second sorting is
    # proposed next
    ## NOTE:
    ## Newer versions of pandas use 'sort_values' instead of 'sort'
    df.sort_values(['Player', 'Year', 'Tm_Rank'],
                   ascending=[True, True, True], inplace=True)
    df.drop('Tm_Rank', axis=1, inplace=True)
    print(df)

    # Here is an example where 'Tm' is sorted first, that will
    # give the first row of the DataFrame df to contain TOT as 'Tm'
    df['Tm_Rank'] = df['Tm'].map(sorterIndex)
    ## NOTE:
    ## Newer versions of pandas use 'sort_values' instead of 'sort'
    df.sort_values(['Tm_Rank', 'Player', 'Year'],
                   ascending=[True, True, True], inplace=True)
    df.drop('Tm_Rank', axis=1, inplace=True)
    print(df)
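
The rank-column idea above can also be wrapped in a small helper that leaves the original DataFrame untouched. This is only a sketch: the function name `sort_with_custom_order`, the temporary column name and the demo rows are made up for illustration and are not part of the answer or of pandas.

```python
import pandas as pd


def sort_with_custom_order(df, by, custom_col, order):
    """Sort df by the columns in `by`, ranking `custom_col` by its position in `order`."""
    # Map each allowed value to its position in the custom list.
    rank = {value: position for position, value in enumerate(order)}
    rank_col = f"{custom_col}_rank"  # temporary helper column (hypothetical name)
    # Sort on the rank column wherever the custom column appears in `by`.
    keys = [rank_col if col == custom_col else col for col in by]
    return (
        df.assign(**{rank_col: df[custom_col].map(rank)})
        .sort_values(keys)
        .drop(columns=rank_col)
    )


# Made-up rows: one player with three entries in the same year.
demo = pd.DataFrame({
    "Player": ["A. Example", "A. Example", "A. Example", "B. Sample"],
    "Year": [2001, 2001, 2001, 1999],
    "Tm": ["BOS", "TOT", "ATL", "DAL"],
})
sorter = ["TOT", "ATL", "BOS", "DAL"]

result = sort_with_custom_order(demo, ["Player", "Year", "Tm"], "Tm", sorter)
print(result)
```

Within the same Player and Year, the TOT row comes first and the remaining teams follow the order of the list. Values missing from the list get a NaN rank, which `sort_values` places last by default.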

dmeu · Answer

I just discovered that with pandas 0.15.1 it is possible to use categorical series ( http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#categoricals )

As in your example, let's define the same DataFrame and the same sorter:

    import pandas as pd

    data = {
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]
    }

    # Create DataFrame
    df = pd.DataFrame(data)

    # Define the sorter
    sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
              'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
              'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
              'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
              'WAS', 'WSB']

With the DataFrame and the sorter, which defines the category order, we can do the following in pandas 0.15.1:

    # Convert the Tm column to category and set the sorter as the categories hierarchy.
    # You could also do both steps in one line by appending .cat.set_categories(...)
    # to the astype() call. The result is assigned back because set_categories()
    # no longer accepts inplace=True in recent pandas versions.
    df.Tm = df.Tm.astype("category")
    df.Tm = df.Tm.cat.set_categories(sorter, ordered=True)
    print(df.Tm)
    Out[48]:
    0    CHH
    1    VAN
    2    TOT
    3    OKC
    4    DAL
    Name: Tm, dtype: category
    Categories (38, object): [TOT < ATL < BOS < BRK ... UTA < VAN < WAS < WSB]

    df.sort_values(["Tm"])  ## 'sort' changed to 'sort_values'
    Out[49]:
       Age   G           Player   Tm  Year     id
    2   22  60      Ratko Varda  TOT  2001  13950
    0   27   6    Cedric Hunter  CHH  1991   2967
    4   31  81  Adrian Caldwell  DAL  1997   6169
    3   34  52       Ryan Bowen  OKC  2009   6141
    1   25   7    Maurice Baker  VAN  2004   5335
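
To get the ordering asked for in the question (Player, then Year, then Tm in custom order), the categorical column can be used as the last sort key. Below is a minimal sketch, assuming a pandas version that provides `sort_values`; the demo rows and the shortened sorter are made up so that one player has several teams in the same year.

```python
import pandas as pd

# Made-up rows: one player with three entries in the same year.
demo = pd.DataFrame({
    "Player": ["A. Example", "A. Example", "A. Example", "B. Sample"],
    "Year": [2001, 2001, 2001, 1999],
    "Tm": ["BOS", "TOT", "ATL", "DAL"],
})
sorter = ["TOT", "ATL", "BOS", "DAL"]  # shortened list for the demo

# Build an ordered categorical whose category order is the custom list.
demo["Tm"] = pd.Categorical(demo["Tm"], categories=sorter, ordered=True)

# Player and Year sort normally; Tm follows the category order.
by_player = demo.sort_values(["Player", "Year", "Tm"])
print(by_player)
```

One thing to watch: any team that is not listed in the categories becomes NaN in the column when converted this way, so the sorter needs to cover every value present in the data.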

Mithril · Answer

My idea is to generate a sort number from the index, then merge that sort number into the original DataFrame:

    import pandas as pd

    df = pd.DataFrame({
        'id': [2967, 5335, 13950, 6141, 6169],
        'Player': ['Cedric Hunter', 'Maurice Baker',
                   'Ratko Varda', 'Ryan Bowen', 'Adrian Caldwell'],
        'Year': [1991, 2004, 2001, 2009, 1997],
        'Age': [27, 25, 22, 34, 31],
        'Tm': ['CHH', 'VAN', 'TOT', 'OKC', 'DAL'],
        'G': [6, 7, 60, 52, 81]})

    sorter = ['TOT', 'ATL', 'BOS', 'BRK', 'CHA', 'CHH', 'CHI', 'CLE', 'DAL', 'DEN',
              'DET', 'GSW', 'HOU', 'IND', 'LAC', 'LAL', 'MEM', 'MIA', 'MIL',
              'MIN', 'NJN', 'NOH', 'NOK', 'NOP', 'NYK', 'OKC', 'ORL', 'PHI',
              'PHO', 'POR', 'SAC', 'SAS', 'SEA', 'TOR', 'UTA', 'VAN',
              'WAS', 'WSB']

    # Turn the sorter into a lookup table: the index becomes the sort number
    x = pd.DataFrame({'Tm': sorter})
    x.index = x.index.set_names('number')
    x = x.reset_index()

    # Attach the sort number to every row, sort on it, then drop it again
    df = pd.merge(df, x, how='left', on='Tm')
    df.sort_values(['Player', 'Year', 'number'],
                   ascending=[True, True, True], inplace=True)
    df.drop('number', axis=1, inplace=True)
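
With the left merge, a team that is missing from the sorter simply gets no sort number. The sketch below shows one way to spot such rows before sorting; the demo rows and the placeholder team code "XXX" are invented for illustration.

```python
import pandas as pd

# Made-up rows; "XXX" is deliberately absent from the sorter.
demo = pd.DataFrame({
    "Player": ["A. Example", "B. Sample", "C. Placeholder"],
    "Tm": ["DAL", "XXX", "TOT"],
})
sorter = ["TOT", "ATL", "DAL"]  # shortened list for the demo

# Lookup table: each team with its position in the custom list.
order = pd.DataFrame({"Tm": sorter, "number": range(len(sorter))})

# The left merge keeps every row of demo; unknown teams get NaN as their number.
merged = demo.merge(order, how="left", on="Tm")
unranked = merged.loc[merged["number"].isna(), "Tm"].tolist()

# NaN sort numbers are placed at the end, then the helper column is dropped.
result = merged.sort_values("number", na_position="last").drop(columns="number")
print(unranked)
print(result)
```

Checking `unranked` first makes it easy to notice a typo or a new team code instead of silently pushing those rows to the bottom.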