Pandas

import pandas as pd
import numpy as np

agg

Renaming the newly created columns while aggregating:

df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]})
df
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0
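
With named aggregation (available since pandas 0.25), the renaming happens directly inside agg: each keyword argument names an output column and maps it to an (input column, aggregation) pair. A minimal sketch on the frame above:

# output column name = (input column, aggregation function)
df.groupby('kind').agg(
    min_height=('height', 'min'),
    max_weight=('weight', 'max'),
)
#       min_height  max_weight
# kind
# cat          9.1         9.9
# dog          6.0       198.0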

Aggregating over multiple columns at once, again with renaming:

from sklearn.datasets import load_breast_cancer

# load the breast-cancer data into a DataFrame
cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = pd.Series(cancer_data.target)
df1 = df[['worst texture', 'worst area', 'target']]
df1
     worst texture  worst area  target
0            17.33      2019.0       0
1            23.41      1956.0       0
2            25.53      1709.0       0
3            26.50       567.7       0
4            16.67      1575.0       0
..             ...         ...     ...
564          26.40      2027.0       0
565          38.25      1731.0       0
566          34.12      1124.0       0
567          39.42      1821.0       0
568          30.37       268.6       1

569 rows × 3 columns
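
As an aside, newer scikit-learn can hand the DataFrame back directly, which shortens the loading step; a small sketch (assumes scikit-learn >= 0.23, where as_frame=True is available):

# .frame already contains the feature columns plus 'target'
df = load_breast_cancer(as_frame=True).frame
df1 = df[['worst texture', 'worst area', 'target']]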

def meanofTargets(group1):
    # compute each group's means, then broadcast them back onto every row of the group
    wt = group1['worst texture'].agg('mean')
    wa = group1['worst area'].agg('mean')
    group1['Mean worst texture'] = wt
    group1['Mean worst area'] = wa
    return group1

df2 = df1.groupby('target').apply(meanofTargets)
df2
     worst texture  worst area  target  Mean worst texture  Mean worst area
0            17.33      2019.0       0           29.318208      1422.286321
1            23.41      1956.0       0           29.318208      1422.286321
2            25.53      1709.0       0           29.318208      1422.286321
3            26.50       567.7       0           29.318208      1422.286321
4            16.67      1575.0       0           29.318208      1422.286321
..             ...         ...     ...                 ...              ...
564          26.40      2027.0       0           29.318208      1422.286321
565          38.25      1731.0       0           29.318208      1422.286321
566          34.12      1124.0       0           29.318208      1422.286321
567          39.42      1821.0       0           29.318208      1422.286321
568          30.37       268.6       1           23.515070       558.899440

569 rows × 5 columns
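
Broadcasting a per-group statistic back onto every row is exactly what transform is for, so the custom apply function above can be avoided; a shorter sketch on the same df1 (one way among several, not the only one):

# equivalent result without a custom apply function
df2 = df1.copy()
grouped = df1.groupby('target')
df2['Mean worst texture'] = grouped['worst texture'].transform('mean')
df2['Mean worst area'] = grouped['worst area'].transform('mean')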

Aggregation compared with dplyr and SQL

Reproducing SQL's SUM() / SUM() inside a GROUP BY in pandas, while also naming the output column, takes a fair amount of ceremony. A small example:

layer_drop_time = pd.read_pickle('data/pandas_example_layer_drop_time.pickle')
def get_speedup(df):
    # df is one group; take the group-key columns from its first row ...
    df_output = df[['task', 'n_layer_drop']].iloc[0]
    # ... then attach the SUM() / SUM() ratio under a name of our choosing
    df_output['speedup'] = df['inference_latency'].sum() / df['inference_latency_baseline'].sum()
    return df_output
layer_drop_time.head(5)
   task    experiments        strategy  n_layer_drop   layer_drop  Num examples
0  cola  Remove Layers     Bottom Drop             2          0,1        1043.0
1  cola  Remove Layers     Bottom Drop             4      0,1,2,3        1043.0
2  cola  Remove Layers     Bottom Drop             6  0,1,2,3,4,5        1043.0
3  cola  Remove Layers  Symmetric Drop             2          5,6        1043.0
4  cola  Remove Layers  Symmetric Drop             4      4,5,6,7        1043.0

   inference_time  train_time  n_epoch  train_time_per_epoch  ...  inference_throughput
0        9.267023  326.441859        3            108.813953  ...            112.549625
1        9.261514  262.275043        3             87.425014  ...            112.616577
2        9.268895  198.315210        3             66.105070  ...            112.526899
3        9.273199  326.141196        3            108.713732  ...            112.474666
4        9.271496  262.707250        3             87.569083  ...            112.495329

   train_time_per_epoch_baseline  inference_latency_baseline  inference_throughput_baseline
0                     138.738982                    0.008906                     112.284841
1                     138.738982                    0.008906                     112.284841
2                     138.738982                    0.008906                     112.284841
3                     138.738982                    0.008906                     112.284841
4                     138.738982                    0.008906                     112.284841

   diff_train_time_per_epoch  percentage_diff_train_time_per_epoch  diff_inference_latency
0                 -29.925029                             -0.275011               -0.000021
1                 -51.313968                             -0.586948               -0.000026
2                 -72.633912                             -1.098765               -0.000019
3                 -30.025250                             -0.276186               -0.000015
4                 -51.169899                             -0.584337               -0.000017

   percentage_diff_inference_latency  diff_inference_throughput  percentage_diff_inference_throughput
0                          -0.002358                   0.264784                              0.002353
1                          -0.002954                   0.331736                              0.002946
2                          -0.002156                   0.242058                              0.002151
3                          -0.001691                   0.189825                              0.001688
4                          -0.001875                   0.210488                              0.001871

5 rows × 21 columns

layer_drop_time.groupby(['task', 'n_layer_drop'], as_index=False).apply(get_speedup).reset_index(drop=True)
     task  n_layer_drop   speedup
0    cola             2  0.998095
1    cola             4  0.997573
2    cola             6  0.997965
3    mrpc             2  0.998157
4    mrpc             4  1.000850
5    mrpc             6  1.000129
6     rte             2  0.837115
7     rte             4  0.721320
8     rte             6  0.543230
9   sst-2             2  0.998929
10  sst-2             4  0.998318
11  sst-2             6  0.998895
12  sts-b             2  1.347481
13  sts-b             4  1.056215
14  sts-b             6  0.805969
15   wnli             2  0.819518
16   wnli             4  0.660700
17   wnli             6  0.539791
  • The df passed into apply here is one ready-made group at a time, and it still carries the group-key columns.

  • So we can write a reduce-like function, but its first step has to deduplicate the group-key columns (skip that step and the function degenerates into a convoluted version of transform); a simpler alternative is sketched below.
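
For this specific SUM() / SUM() pattern there is a lighter route: aggregate both latency columns first, then form the ratio as an ordinary named column assignment. A sketch against the same layer_drop_time frame:

# sum both latency columns per group, then divide; the output name is just a column assignment
speedup = (
    layer_drop_time
    .groupby(['task', 'n_layer_drop'], as_index=False)
    [['inference_latency', 'inference_latency_baseline']]
    .sum()
)
speedup['speedup'] = speedup['inference_latency'] / speedup['inference_latency_baseline']
speedup = speedup[['task', 'n_layer_drop', 'speedup']]

This sidesteps apply entirely: no per-group Python function runs, and the group keys come back as ordinary columns thanks to as_index=False.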