Pandas
import pandas as pd
import numpy as np
agg
Renaming the newly created columns while aggregating:
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]})
df
| | kind | height | weight |
|---|---|---|---|
| 0 | cat | 9.1 | 7.9 |
| 1 | dog | 6.0 | 7.5 |
| 2 | cat | 9.5 | 9.9 |
| 3 | dog | 34.0 | 198.0 |
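The renaming itself can be done with named aggregation (available since pandas 0.25), where each keyword argument becomes the name of an output column. A minimal sketch using the df above; the output names min_height and average_weight are only illustrative:

# named aggregation: keyword = output column name, value = (input column, aggfunc)
df.groupby('kind').agg(
    min_height=('height', 'min'),
    average_weight=('weight', 'mean'),
)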
Aggregating over multiple columns at once, and renaming the results:
from sklearn import datasets
from sklearn.datasets import load_breast_cancer
# data is loaded in a DataFrame
cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
df['target'] = pd.Series(cancer_data.target)
df1 = df[['worst texture', 'worst area', 'target']]
df1
| | worst texture | worst area | target |
|---|---|---|---|
| 0 | 17.33 | 2019.0 | 0 |
| 1 | 23.41 | 1956.0 | 0 |
| 2 | 25.53 | 1709.0 | 0 |
| 3 | 26.50 | 567.7 | 0 |
| 4 | 16.67 | 1575.0 | 0 |
| ... | ... | ... | ... |
| 564 | 26.40 | 2027.0 | 0 |
| 565 | 38.25 | 1731.0 | 0 |
| 566 | 34.12 | 1124.0 | 0 |
| 567 | 39.42 | 1821.0 | 0 |
| 568 | 30.37 | 268.6 | 1 |
569 rows × 3 columns
def meanofTargets(group1):
    # per-group means of the two feature columns
    wt = group1['worst texture'].agg('mean')
    wa = group1['worst area'].agg('mean')
    # broadcast the scalar means back onto every row of the group
    group1['Mean worst texture'] = wt
    group1['Mean worst area'] = wa
    return group1
df2 = df1.groupby('target').apply(meanofTargets)
df2
| | worst texture | worst area | target | Mean worst texture | Mean worst area |
|---|---|---|---|---|---|
| 0 | 17.33 | 2019.0 | 0 | 29.318208 | 1422.286321 |
| 1 | 23.41 | 1956.0 | 0 | 29.318208 | 1422.286321 |
| 2 | 25.53 | 1709.0 | 0 | 29.318208 | 1422.286321 |
| 3 | 26.50 | 567.7 | 0 | 29.318208 | 1422.286321 |
| 4 | 16.67 | 1575.0 | 0 | 29.318208 | 1422.286321 |
| ... | ... | ... | ... | ... | ... |
| 564 | 26.40 | 2027.0 | 0 | 29.318208 | 1422.286321 |
| 565 | 38.25 | 1731.0 | 0 | 29.318208 | 1422.286321 |
| 566 | 34.12 | 1124.0 | 0 | 29.318208 | 1422.286321 |
| 567 | 39.42 | 1821.0 | 0 | 29.318208 | 1422.286321 |
| 568 | 30.37 | 268.6 | 1 | 23.515070 | 558.899440 |
569 rows × 5 columns
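Since meanofTargets only broadcasts per-group means back onto every row, an equivalent result can be obtained with transform, which avoids mutating the group inside apply. A sketch under that assumption; df2_alt is an illustrative name:

# transform('mean') returns a Series aligned with df1's index,
# so the per-group means are broadcast to every row automatically
df2_alt = df1.copy()
df2_alt['Mean worst texture'] = df1.groupby('target')['worst texture'].transform('mean')
df2_alt['Mean worst area'] = df1.groupby('target')['worst area'].transform('mean')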
Analogy with aggregation in dplyr and SQL
Implementing SUM() / SUM() under a GROUP BY in pandas, while also naming the output column, is fairly involved. Here is a small example:
layer_drop_time = pd.read_pickle('data/pandas_example_layer_drop_time.pickle')
def get_speedup(df):
    # keep one row of the group keys (deduplicate the group columns)
    df_output = df[['task', 'n_layer_drop']].iloc[0]
    # SUM() / SUM() within the group
    df_output['speedup'] = sum(df['inference_latency']) / sum(df['inference_latency_baseline'])
    return df_output
layer_drop_time.head(5)
| | task | experiments | strategy | n_layer_drop | layer_drop | Num examples | inference_time | train_time | n_epoch | train_time_per_epoch | ... | inference_throughput | train_time_per_epoch_baseline | inference_latency_baseline | inference_throughput_baseline | diff_train_time_per_epoch | percentage_diff_train_time_per_epoch | diff_inference_latency | percentage_diff_inference_latency | diff_inference_throughput | percentage_diff_inference_throughput |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cola | Remove Layers | Bottom Drop | 2 | 0,1 | 1043.0 | 9.267023 | 326.441859 | 3 | 108.813953 | ... | 112.549625 | 138.738982 | 0.008906 | 112.284841 | -29.925029 | -0.275011 | -0.000021 | -0.002358 | 0.264784 | 0.002353 |
| 1 | cola | Remove Layers | Bottom Drop | 4 | 0,1,2,3 | 1043.0 | 9.261514 | 262.275043 | 3 | 87.425014 | ... | 112.616577 | 138.738982 | 0.008906 | 112.284841 | -51.313968 | -0.586948 | -0.000026 | -0.002954 | 0.331736 | 0.002946 |
| 2 | cola | Remove Layers | Bottom Drop | 6 | 0,1,2,3,4,5 | 1043.0 | 9.268895 | 198.315210 | 3 | 66.105070 | ... | 112.526899 | 138.738982 | 0.008906 | 112.284841 | -72.633912 | -1.098765 | -0.000019 | -0.002156 | 0.242058 | 0.002151 |
| 3 | cola | Remove Layers | Symmetric Drop | 2 | 5,6 | 1043.0 | 9.273199 | 326.141196 | 3 | 108.713732 | ... | 112.474666 | 138.738982 | 0.008906 | 112.284841 | -30.025250 | -0.276186 | -0.000015 | -0.001691 | 0.189825 | 0.001688 |
| 4 | cola | Remove Layers | Symmetric Drop | 4 | 4,5,6,7 | 1043.0 | 9.271496 | 262.707250 | 3 | 87.569083 | ... | 112.495329 | 138.738982 | 0.008906 | 112.284841 | -51.169899 | -0.584337 | -0.000017 | -0.001875 | 0.210488 | 0.001871 |
5 rows × 21 columns
layer_drop_time.groupby(['task', 'n_layer_drop'], as_index=False).apply(get_speedup).reset_index(drop=True)
| | task | n_layer_drop | speedup |
|---|---|---|---|
| 0 | cola | 2 | 0.998095 |
| 1 | cola | 4 | 0.997573 |
| 2 | cola | 6 | 0.997965 |
| 3 | mrpc | 2 | 0.998157 |
| 4 | mrpc | 4 | 1.000850 |
| 5 | mrpc | 6 | 1.000129 |
| 6 | rte | 2 | 0.837115 |
| 7 | rte | 4 | 0.721320 |
| 8 | rte | 6 | 0.543230 |
| 9 | sst-2 | 2 | 0.998929 |
| 10 | sst-2 | 4 | 0.998318 |
| 11 | sst-2 | 6 | 0.998895 |
| 12 | sts-b | 2 | 1.347481 |
| 13 | sts-b | 4 | 1.056215 |
| 14 | sts-b | 6 | 0.805969 |
| 15 | wnli | 2 | 0.819518 |
| 16 | wnli | 4 | 0.660700 |
| 17 | wnli | 6 | 0.539791 |
The df passed into apply here is effectively each group after the shuffle, and it still contains the group keys. We can therefore write a reduce-like function, but the first step has to deduplicate the group-key columns (otherwise it turns into a complicated version of transform).
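One way to avoid the apply-plus-deduplication dance is to compute the two sums with named aggregation and take the ratio afterwards. A sketch assuming the same layer_drop_time columns; speedup_alt and the intermediate column names are illustrative:

# sum the two latency columns per group, then divide to get SUM() / SUM()
speedup_alt = (
    layer_drop_time
    .groupby(['task', 'n_layer_drop'], as_index=False)
    .agg(latency_sum=('inference_latency', 'sum'),
         latency_sum_baseline=('inference_latency_baseline', 'sum'))
    .assign(speedup=lambda d: d['latency_sum'] / d['latency_sum_baseline'])
    [['task', 'n_layer_drop', 'speedup']]
)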