{ "cells": [ { "cell_type": "markdown", "id": "dc808241-9a7c-4272-a318-fc2a8580ae87", "metadata": {}, "source": [ "# Pandas" ] }, { "cell_type": "code", "execution_count": 1, "id": "1d0f66fd-c46d-45a1-8b5f-04aaf57f8010", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "70640ee6-1d48-4d7f-a12d-ebb456874f47", "metadata": { "tags": [] }, "source": [ "### agg" ] }, { "cell_type": "markdown", "id": "aca60913-823b-4d3b-8d9b-6f2852bca160", "metadata": {}, "source": [ "**Rename the newly created columns while aggregating:**" ] }, { "cell_type": "code", "execution_count": 2, "id": "07787439-cee5-43da-813f-41a60ff536aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
kindheightweight
0cat9.17.9
1dog6.07.5
2cat9.59.9
3dog34.0198.0
\n", "
" ], "text/plain": [ " kind height weight\n", "0 cat 9.1 7.9\n", "1 dog 6.0 7.5\n", "2 cat 9.5 9.9\n", "3 dog 34.0 198.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],\n", " 'height': [9.1, 6.0, 9.5, 34.0],\n", " 'weight': [7.9, 7.5, 9.9, 198.0]})\n", "df" ] }, { "cell_type": "markdown", "id": "570e0c8a-d86e-4474-863f-4df47d5388d7", "metadata": {}, "source": [ "**Aggregate over multiple columns at once, and rename the results:**" ] }, { "cell_type": "code", "execution_count": 15, "id": "2389cc9c-38ac-422d-a699-a863b9b64032", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
worst textureworst areatarget
017.332019.00
123.411956.00
225.531709.00
326.50567.70
416.671575.00
............
56426.402027.00
56538.251731.00
56634.121124.00
56739.421821.00
56830.37268.61
\n", "

569 rows × 3 columns

\n", "
" ], "text/plain": [ " worst texture worst area target\n", "0 17.33 2019.0 0\n", "1 23.41 1956.0 0\n", "2 25.53 1709.0 0\n", "3 26.50 567.7 0\n", "4 16.67 1575.0 0\n", ".. ... ... ...\n", "564 26.40 2027.0 0\n", "565 38.25 1731.0 0\n", "566 34.12 1124.0 0\n", "567 39.42 1821.0 0\n", "568 30.37 268.6 1\n", "\n", "[569 rows x 3 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.datasets import load_breast_cancer\n", " \n", "# the dataset is loaded into a DataFrame\n", "cancer_data = load_breast_cancer()\n", "df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)\n", "df['target'] = pd.Series(cancer_data.target)\n", "df1 = df[['worst texture', 'worst area', 'target']]\n", "df1" ] }, { "cell_type": "code", "execution_count": 17, "id": "ae5b65f1-2caa-44a5-8c85-032f17cdd169", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
worst textureworst areatargetMean worst textureMean worst area
017.332019.0029.3182081422.286321
123.411956.0029.3182081422.286321
225.531709.0029.3182081422.286321
326.50567.7029.3182081422.286321
416.671575.0029.3182081422.286321
..................
56426.402027.0029.3182081422.286321
56538.251731.0029.3182081422.286321
56634.121124.0029.3182081422.286321
56739.421821.0029.3182081422.286321
56830.37268.6123.515070558.899440
\n", "

569 rows × 5 columns

\n", "
" ], "text/plain": [ " worst texture worst area target Mean worst texture Mean worst area\n", "0 17.33 2019.0 0 29.318208 1422.286321\n", "1 23.41 1956.0 0 29.318208 1422.286321\n", "2 25.53 1709.0 0 29.318208 1422.286321\n", "3 26.50 567.7 0 29.318208 1422.286321\n", "4 16.67 1575.0 0 29.318208 1422.286321\n", ".. ... ... ... ... ...\n", "564 26.40 2027.0 0 29.318208 1422.286321\n", "565 38.25 1731.0 0 29.318208 1422.286321\n", "566 34.12 1124.0 0 29.318208 1422.286321\n", "567 39.42 1821.0 0 29.318208 1422.286321\n", "568 30.37 268.6 1 23.515070 558.899440\n", "\n", "[569 rows x 5 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def meanofTargets(group1):\n", " # compute each group's means and broadcast them back onto every row\n", " wt = group1['worst texture'].agg('mean')\n", " wa = group1['worst area'].agg('mean')\n", " group1['Mean worst texture'] = wt\n", " group1['Mean worst area'] = wa\n", " return group1\n", "df2 = df1.groupby('target').apply(meanofTargets)\n", "df2" ] }, { "cell_type": "markdown", "id": "e4b588d7-0f9f-457f-8d64-b325f02fcd49", "metadata": {}, "source": [ "**Analogy with dplyr and SQL aggregation**\n", "\n", "Implementing a SQL-style SUM() / SUM() under GROUP BY in pandas, while also naming the output column, is fairly involved. A small example:" ] }, { "cell_type": "code", "execution_count": 33, "id": "d2351097-a58d-4c8d-acb3-848b773a1465", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
taskexperimentsstrategyn_layer_droplayer_dropNum examplesinference_timetrain_timen_epochtrain_time_per_epoch...inference_throughputtrain_time_per_epoch_baselineinference_latency_baselineinference_throughput_baselinediff_train_time_per_epochpercentage_diff_train_time_per_epochdiff_inference_latencypercentage_diff_inference_latencydiff_inference_throughputpercentage_diff_inference_throughput
0colaRemove LayersBottom Drop20,11043.09.267023326.4418593108.813953...112.549625138.7389820.008906112.284841-29.925029-0.275011-0.000021-0.0023580.2647840.002353
1colaRemove LayersBottom Drop40,1,2,31043.09.261514262.275043387.425014...112.616577138.7389820.008906112.284841-51.313968-0.586948-0.000026-0.0029540.3317360.002946
2colaRemove LayersBottom Drop60,1,2,3,4,51043.09.268895198.315210366.105070...112.526899138.7389820.008906112.284841-72.633912-1.098765-0.000019-0.0021560.2420580.002151
3colaRemove LayersSymmetric Drop25,61043.09.273199326.1411963108.713732...112.474666138.7389820.008906112.284841-30.025250-0.276186-0.000015-0.0016910.1898250.001688
4colaRemove LayersSymmetric Drop44,5,6,71043.09.271496262.707250387.569083...112.495329138.7389820.008906112.284841-51.169899-0.584337-0.000017-0.0018750.2104880.001871
\n", "

5 rows × 21 columns

\n", "
" ], "text/plain": [ " task experiments strategy n_layer_drop layer_drop \\\n", "0 cola Remove Layers Bottom Drop 2 0,1 \n", "1 cola Remove Layers Bottom Drop 4 0,1,2,3 \n", "2 cola Remove Layers Bottom Drop 6 0,1,2,3,4,5 \n", "3 cola Remove Layers Symmetric Drop 2 5,6 \n", "4 cola Remove Layers Symmetric Drop 4 4,5,6,7 \n", "\n", " Num examples inference_time train_time n_epoch train_time_per_epoch \\\n", "0 1043.0 9.267023 326.441859 3 108.813953 \n", "1 1043.0 9.261514 262.275043 3 87.425014 \n", "2 1043.0 9.268895 198.315210 3 66.105070 \n", "3 1043.0 9.273199 326.141196 3 108.713732 \n", "4 1043.0 9.271496 262.707250 3 87.569083 \n", "\n", " ... inference_throughput train_time_per_epoch_baseline \\\n", "0 ... 112.549625 138.738982 \n", "1 ... 112.616577 138.738982 \n", "2 ... 112.526899 138.738982 \n", "3 ... 112.474666 138.738982 \n", "4 ... 112.495329 138.738982 \n", "\n", " inference_latency_baseline inference_throughput_baseline \\\n", "0 0.008906 112.284841 \n", "1 0.008906 112.284841 \n", "2 0.008906 112.284841 \n", "3 0.008906 112.284841 \n", "4 0.008906 112.284841 \n", "\n", " diff_train_time_per_epoch percentage_diff_train_time_per_epoch \\\n", "0 -29.925029 -0.275011 \n", "1 -51.313968 -0.586948 \n", "2 -72.633912 -1.098765 \n", "3 -30.025250 -0.276186 \n", "4 -51.169899 -0.584337 \n", "\n", " diff_inference_latency percentage_diff_inference_latency \\\n", "0 -0.000021 -0.002358 \n", "1 -0.000026 -0.002954 \n", "2 -0.000019 -0.002156 \n", "3 -0.000015 -0.001691 \n", "4 -0.000017 -0.001875 \n", "\n", " diff_inference_throughput percentage_diff_inference_throughput \n", "0 0.264784 0.002353 \n", "1 0.331736 0.002946 \n", "2 0.242058 0.002151 \n", "3 0.189825 0.001688 \n", "4 0.210488 0.001871 \n", "\n", "[5 rows x 21 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "layer_drop_time = pd.read_pickle('data/pandas_example_layer_drop_time.pickle')\n", "def get_speedup(df):\n", " df_output = df[['task', 
'n_layer_drop']].iloc[0]\n", " df_output['speedup'] = sum(df['inference_latency']) / sum(df['inference_latency_baseline'])\n", " return df_output\n", "layer_drop_time.head(5)" ] }, { "cell_type": "code", "execution_count": 32, "id": "0e59ef8d-dd68-441a-81d7-ecc0fd7a2505", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
taskn_layer_dropspeedup
0cola20.998095
1cola40.997573
2cola60.997965
3mrpc20.998157
4mrpc41.000850
5mrpc61.000129
6rte20.837115
7rte40.721320
8rte60.543230
9sst-220.998929
10sst-240.998318
11sst-260.998895
12sts-b21.347481
13sts-b41.056215
14sts-b60.805969
15wnli20.819518
16wnli40.660700
17wnli60.539791
\n", "
" ], "text/plain": [ " task n_layer_drop speedup\n", "0 cola 2 0.998095\n", "1 cola 4 0.997573\n", "2 cola 6 0.997965\n", "3 mrpc 2 0.998157\n", "4 mrpc 4 1.000850\n", "5 mrpc 6 1.000129\n", "6 rte 2 0.837115\n", "7 rte 4 0.721320\n", "8 rte 6 0.543230\n", "9 sst-2 2 0.998929\n", "10 sst-2 4 0.998318\n", "11 sst-2 6 0.998895\n", "12 sts-b 2 1.347481\n", "13 sts-b 4 1.056215\n", "14 sts-b 6 0.805969\n", "15 wnli 2 0.819518\n", "16 wnli 4 0.660700\n", "17 wnli 6 0.539791" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "layer_drop_time.groupby(['task', 'n_layer_drop'], as_index=False).apply(get_speedup).reset_index(drop=True)" ] }, { "cell_type": "markdown", "id": "41f61118-3a10-4209-92e1-731b700c0369", "metadata": {}, "source": [ "- The `df` passed into `apply` here is each shuffled group itself, and it still contains the group-key columns,\n", "\n", "- so we can write a reduce-like function, but its first step must deduplicate the group-key columns (without that, this turns into a convoluted version of `transform`)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 5 }
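The rename-during-aggregation idea from the `agg` section can also be written in a single step with pandas named aggregation (`groupby(...).agg(new_name=(column, func))`, available since pandas 0.25). A minimal sketch reusing the cat/dog frame from the first cell; the output names `min_height` / `max_weight` are arbitrary choices for illustration:

```python
import pandas as pd

df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                   'height': [9.1, 6.0, 9.5, 34.0],
                   'weight': [7.9, 7.5, 9.9, 198.0]})

# Named aggregation: each keyword becomes an output column name,
# and the (column, func) tuple says what to aggregate and how.
summary = df.groupby('kind').agg(
    min_height=('height', 'min'),
    max_weight=('weight', 'max'),
)
print(summary)
```

This avoids the separate rename step entirely, which is the closest pandas analogue to dplyr's `summarise(min_height = min(height))`.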
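The reduce-like pattern from the closing note (take the group keys once per group, then collapse the rest to a single number) can be sketched on a tiny synthetic frame. The data below is invented for illustration, and this assumes pandas 2.x behavior, where `apply` still passes the grouping columns through to the function:

```python
import pandas as pd

df = pd.DataFrame({'task': ['cola', 'cola', 'rte', 'rte'],
                   'n_layer_drop': [2, 2, 4, 4],
                   'inference_latency': [1.0, 2.0, 3.0, 4.0],
                   'inference_latency_baseline': [2.0, 2.0, 4.0, 4.0]})

def get_speedup(group):
    # The group still carries its key columns: deduplicate them by taking
    # the first row's values, then reduce the rest to one ratio per group.
    out = group[['task', 'n_layer_drop']].iloc[0]
    out['speedup'] = (group['inference_latency'].sum()
                      / group['inference_latency_baseline'].sum())
    return out

res = (df.groupby(['task', 'n_layer_drop'], as_index=False)
         .apply(get_speedup)
         .reset_index(drop=True))
print(res)
```

Because `get_speedup` returns a Series, `apply` stacks the results into one row per group, which is exactly the GROUP BY + SUM()/SUM() shape; a plain `transform` could not do this, since it must return output aligned to every input row.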
" ], "text/plain": [ " task n_layer_drop speedup\n", "0 cola 2 0.998095\n", "1 cola 4 0.997573\n", "2 cola 6 0.997965\n", "3 mrpc 2 0.998157\n", "4 mrpc 4 1.000850\n", "5 mrpc 6 1.000129\n", "6 rte 2 0.837115\n", "7 rte 4 0.721320\n", "8 rte 6 0.543230\n", "9 sst-2 2 0.998929\n", "10 sst-2 4 0.998318\n", "11 sst-2 6 0.998895\n", "12 sts-b 2 1.347481\n", "13 sts-b 4 1.056215\n", "14 sts-b 6 0.805969\n", "15 wnli 2 0.819518\n", "16 wnli 4 0.660700\n", "17 wnli 6 0.539791" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "layer_drop_time.groupby(['task', 'n_layer_drop'], as_index=False).apply(get_speedup).reset_index(drop=True)" ] }, { "cell_type": "markdown", "id": "41f61118-3a10-4209-92e1-731b700c0369", "metadata": {}, "source": [ "- 这个地方apply传入的`df`其实就是shuffle好的每个group,而且包含了group key,\n", "\n", "- 我们可以写一个类似reduce的函数,但第一步需要先把group的column给deduplicate了~(不做的话就写成一个复杂版的transform的了)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" } }, "nbformat": 4, "nbformat_minor": 5 }