Pandas: groupby and variable weights

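Question:

I have a dataset with a weight for each observation, and I want to prepare weighted summaries using groupby, but I am rusty on how best to do this. I think it implies a custom aggregation function. My problem is how to properly handle data that is group-level rather than item-level. Perhaps that means it is best to do this in steps rather than in one go.

In pseudocode, I am looking for: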

#first, calculate weighted value
for each row:
  weighted jobs = weight * jobs
#then, for each city, sum the weighted values and divide by the sum of the weights
for each city:
  sum(weighted jobs)/sum(weight)

I am not sure how to work the "for each city" part into a custom aggregation function, or how to access the group-level summaries.

Mock data:

import pandas as pd
import numpy as np
np.random.seed(43)

## prep mock data
N = 100
industry = ['utilities','sales','real estate','finance']
city = ['sf','san mateo','oakland']
weight = np.random.randint(low=5,high=40,size=N)
jobs = np.random.randint(low=1,high=20,size=N)
ind = np.random.choice(industry, N)
cty = np.random.choice(city, N)
df_city = pd.DataFrame({'industry': ind, 'city': cty, 'weight': weight, 'jobs': jobs})

Answer:

Simply multiply the two columns:

In [11]: df_city['weighted_jobs'] = df_city['weight'] * df_city['jobs']

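Now you can group by city and take the sum. Selecting the numeric columns explicitly keeps the string column industry out of the result:

In [12]: df_city_sums = df_city.groupby('city')[['jobs', 'weight', 'weighted_jobs']].sum()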

In [13]: df_city_sums
Out[13]: 
           jobs  weight  weighted_jobs
city                                  
oakland     362     690           7958
san mateo   367    1017           9026
sf          253     638           6209

[3 rows x 3 columns]

Now you can divide the sum of the weighted jobs by the sum of the weights to get the desired result:

In [14]: df_city_sums['weighted_jobs'] / df_city_sums['weight']
Out[14]: 
city
oakland      11.533333
san mateo     8.875123
sf            9.731975
dtype: float64
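
If you would rather do this in one go with a custom aggregation function, as the question suggests, one possible sketch is to pass a small helper to groupby(...).apply and let np.average do the weighting. The helper name weighted_mean and the tiny hand-made frame below are assumptions for illustration only; they are not the mock data from the question.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the question's mock data)
df = pd.DataFrame({
    'city':   ['sf', 'sf', 'oakland', 'oakland'],
    'weight': [10, 30, 5, 15],
    'jobs':   [4, 8, 2, 6],
})

def weighted_mean(group):
    # np.average with weights computes sum(weight * jobs) / sum(weight)
    return np.average(group['jobs'], weights=group['weight'])

# Select only the columns the helper needs, then apply it once per city
result = df.groupby('city')[['jobs', 'weight']].apply(weighted_mean)
print(result)
```

The helper receives each city's rows as a DataFrame, so it can see both the values and their weights at the same time. The multiply-then-sum approach above is usually faster on large data because it avoids calling a Python function for every group.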
