将表转换为存在/不存在矩阵python
问题内容:
从两个表格文件中:
file1.txt
name1 house1
name2 house1
name3 house1
name4 house2
name5 house2
name6 house2
和file2.txt
name1 car
name2 bike
name3 skate
name4 car
name5 motorcycle
name6 boat
我想使用来自两个字典的信息来创建这样的存在/不在场矩阵。
car motorcycle bike boat skate
house1 1 0 1 0 1
house2 1 1 0 1 0
这是我的代码:
import pandas as pd
with open('file1.txt', 'r') as file1:
col_names = ['name', 'house']
df1 = pd.read_csv(file1, sep='\t', header=None, names=col_names)
with open('file2.txt', 'r') as file2:
col_names = ['name', 'transport']
df2 = pd.read_csv(file2, sep='\t', header=None, names=col_names)
# include the values from df1 into the df2 creating a new column
df2['house'] = df2['name'].map(df1.set_index('name')['house'])
g = df2.groupby('house')['transport'].apply(list).reset_index()
g.join(pd.get_dummies(g['transport'].apply(pd.Series).stack()).sum(level=0)).drop('transport', 1)
print g
这样做,我获得了以下输出:
house transport
0 house1 [car, bike, skate]
1 house2 [car, motorcycle, boat]
问题答案:
这是一种方法。
设定
# df1
individual house
0 name1 house1
1 name2 house1
2 name3 house1
3 name4 house2
4 name5 house2
5 name6 house2
# df2
individual transport
0 name1 car
1 name2 bike
2 name3 skate
3 name4 car
4 name5 motorcycle
5 name6 boat
解
df2['house'] = df2['individual'].map(df1.set_index('individual')['house'])
g = df2.groupby('house')['transport'].apply(list).reset_index()
g.join(pd.get_dummies(g['transport']\
.apply(pd.Series).stack()).sum(level=0)).drop('transport', 1)
结果
house bike boat car motorcycle skate
0 house1 1 0 1 0 1
1 house2 0 1 1 1 0
说明
共有3个步骤:
- 将“房子”列添加到
df2
,从映射df1
。 - 集体运输。
- 使用
pd.get_dummies
扩展transports
列空列。