Python pandas: return the indexes of common rows


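Question

Apologies if this is a fairly newbie question. I am trying to find which rows two DataFrames have in common. The return value should be the indexes of the rows in df2 that also appear in df1. My clumsy example: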

import pandas as pd

df1 = pd.DataFrame({'col1': ['cx', 'cx', 'cx2'], 'col2': [1, 4, 12]})
df1['col2'] = df1['col2'].map(str)
df2 = pd.DataFrame({'col1': ['cx', 'cx', 'cx', 'cx', 'cx2', 'cx2'], 'col2': [1, 3, 5, 10, 12, 12]})
df2['col2'] = df2['col2'].map(str)

# build one string key per row, e.g. 'cx_1'
df1['idx'] = df1[['col1', 'col2']].apply(lambda x: '_'.join(x), axis=1)
df2['idx'] = df2[['col1', 'col2']].apply(lambda x: '_'.join(x), axis=1)

# keep each row's original index so it survives the merge
df1['idx_values'] = df1.index.values
df2['idx_values'] = df2.index.values

df3 = pd.merge(df1, df2, on='idx')
myindexes = df3['idx_values_y']

# idir is the output directory, defined elsewhere
myindexes.to_csv(idir + 'test.txt', sep='\t', index=False)

The return value should be [0, 4, 5]. Doing this efficiently would be great, since both DataFrames will have several million rows.


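Answer:

There is no need for a new column of joined values. By default, merge performs an inner join, here on both columns. If you need the values of df2.index, add reset_index: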

import pandas as pd

df1 = pd.DataFrame({'col1': ['cx', 'cx', 'cx2'], 'col2': [1, 4, 12]})
df2 = pd.DataFrame({'col1': ['cx', 'cx', 'cx', 'cx', 'cx2', 'cx2'], 'col2': [1, 3, 5, 10, 12, 12]})

# reset_index() turns df2's index into a regular column named 'index'
df3 = pd.merge(df1, df2.reset_index(), on=['col1', 'col2'])
print(df3)
  col1  col2  index
0   cx     1      0
1  cx2    12      4
2  cx2    12      5
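
To get the plain list the question asks for, the 'index' column of the merged frame can be converted with tolist(). A minimal sketch, assuming the same sample data as above (the name common_idx is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['cx', 'cx', 'cx2'], 'col2': [1, 4, 12]})
df2 = pd.DataFrame({'col1': ['cx', 'cx', 'cx', 'cx', 'cx2', 'cx2'],
                    'col2': [1, 3, 5, 10, 12, 12]})

# reset_index() moves df2's index into a column named 'index'
df3 = pd.merge(df1, df2.reset_index(), on=['col1', 'col2'])

# pull the matched df2 row labels out as a plain Python list
common_idx = df3['index'].tolist()
print(common_idx)  # [0, 4, 5]
```

Note that if df1 itself contains duplicate rows, each matching df2 label appears once per duplicate, so a deduplication step such as sorted(set(common_idx)) may be needed.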

If both indexes are needed:

df4 = pd.merge(df1.reset_index(), df2.reset_index(), on=['col1', 'col2'])
print(df4)

   index_x col1  col2  index_y
0        0   cx     1        0
1        2  cx2    12        4
2        2  cx2    12        5
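
When both indexes are kept, it can be handy to see which df2 rows belong to each df1 row. One possible sketch, assuming the same sample data, groups the merged frame by index_x (the name matches is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['cx', 'cx', 'cx2'], 'col2': [1, 4, 12]})
df2 = pd.DataFrame({'col1': ['cx', 'cx', 'cx', 'cx', 'cx2', 'cx2'],
                    'col2': [1, 3, 5, 10, 12, 12]})

df4 = pd.merge(df1.reset_index(), df2.reset_index(), on=['col1', 'col2'])

# for every df1 row label, collect the labels of the matching df2 rows
matches = df4.groupby('index_x')['index_y'].apply(list).to_dict()

# df1 row 0 matches df2 row 0; df1 row 2 matches df2 rows 4 and 5;
# df1 row 1 has no match, so it is absent from the result
print(matches)
```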

For only the intersection of the two DataFrames:

df5 = pd.merge(df1, df2, on=['col1', 'col2'])
# if both DataFrames contain only these two columns, `on` can be omitted:
# df5 = pd.merge(df1, df2)
print(df5)

  col1  col2
0   cx     1
1  cx2    12
2  cx2    12
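
The plain merge above discards the original row labels. As an alternative that is not part of the answer above, a boolean mask built with MultiIndex.isin selects the common rows while keeping df2's own index. This is only a sketch on the same sample data; whether it beats merge on several million rows has not been measured here and would need benchmarking.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['cx', 'cx', 'cx2'], 'col2': [1, 4, 12]})
df2 = pd.DataFrame({'col1': ['cx', 'cx', 'cx', 'cx', 'cx2', 'cx2'],
                    'col2': [1, 3, 5, 10, 12, 12]})

cols = ['col1', 'col2']

# treat each (col1, col2) pair as one composite key
keys1 = pd.MultiIndex.from_frame(df1[cols])
keys2 = pd.MultiIndex.from_frame(df2[cols])

# True for every df2 row whose key also occurs in df1
mask = keys2.isin(keys1)

common_rows = df2[mask]                  # rows keep their df2 labels
common_idx = common_rows.index.tolist()
print(common_idx)  # [0, 4, 5]
```

Unlike the merge, this yields each df2 label at most once, even when df1 contains duplicate rows.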