Cannot reindex from a duplicate axis: how to fix it

I am getting a ValueError: cannot reindex from a duplicate axis when I am trying to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.

Here is my session inside an ipdb trace. I have a DataFrame with a string index, integer columns, and float values. However, when I try to create a 'sums' row containing the sum of each column, I get a ValueError: cannot reindex from a duplicate axis error. I created a small DataFrame with the same characteristics, but was not able to reproduce the problem. What could I be missing?

I don’t really understand what ValueError: cannot reindex from a duplicate axis means. What does this error message mean? Maybe this will help me diagnose the problem, and this is the most answerable part of my question.

ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')

ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False

Here is the error:

ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis

I tried to reproduce this with a simple example, but I failed:

In [32]: import pandas as pd

In [33]: import numpy as np

In [34]: a = np.arange(35).reshape(5,7)

In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))

In [36]: df.values.dtype
Out[36]: dtype('int64')

In [37]: df.loc['sums'] = df.sum(axis=0)

In [38]: df
Out[38]: 
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
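
Note that the affinity_matrix index shown above contains the label u'047' twice. Duplicating one row label in the toy example is enough to reproduce the error (a sketch; the exact wording of the message can vary between pandas versions):

df2 = pd.DataFrame(a, ['x', 'y', 'u', 'u', 'w'], range(10, 17))  # 'u' appears twice
df2.loc['sums'] = df2.sum(axis=0)  # ValueError: cannot reindex from a duplicate axis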

Recently, I’ve been working with Pandas DataFrames that had a DateTime as the index. When I tried reindexing the DataFrame (using the reindex method), I bumped into an error. Let’s find out what causes it and how to solve it.

The Python error I’m talking about is:

ValueError: cannot reindex from a duplicate axis

A “duplicate axis”? My first assumption was that my DataFrame had the same index for the columns and the rows, which makes no sense.

Apparently, the Python error is the result of performing operations on a DataFrame that has duplicate index values. Operations that need to align values with the index require unique index values: joining with another DataFrame, reindexing a DataFrame, or resampling a DataFrame simply will not work.

It makes one wonder why Pandas even supports duplicate values in the index. Doing some research, I found out it is something the Pandas team actively contemplated:

If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

Unlike many other data wrangling libraries and solutions, Pandas acknowledges that the data you’ll be working with is messy. However, it wants to help you clean it up.

Test if an index contains duplicate values

Simply testing whether the index values in a Pandas DataFrame are unique is extremely easy. There is even a dedicated attribute for it:

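df.index.is_unique
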
This will return a boolean: True if the index is unique, False if there are duplicate values.

Test which values in an index are duplicate

To test which values in an index are duplicate, one can use the duplicated method, which returns an array of boolean values to identify if a value has been encountered more than once.

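For instance, a quick sketch (assuming a DataFrame named df, as in the rest of this post):

df.index.duplicated()  # boolean array, True for every repeated occurrence of an index label
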
Drop rows with duplicate index values

Using duplicated(), we can also remove rows whose index values are duplicates. With the following line of code, when multiple rows share the same index, only the first one encountered (going from top to bottom, in the order the DataFrame is stored) will remain. All the others will be deleted.

df.loc[~df.index.duplicated(), :]
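
By default the first occurrence of each label is the one kept. The duplicated() method also accepts a keep argument, so a variant that keeps the last occurrence instead would look like this:

df.loc[~df.index.duplicated(keep='last'), :]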

Prevent duplicate values in a DataFrame index

To make sure a Pandas DataFrame cannot contain duplicate values in the index, one can set a flag. Setting the allows_duplicate_labels flag to False will prevent the assignment of duplicate values.

df.flags.allows_duplicate_labels = False

Applying this flag to a DataFrame with duplicate values, or assigning duplicate values will result in the following error:

DuplicateLabelError: Index has duplicates.
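
A minimal sketch of the first case (assuming pandas 1.2 or newer, where the flag exists; the small DataFrame here is made up for the example):

import pandas as pd

dup_df = pd.DataFrame({'a': [1, 2]}, index=['x', 'x'])  # 'x' is a duplicate label
dup_df.flags.allows_duplicate_labels = False            # raises DuplicateLabelError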

Duplicate column names

Column names are indexes too. That’s why each of these methods also applies to columns.

df.columns.is_unique
df.columns.duplicated()
df.loc[:, ~df.columns.duplicated()]

By the way, I didn’t necessarily come up with this solution myself. Although I’m grateful you’ve visited this blog post, you should know I get a lot from websites like StackOverflow and I have a lot of coding books. One by Matt Harrison (on Pandas 1.x!) was updated in 2020 and is an absolute primer on Pandas basics. If you want something broad, ranging from data wrangling to machine learning, try “Mastering Pandas” by Stefanie Molin.

Say thanks, ask questions or give feedback

Technologies get updated, syntax changes and honestly… I make mistakes too. If something is incorrect, incomplete or doesn’t work, let me know in the comments below and help thousands of visitors.

Good luck on cleaning your data!


Table of Contents
  1. Verify if your DataFrame Index contains Duplicate values
  2. Test which values in an index are duplicates
  3. Drop rows with duplicate index values
  4. Prevent duplicate values in a DataFrame index
  5. Overwrite DataFrame index with a new one

In Python, you will usually get a ValueError: cannot reindex from a duplicate axis when you set an index to a specific value, or when reindexing or resampling the DataFrame using the reindex method.

If you look at the error message “cannot reindex from a duplicate axis”, it means that the Pandas DataFrame has duplicate index values. Hence, certain operations such as concatenating a DataFrame, reindexing a DataFrame, or resampling a DataFrame whose index has duplicate values will not work, and Python will throw a ValueError.
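
For example, a minimal sketch: reindexing a Series whose index contains a repeated label raises exactly this error (the wording varies slightly between pandas versions):

import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])  # 'a' is duplicated
s.reindex(['a', 'b', 'c'])                       # ValueError: cannot reindex from a duplicate axis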

Verify if your DataFrame Index contains Duplicate values

When you get this error, the first thing you need to do is to check the DataFrame index for duplicate values using the below code.

df.index.is_unique

The index.is_unique attribute returns a boolean value: True if the index values are unique, False otherwise.

Test which values in an index are duplicates

If you want to check which values in an index have duplicates, you can use the index.duplicated method as shown below.

df.index.duplicated()

The method returns an array of boolean values, in which every occurrence of a value after the first is marked as True.

idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
idx.duplicated()

Output

array([False, False,  True, False,  True])

Drop rows with duplicate index values

By using the same index.duplicated method, we can remove the duplicate values in the DataFrame using the following code.

It traverses the DataFrame from top to bottom and keeps only the first row for each index value; any later rows that repeat an index label are removed.

df.loc[~df.index.duplicated(), :]

Alternatively, you can also use the df.drop_duplicates() method as shown below; note that it drops rows with duplicate column values rather than duplicate index labels.

Consider a dataset containing ramen ratings.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0

Prevent duplicate values in a DataFrame index

If you want to ensure that a Pandas DataFrame does not contain duplicate values in the index, you can set a flag. Setting the allows_duplicate_labels flag to False will prevent the assignment of duplicate values.

df.flags.allows_duplicate_labels = False

Applying this flag to a DataFrame with duplicate values or assigning duplicate values will result in DuplicateLabelError: Index has duplicates.

Overwrite DataFrame index with a new one

Alternatively, to overwrite your current DataFrame index with a new one:

df.index = new_index

or, use .reset_index:

df.reset_index(level=0, inplace=True)

Remove inplace=True if you want the method to return a new DataFrame instead of modifying it in place.
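
If the duplicated index itself carries no information you need, a common sketch is to replace it with a fresh default RangeIndex; drop=True discards the old index instead of turning it into a column:

df = df.reset_index(drop=True)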


Cannot reindex DataFrame with duplicated axis

Let’s start by writing some simple Python code to define a pandas DataFrame. In reality, you will most probably be acquiring your data from an external file, database or API.

import pandas as pd

stamps = ['01-02-23', ['01-02-23', '01-02-24'], '01-02-24', '01-03-24', '02-03-24']
sales_team = ['North', 'South', 'West', 'East', 'South']
revenue = [109.0, 201.0, 156.0, 181.0, 117.0]

rev_df = pd.DataFrame(dict(time=stamps, team=sales_team, revenue=revenue))

print(rev_df)

We will get the following data set:

                   time   team  revenue
0              01-02-23  North    109.0
1  [01-02-23, 01-02-24]  South    201.0
2              01-02-24   West    156.0
3              01-03-24   East    181.0
4              02-03-24  South    117.0

As the time column contains a list, we will break the second row down into two different rows using the explode() function.

new_rev_df = rev_df.explode('time')
print(new_rev_df.head())

One feature of explode() is that it replicates indexes. We will get the following data:

       time   team  revenue
0  01-02-23  North    109.0
1  01-02-23  South    201.0
1  01-02-24  South    201.0
2  01-02-24   West    156.0
3  01-03-24   East    181.0

Trying to reindex the DataFrame will now fail with a ValueError exception, because the row index contains duplicate labels:

idx = ['time']
new_rev_df.reindex(idx)

The error message will be:

ValueError: cannot reindex on an axis with duplicate labels

I have also encountered this error when invoking the Seaborn library on data containing duplicated indexes.

Fixing the error

There are a couple of ways to circumvent this error message.

Aggregate the data

We can group the data with groupby and then save the result as a DataFrame. Note that with this option no data is discarded from your DataFrame; the rows are aggregated instead.

new_rev_df.groupby(['time','team']).revenue.sum().to_frame()

Remove Duplicated indexes

We can use the pandas loc indexer to get rid of any duplicated indexes. With this option, only the first occurrence of each duplicated index is kept and the later ones are removed.

dup_idx = new_rev_df.index.duplicated()
new_rev_df.loc[~dup_idx]

Note: In this tutorial we replicated the problem for cases in which the row index is duplicated. You might as well encounter this issue when working with datasets, typically wide ones, that include duplicated columns.

