How to Split Pandas DataFrame into Multiple DataFrames by Column Value in Python

In this tutorial, I will show you how to create a separate DataFrame for each value in a given column.

Use the pandas.DataFrame.groupby(column) method to group the DataFrame by the values found in the column named column.

grouped_df = df.groupby('Column')

This method returns a GroupBy object, not a DataFrame. You can see this yourself by printing it with print(grouped_df):

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7efe3ec55670>

With the GroupBy object from the previous step, call the DataFrameGroupBy.get_group(group) method on grouped_df.

This method returns a DataFrame of all the rows that have the value group in the column named column.

grouped_df.get_group('column_value')

Alternatively, you can call both methods in one line:

value_df = df.groupby('Column').get_group('column_value')
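To build one DataFrame per distinct value in a single pass, a minimal sketch is to iterate over the GroupBy object, which yields pairs of (group value, sub-DataFrame). The sample data and the column names Team and Score are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for this sketch.
df = pd.DataFrame({
    'Team': ['red', 'blue', 'red', 'green', 'blue'],
    'Score': [10, 20, 30, 40, 50],
})

# Iterating over a GroupBy object yields (group value, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per distinct value.
frames = {value: group for value, group in df.groupby('Team')}

print(sorted(frames))   # the distinct values found in 'Team'
print(frames['red'])    # same rows as df.groupby('Team').get_group('red')
```

Each entry in frames keeps the original row index, so you can still trace rows back to the source DataFrame.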

Return the n largest/smallest values for column in DataFrame

Get the rows with the n largest values in a particular column of a Pandas DataFrame:

df.nlargest(5, 'Gross')

Return the first n rows with the smallest values in a column of a DataFrame:

df.nsmallest(5, ['Age'])

To order by the smallest values in column “Age” and then “Salary”, we can specify multiple columns like in the next example.

df.nsmallest(5, ['Age', 'Salary'])
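As a small illustration of how the second column acts as a tie-breaker, here is a sketch with made-up data in which two people share the same age. The names and numbers are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for this sketch.
df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cat', 'Dan', 'Eve'],
    'Age': [30, 25, 25, 40, 22],
    'Salary': [500, 700, 300, 900, 400],
})

# Ben and Cat are both 25, so Salary decides which of them makes the cut.
youngest = df.nsmallest(2, ['Age', 'Salary'])
print(youngest)
```

Eve (22) is selected first on Age alone; of the two 25-year-olds, Cat has the lower Salary and takes the second slot.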

There is also an optional keep parameter for the nlargest and nsmallest methods. keep has three possible values: 'first', 'last', and 'all'. The default is 'first'.

Where there are duplicate values:

  • first : take the first occurrence.
  • last : take the last occurrence.
  • all : do not drop any duplicates, even if it means selecting more than n items.

df.nlargest(5, 'Gross', keep='last')
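The effect of keep only shows up when a tie falls on the cut-off. The sketch below uses made-up data (the Title and Gross values are hypothetical) in which three rows tie for second place.

```python
import pandas as pd

# Hypothetical sample data: three films tie at a Gross of 300.
df = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D', 'E'],
    'Gross': [500, 300, 300, 200, 300],
})

top_first = df.nlargest(2, 'Gross', keep='first')  # tie resolved by the earliest row
top_last = df.nlargest(2, 'Gross', keep='last')    # tie resolved by the latest row
top_all = df.nlargest(2, 'Gross', keep='all')      # every tied row is kept

print(top_first['Title'].tolist())
print(top_last['Title'].tolist())
print(len(top_all))
```

With keep='all' the result has four rows even though n is 2, because all three rows tied at 300 are retained.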

Working with a New Dataset / DataFrame

When you are working with a new Pandas DataFrame, these attributes and methods will give you insights into key aspects of the data.

The built-in dir function lets you look at all of the attributes that a Python object has.

dir(df)

The shape attribute returns a tuple of integers giving the number of elements along each dimension. A DataFrame is two-dimensional, so for N rows and M columns, shape will be (N, M).

df.shape
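Because shape is an ordinary tuple, it can be unpacked directly into a row count and a column count. A short sketch with made-up data:

```python
import pandas as pd

# Hypothetical sample data: 3 rows, 2 columns.
df = pd.DataFrame({
    'Name': ['Bob', 'Jane', 'Joe'],
    'Age': [20, 30, 40],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)
```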

You may be working with a dataframe that has hundreds or thousands of rows. To get a glimpse of the data inside a dataframe without printing out all of the values, you can use the head and tail methods.

The head method returns the first n rows of the dataframe (5 by default):

df.head() # returns rows 0-4
df.head(n) # returns the first n rows

The tail method returns the last n rows of the dataframe (5 by default):

df.tail()
df.tail(n)
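To make the defaults concrete, here is a small sketch on a ten-row DataFrame. The column name value is an assumption made for the example.

```python
import pandas as pd

# Hypothetical ten-row DataFrame holding the numbers 0 through 9.
df = pd.DataFrame({'value': range(10)})

first_five = df.head()    # no argument: the first 5 rows
first_three = df.head(3)  # the first 3 rows
last_two = df.tail(2)     # the last 2 rows

print(first_three)
print(last_two)
```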

The count method of a dataframe shows you the number of non-null entries in each column:

df.count()

Check whether any of the columns contain missing values:

pd.isnull(df).any()
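The following sketch puts count and isnull side by side on made-up data that contains a few missing values. Summing the result of pd.isnull is one common way to get a per-column count of missing entries; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with one missing Name and one missing Age.
df = pd.DataFrame({
    'Name': ['Bob', None, 'Jane'],
    'Age': [30, 25, None],
    'City': ['Oslo', 'Lima', 'Rome'],
})

non_null_counts = df.count()          # non-null entries per column
has_missing = pd.isnull(df).any()     # True for columns with at least one null
missing_counts = pd.isnull(df).sum()  # number of nulls per column

print(non_null_counts)
print(has_missing)
print(missing_counts)
```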

The info method of the dataframe prints a summary of the data. It tells you:

  1. The number of entries in the df
  2. The names of the columns
  3. The number of columns
  4. The number of non-null entries in each column
  5. The dtype of each column
  6. Whether a column contains null values (its non-null count is lower than the number of entries)

df.info()
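info prints its summary rather than returning it. If you want the text as a string, for example to log it, one option is to pass a buffer through the buf parameter. A minimal sketch with made-up data:

```python
import io

import pandas as pd

# Hypothetical sample data with one missing Age.
df = pd.DataFrame({
    'Name': ['Bob', 'Jane', 'Joe'],
    'Age': [20, None, 40],
})

# Redirect the summary into an in-memory text buffer instead of stdout.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
```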

Different Ways to Create Pandas DataFrames

A Pandas DataFrame is a 2D labeled data structure with columns of potentially different types.

There are a variety of different methods and syntaxes that can be used to create a pd.DataFrame.

Firstly, make sure you import the pandas module:

import pandas as pd

Method #1: Creating DataFrame from list of lists

# initialize list of lists
data = [['bob', 20], ['jane', 30], ['joe', 40]]
 
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Age'])
df

Output:

   Name  Age
0   bob   20
1  jane   30
2   joe   40

Method #2: Creating DataFrame from dictionary of lists

In this method, you define a dictionary in which each key is a column name and each value is a list of that column's row values.

# initialize dictionary of lists
data = {'Name': ['Bob', 'Joe', 'Jane', 'Jack'],
        'Age': [30, 30, 21, 40]}
 
# Create DataFrame
df = pd.DataFrame(data)
df

Output:

   Name  Age
0   Bob   30
1   Joe   30
2  Jane   21
3  Jack   40

You can use custom index values for the DataFrame by setting the optional index parameter of pd.DataFrame to a list of strings, one for each row.

df = pd.DataFrame(data, index=['first',
                                'second',
                                'third',
                                'fourth'])
df

Output:

        Name  Age
first    Bob   30
second   Joe   30
third   Jane   21
fourth  Jack   40

In the same way that we just defined the index values, you can also define the column names separately. Set the optional columns parameter of pd.DataFrame to a list of strings for the column names.

Notice that the row values are now defined as a list of lists rather than a dictionary of lists. This is because the column names are no longer supplied as dictionary keys.

df = pd.DataFrame(
    [[4,5,6],
     [7,8,9],
     [10,11,12]],
    index = ['row_one','row_two','row_three'],
    columns=["a","b","c"]
    )

df

Output:

            a   b   c
row_one     4   5   6
row_two     7   8   9
row_three  10  11  12

Method #3: Creating DataFrame using zip() function.

The zip function returns an iterator of tuples in which the corresponding items from each passed iterable are paired together. By calling the list function on the object returned from zip, we convert it to a list of tuples, which can be passed into pd.DataFrame.

name = ["Bob", "Sam", "Sally", "Sue"]
age = [19, 17, 51, 49]

data = list(zip(name, age))

df = pd.DataFrame(data,
                  columns = ['Name', 'Age'])

df

Output:

    Name  Age
0    Bob   19
1    Sam   17
2  Sally   51
3    Sue   49