Pandas Tutorial: DataFrames and Data Analysis in Python

Introduction
Most data work starts with a CSV file and ends with a cleaned, transformed dataset ready for analysis. Pandas sits in the middle of that journey, handling the messy work of loading, cleaning, reshaping, and aggregating data. If you work with tabular data in Python, Pandas is the tool you will use daily.
Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any type of data, such as integers, floats, strings, and even Python objects. A DataFrame, on the other hand, represents a two-dimensional table with rows and columns, similar to a spreadsheet or a SQL table.
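To make the distinction concrete, here is a minimal sketch building both structures from invented values:
import pandas as pd

# A Series: one-dimensional, labeled by an index
s = pd.Series([10, 20, 30], name='score')

# A DataFrame: two-dimensional, built here from a dict of columns
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cam'], 'score': [10, 20, 30]})
print(s)
print(df)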
Basic Commands
To work with Pandas, you first need to import the library. You can do this using the following line:
import pandas as pd
To read data, you can use the read_csv() function, which reads data from a CSV file and stores it in a DataFrame:
df = pd.read_csv('data.csv')
To write data, you can use the to_csv() function, which writes the contents of a DataFrame to a CSV file:
df.to_csv('output.csv', index=False)
Other basic commands include:
- head(): Displays the first five rows of a DataFrame
- tail(): Displays the last five rows of a DataFrame
- info(): Provides information about the DataFrame, such as its shape, data types, and memory usage
- describe(): Generates descriptive statistics of the DataFrame, such as count, mean, and standard deviation
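A quick sketch of these inspection calls, reusing the placeholder file from above:
import pandas as pd

df = pd.read_csv('data.csv')  # 'data.csv' is a placeholder
print(df.head())       # first five rows
print(df.tail())       # last five rows
df.info()              # dtypes, non-null counts, memory usage
print(df.describe())   # count, mean, std, min, quartiles, max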
Data Manipulation
Pandas provides a wide range of functions for manipulating data, including indexing, slicing, filtering, and sorting. Here are some examples, with a combined runnable sketch after the list:
- Indexing:
df.loc[3] # Returns the row with index label 3
df.iloc[3] # Returns the fourth row (position-based)
df.loc[3, 'column'] # Returns the value of 'column' in the row with index label 3
df.at[3, 'column'] # Fast label-based scalar access to the same single cell
- Slicing:
df[3:6] # Returns rows 3 to 5
df['column'].iloc[3:6] # Returns rows 3 to 5 of the 'column' (explicit .iloc avoids chained indexing)
- Filtering:
df[df['column'] < 10] # Returns all rows where the 'column' is less than 10
df[(df['column1'] < 10) & (df['column2'] > 5)] # Returns all rows where the 'column1' is less than 10 and 'column2' is greater than 5
- Sorting:
df.sort_values('column', ascending=False) # Sorts the DataFrame by the 'column' in descending order
df.sort_index(ascending=False) # Sorts the DataFrame by its index in descending order
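Putting these together, a minimal sketch on invented data (the column names are placeholders):
import pandas as pd

df = pd.DataFrame({'column1': [3, 12, 7, 9], 'column2': [8, 2, 6, 1]})
subset = df[(df['column1'] < 10) & (df['column2'] > 5)]  # boolean filtering
print(subset.sort_values('column1', ascending=False))    # then sorting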
Data Visualization
Pandas provides plotting helpers for line plots, scatter plots, histograms, and box plots; these are thin wrappers around Matplotlib, which must be installed. Here is an example of how to create a line plot:
df.plot(kind='line', x='date', y='sales')
Other visualization options include:
- Scatter plot:
df.plot(kind='scatter', x='age', y='income')
- Histogram:
df['column'].hist(bins=10)
- Box plot:
df.boxplot(column=['column1', 'column2'])
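Each of these calls returns a Matplotlib Axes object, so you finish and display a plot the usual Matplotlib way. A minimal sketch, assuming a DataFrame with hypothetical 'date' and 'sales' columns:
import matplotlib.pyplot as plt

ax = df.plot(kind='line', x='date', y='sales')
ax.set_title('Sales over time')  # customize via the returned Axes
plt.show()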
Grouping and Aggregating
Pandas allows you to group data together based on one or more columns and perform aggregate operations on those groups. Here is an example:
grouped = df.groupby('column')
grouped['column2'].mean()
This groups the DataFrame by the values in 'column' and calculates the mean of 'column2' for each group.
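groupby can also apply several aggregations at once through agg(). A sketch on invented data:
df = pd.DataFrame({'column': ['a', 'a', 'b'], 'column2': [1, 3, 5]})
print(df.groupby('column')['column2'].agg(['mean', 'sum', 'count']))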
Pivot Tables
Pandas also provides functions for creating pivot tables, which are similar to Excel pivot tables. Pivot tables allow you to summarize data by grouping and aggregating it in different ways. Here is an example:
pivot = df.pivot_table(index='date', columns='category', values='sales', aggfunc='sum')
This creates a pivot table with the 'date' column as the index, the 'category' column as the columns, and the 'sales' column as the values. It calculates the sum of the 'sales' for each combination of date and category.
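A worked sketch on a tiny invented dataset; fill_value replaces the NaN produced when a date/category combination has no sales:
df = pd.DataFrame({
    'date': ['2020-01', '2020-01', '2020-02'],
    'category': ['A', 'B', 'A'],
    'sales': [100, 150, 200],
})
pivot = df.pivot_table(index='date', columns='category',
                       values='sales', aggfunc='sum', fill_value=0)
print(pivot)  # 2020-02/B had no sales, so it shows 0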
Handling Missing Data
Pandas provides multiple functions for handling missing data, including dropping or filling missing values. Here are some examples:
- Dropping missing values:
df.dropna() # Drops all rows with any missing values
df.dropna(how='all') # Drops rows where all values are missing
df.dropna(thresh=2) # Keeps rows with at least 2 non-missing values and drops the rest
- Filling missing values:
df.fillna(0) # Fills all missing values with 0
df.ffill() # Fills missing values with the previous non-missing value (forward fill; replaces the deprecated fillna(method='ffill'))
df.bfill() # Fills missing values with the next non-missing value (backward fill)
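A minimal sketch of these on a small frame with gaps (values invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})
print(df.dropna())   # keeps only the last row, the only one with no gaps
print(df.fillna(0))  # every gap becomes 0
print(df.ffill())    # each gap takes the value above it, where one exists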
Merging and Joining Data
Pandas can combine DataFrames regardless of where they came from, whether CSV files, SQL tables, or Excel spreadsheets. Here are some examples:
- Merging two DataFrames:
merged = pd.merge(df1, df2, on='key')
This merges two DataFrames on the 'key' column.
- Joining two DataFrames:
joined = df1.join(df2, lsuffix='_left', rsuffix='_right')
This joins two DataFrames based on their index.
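merge performs an inner join by default; the how parameter selects other join types. A sketch on invented frames:
df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})
print(pd.merge(df1, df2, on='key'))              # inner join: keys b and c only
print(pd.merge(df1, df2, on='key', how='left'))  # left join: keeps 'a', with NaN for y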
Time Series Analysis
Pandas provides extensive support for working with time series data, such as stock prices or weather data. Here are some examples:
- Creating a time series DataFrame:
index = pd.date_range('2020-01-01', '2020-12-31')
ts = pd.DataFrame({'value': range(len(index))}, index=index)
This creates a DataFrame indexed by every date from January 1 to December 31, 2020, with a placeholder 'value' column so there is something to aggregate.
- Resampling time series data:
ts.resample('M').mean() # 'M' groups by calendar month end; newer pandas versions prefer the alias 'ME'
This resamples the time series data to monthly intervals and calculates the mean value for each month.
Advanced Topics
Pandas provides advanced features such as multi-level indexing, reshaping data, and working with databases. Here are some examples:
- Multi-level indexing:
df.set_index(['column1', 'column2'], inplace=True)
This replaces the DataFrame's index with a two-level MultiIndex built from the values of 'column1' and 'column2'; with inplace=True, df is modified directly.
- Reshaping data:
df.melt(id_vars=['id'], value_vars=['col1', 'col2'])
This reshapes a wide DataFrame to a long DataFrame, where multiple columns are converted to a single column.
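A before/after sketch on invented data makes the wide-to-long reshape concrete:
df = pd.DataFrame({'id': [1, 2], 'col1': [10, 20], 'col2': [30, 40]})
print(df.melt(id_vars=['id'], value_vars=['col1', 'col2']))
# Result has columns id, variable, value: one row per (id, original column) pair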
- Working with databases:
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM my_table', conn)  # 'my_table' is a placeholder; 'table' itself is a reserved SQL word
This reads data from a SQL database into a DataFrame.
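Writing results back is the mirror operation via to_sql. A sketch, reusing the same connection (the table name 'results' is invented):
df.to_sql('results', conn, if_exists='replace', index=False)
conn.close()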
Conclusion
Pandas handles the full lifecycle of tabular data work: loading from various sources, cleaning and transforming, analyzing patterns, and exporting results. The operations covered here—indexing, filtering, grouping, merging, and time series handling—form the foundation for most data manipulation tasks. Start with DataFrames for structured work, and reach for the more advanced features like multi-level indexing when your data complexity demands it.
Related Topics
- Top 10 Python Libraries for Data Engineering - Explore other essential Python libraries
- Jupyter Notebooks - Interactive data analysis environment perfect for Pandas
- Data Processing Pipeline Patterns - Learn about data transformation patterns