Exploratory Data Analysis (EDA) on Titanic Dataset

Muhammad Dawood
Jun 15, 2023
Project 1: Exploratory Data Analysis (EDA) on Titanic Dataset

Introduction

The Titanic dataset is a popular choice for practicing data analysis and machine learning. It contains information about the passengers aboard the Titanic, including features like age, gender, fare, cabin deck, and survival status. In this project, we will perform exploratory data analysis (EDA) on the Titanic dataset using Python.

Dataset

The Titanic dataset is available in Seaborn as the ‘titanic’ dataset. It consists of the following columns:

  • survived: Survival status (0 = No, 1 = Yes)
  • pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
  • sex: Passenger’s gender (male or female)
  • age: Passenger’s age in years
  • sibsp: Number of siblings/spouses aboard
  • parch: Number of parents/children aboard
  • fare: Fare paid for the ticket
  • embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
  • class: Passenger class as a label (First, Second, Third); equivalent to pclass
  • who: Passenger’s category (man, woman, child)
  • adult_male: Whether the passenger is an adult male or not (True or False)
  • deck: Cabin deck
  • embark_town: Port of embarkation (Cherbourg, Queenstown, Southampton)
  • alive: Survival status (yes or no)
  • alone: Whether the passenger is alone or not (True or False)

Project Steps

1. Importing Libraries:

Let’s start by importing the required libraries for data analysis and visualization:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading the Dataset:

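The Titanic dataset is one of Seaborn's example datasets, so we can load it directly into a DataFrame. Note that load_dataset downloads the file from Seaborn's online data repository on first use, so an internet connection is required: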

df = sns.load_dataset('titanic')

3. Exploring the Data:

To gain initial insights into the dataset, we can perform some basic exploratory operations:

# Display the first few rows of the dataset
print(df.head())

# Check the dimensions of the dataset
print(df.shape)

# Get summary statistics of numerical variables
print(df.describe())

# Check the data types of variables
print(df.dtypes)

# Check for missing values
print(df.isnull().sum())
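
Raw counts of missing values are easier to judge next to the share of rows they represent. The following is a minimal sketch of that idea. The helper name missing_summary and the small toy frame are made up for illustration and are not part of the Seaborn dataset:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data with made-up values, only to demonstrate the helper
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [None, "C", None, None],
})
print(missing_summary(toy))
```

On the real data, the same helper could be called as missing_summary(df).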

4. Data Cleaning:

Data cleaning is an essential step in EDA. We must handle missing values, outliers, and inconsistencies in the dataset. Some common data-cleaning tasks include:

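# Handling missing values
# Calling df.dropna() right away would remove every row without a 'deck' value, which is most of the dataset
df = df.drop(columns=['deck'])  # Drop the mostly empty 'deck' column
df['age'] = df['age'].fillna(df['age'].median())  # Fill missing ages with the median age
df = df.dropna()  # Remove the few remaining rows with missing values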

# Handling outliers
# Identify and remove outliers using statistical methods or domain knowledge

# Data transformation
# Perform necessary transformations like scaling, encoding, or feature engineering
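
The outlier and transformation steps above are left as comments, so here is one possible way to sketch them. It assumes the common 1.5 × IQR rule for outliers and one-hot encoding for a categorical column. The helper name remove_iqr_outliers, the parameter k, and the toy values are all hypothetical choices for illustration, not the only reasonable ones:

```python
import pandas as pd


def remove_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is False for NaN, so rows missing this value are dropped too
    return frame[frame[column].between(lower, upper)]


# Toy data with made-up values, only to demonstrate the two steps
toy = pd.DataFrame({
    "fare": [7.0, 8.0, 9.0, 10.0, 11.0, 500.0],
    "sex": ["male", "female", "female", "male", "male", "female"],
})

trimmed = remove_iqr_outliers(toy, "fare")

# One-hot encode the categorical column so it can be used numerically
encoded = pd.get_dummies(trimmed, columns=["sex"])
print(encoded)
```

Whether high fares should be removed at all is a judgment call: on the Titanic data they may be genuine first-class tickets rather than errors, so it is worth inspecting the flagged rows before dropping them.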

5. Data Visualization:

Visualization helps us understand the data and identify patterns. We can create various types of plots using libraries like Matplotlib and Seaborn. Here are some examples:

# Bar plot
sns.countplot(x='survived', data=df)
plt.xlabel('Survival Status')
plt.ylabel('Count')
plt.title('Survival Count')
plt.show()

# Histogram
plt.hist(df['age'].dropna(), bins=10)  # Exclude missing ages before binning
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Age')
plt.show()

# Scatter plot
plt.scatter(df['age'], df['fare'])
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Age vs. Fare')
plt.show()

# Box plot
sns.boxplot(x='survived', y='fare', data=df)
plt.xlabel('Survival Status')
plt.ylabel('Fare')
plt.title('Survival Status vs. Fare')
plt.show()

6. Exploring Relationships:

EDA involves exploring relationships between variables to uncover insights. We can use techniques like correlation analysis or cross-tabulation for this purpose.

# Correlation analysis
correlation = df[['age', 'fare']].corr()
print(correlation)

# Cross-tabulation
cross_tab = pd.crosstab(df['pclass'], df['survived'])
print(cross_tab)
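
Raw cross-tab counts can be hard to compare when the groups differ in size, so it often helps to turn them into rates. Below is a minimal sketch, assuming survival is coded as 0/1 so that the group mean equals the survival rate. The helper name survival_rate_by and the toy values are made up for illustration:

```python
import pandas as pd


def survival_rate_by(frame: pd.DataFrame, group_col: str) -> pd.Series:
    """Mean of the 0/1 'survived' column per group, i.e. the survival rate."""
    return frame.groupby(group_col)["survived"].mean()


# Toy data with made-up values, only to demonstrate the technique
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

rates = survival_rate_by(toy, "pclass")
print(rates)

# The same idea via a cross-tabulation normalized within each row
rate_table = pd.crosstab(toy["pclass"], toy["survived"], normalize="index")
print(rate_table)
```

On the real data, survival_rate_by(df, 'pclass') or survival_rate_by(df, 'sex') would give the corresponding rates per group.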

7. Conclusion:

Based on the exploratory data analysis, we can summarize the key findings, insights, and potential areas for further investigation. This could include patterns, trends, outliers, or relationships observed during the analysis.

This project provides a basic outline of how to perform exploratory data analysis on the Titanic dataset using Python. Additional analysis and modeling can be performed based on specific research questions or objectives.
