Titanic Mortality Prediction
Predicting the survival chances of the passengers aboard the Titanic using the Titanic passengers dataset.
Jupyter Notebook and project details can be found in this git repo
Project Details / Background
The primary objective of this project was to predict the survival chances of the passengers aboard the Titanic. By investigating the data and applying machine learning models, I aimed to understand which factors influenced survival rates the most.
Technologies Used:
  • Python/Jupyter Notebook: Used for all aspects of structure and analysis throughout this project.
  • Pandas and NumPy: For data preprocessing and all numerical operations.
  • Scikit-learn: Used to implement machine learning algorithms and preprocessing techniques.
  • Matplotlib and Seaborn: For data visualisation, helping to interpret the data effectively.
Methodology:
  1. Data Preprocessing
    Data preprocessing involved handling missing values, encoding categorical variables, and normalising the data.
  2. Exploratory Data Analysis
    In this stage, I visualised different aspects of the dataset using Python's Matplotlib and Seaborn libraries. This helped in understanding the distribution of key variables and the relationship between different features and survival rates.
  3. Feature Engineering
    I created new features that could potentially increase the predictive power of the models. For instance, I derived a feature from each passenger's port of embarkation, which provided insight into its impact on survival.
  4. Model Selection
    I experimented with various machine learning models, including logistic regression and random forests. Each model was evaluated based on its accuracy and precision to select the best performer.
  5. Model Tuning
    Using grid search and cross-validation, I fine-tuned the hyperparameters of the selected model to improve its performance.
  6. Prediction and Evaluation
    The final step involved making predictions on the test set and evaluating the model using metrics like accuracy, precision, and recall. This provided a quantitative measure of the model's performance.
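
The steps above can be sketched end to end with a scikit-learn pipeline. This is a minimal illustration rather than the notebook's actual code: it runs on a small synthetic table (not the real Titanic data), the column names follow the Kaggle convention but are assumptions here, and the hyperparameter grids are arbitrary. For brevity it also folds model selection and tuning into a single grid search, which may differ from how the project did it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the passenger table (assumed column names, made-up values).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.choice(["male", "female"], n),
    "Age": rng.normal(30, 12, n).clip(1, 80),
    "Fare": rng.gamma(2.0, 15.0, n),
    "Embarked": rng.choice(["S", "C", "Q"], n),
})
# Give the synthetic label some dependence on sex and class so there is signal.
logit = 1.5 * (df["Sex"] == "female") - 0.6 * (df["Pclass"] - 2) - 0.5
df["Survived"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Blank out some ages to mimic missing values.
df.loc[rng.choice(n, 60, replace=False), "Age"] = np.nan

X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 1. Preprocessing: impute and scale numeric columns, impute and one-hot
#    encode categorical columns (including the port of embarkation).
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Pclass", "Age", "Fare"]),
    ("cat", categorical, ["Sex", "Embarked"]),
])

# 4-5. Model selection and tuning: the "clf" step is swapped between
#      candidate models, each with its own small (illustrative) grid.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__max_depth": [3, None]},
]
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# 6. Prediction and evaluation on the held-out test set.
y_pred = search.predict(X_test)
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, zero_division=0),
    "recall": recall_score(y_test, y_pred, zero_division=0),
}
print(search.best_params_)
print(metrics)
```

Keeping the preprocessing inside the pipeline means the imputer, scaler and encoder are refitted on each cross-validation training fold, so no information from the validation fold leaks into them.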
Results:

The best and final model achieved a score of 0.826 on the test set, meaning it correctly predicted survival for roughly 83% of the passengers in that set.

Upon submission to Kaggle, the best score was 0.787!
Conclusion:

This data science project was a really useful experience that allowed me to delve deep into machine learning and data analysis with a fairly simple dataset. By applying these technologies and methodologies, I gained a deeper understanding of the factors that influenced survival during the Titanic disaster and, more importantly, hands-on experience with techniques such as pipelines, feature encoding and one-hot encoding.

This project is a testament to how historical data can be leveraged to build predictive models that provide insights into past events, demonstrating the power and potential of data science.