Tutorial

1 Overview

This tutorial walks through how to install the package, run the full data pipeline, and reproduce the core analysis used in this project.

The goal is to provide a clear, end-to-end workflow so that another user can generate the dataset and explore earthquake impact patterns independently.


2 Installation

First, clone the repository and install the package locally.

git clone https://github.com/amandazweller/Final-project.git
cd Final-project
pip install -e .

This installs the earthquake_analysis package in editable mode, allowing you to use its functions directly.


3 Running the Data Pipeline

The full data pipeline fetches the raw data from both sources, merges them, and cleans the result.

python scripts/run_pipeline.py

This script performs the following steps:

  1. Fetches earthquake event data from the USGS API
  2. Fetches significant earthquake impact data from NOAA/NCEI
  3. Matches events across datasets using time and geographic proximity
  4. Cleans and standardizes the merged dataset
  5. Outputs processed data files to the data/ directory
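Step 3, matching events across the two sources, can be sketched as follows. This is a minimal, self-contained illustration, not the project's actual implementation: the 60-second time window and 100 km distance threshold are assumptions, and the real `merge_usgs_ncei` function may use different criteria and column names.

```python
from math import radians, sin, cos, asin, sqrt
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def match_events(usgs_events, ncei_events, max_seconds=60, max_km=100.0):
    """Pair events that are close in both time and space (assumed thresholds)."""
    pairs = []
    for u in usgs_events:
        for n in ncei_events:
            dt = abs((u["time"] - n["time"]).total_seconds())
            dist = haversine_km(u["lat"], u["lon"], n["lat"], n["lon"])
            if dt <= max_seconds and dist <= max_km:
                pairs.append((u["id"], n["id"]))
    return pairs

usgs = [{"id": "us1", "time": datetime(2023, 2, 6, 1, 17), "lat": 37.2, "lon": 37.0}]
ncei = [{"id": "nc1", "time": datetime(2023, 2, 6, 1, 17, 30), "lat": 37.3, "lon": 37.1}]
print(match_events(usgs, ncei))  # → [('us1', 'nc1')]
```

The same events reported 30 seconds and roughly 14 km apart are paired; events outside either threshold would not be.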

After running, you should see files such as:

  • data/cleaned.csv
  • data/analysis_subset.csv

4 Using the Package

You can also run individual steps manually using the package functions.

from earthquake_analysis import (
    fetch_ncei,
    fetch_usgs,
    merge_usgs_ncei,
    clean_merged,
    magnitude_vs_impact,
)

# Fetch raw data
df_ncei = fetch_ncei(min_year=2010, max_year=2024)
df_usgs = fetch_usgs("2010-01-01", "2024-12-31")

# Merge datasets
merged = merge_usgs_ncei(df_usgs, df_ncei)

# Clean data
cleaned = clean_merged(merged)

# Run an example analysis
result = magnitude_vs_impact(cleaned)
print(result.head())

This modular structure allows you to:

  • inspect intermediate data
  • modify matching criteria
  • run custom analyses
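As one example of a custom analysis on the intermediate data, you could bin events by magnitude and compare average deaths per bin. The column names below (`magnitude`, `deaths`) are assumptions for illustration; the real columns produced by `clean_merged` may differ.

```python
import pandas as pd

# Hypothetical cleaned output; real column names may differ.
cleaned = pd.DataFrame({
    "magnitude": [5.1, 5.8, 6.8, 7.4, 7.9],
    "deaths":    [0,   2,   12,  340, 2100],
})

# Custom analysis: average deaths per magnitude bin
bins = pd.cut(cleaned["magnitude"], bins=[5.0, 6.0, 7.0, 8.0])
summary = cleaned.groupby(bins, observed=True)["deaths"].mean()
print(summary)
```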

5 Running the Dashboard

To explore the results interactively, launch the Streamlit app:

streamlit run app.py

The dashboard provides:

  • magnitude vs. impact comparisons
  • depth-based summaries
  • regional vulnerability visualizations
  • time-series trends for deaths and damage

6 Reproducibility Notes

  • The pipeline depends on external APIs (USGS and NOAA/NCEI)
  • Running the pipeline may take several minutes due to data retrieval
  • Results may vary slightly over time as source data updates
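Because source data updates over time, caching each raw download lets you rerun the analysis later against the same inputs. A minimal sketch of this idea follows; the `fetch_cached` helper and cache path are illustrative, not part of the package.

```python
import json
import tempfile
from pathlib import Path

def fetch_cached(cache_path, fetch_fn):
    """Return cached data if present; otherwise fetch once and save it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text())
    data = fetch_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data))
    return data

# Stand-in fetcher; in practice fetch_fn would wrap an API call.
cache = Path(tempfile.mkdtemp()) / "usgs_2010_2024.json"
calls = []
def fake_fetch():
    calls.append(1)
    return {"events": 3}

first = fetch_cached(cache, fake_fetch)
second = fetch_cached(cache, fake_fetch)
print(len(calls))  # → 1  (second call served from the cache)
```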

7 Troubleshooting

Issue: Missing dataset when launching the app
→ Run the pipeline first:

python scripts/run_pipeline.py

Issue: Slow data retrieval
→ This is expected when fetching large time ranges from the USGS API
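One common way to cope with slow retrieval is to split a large date range into smaller windows, such as one per year, and fetch them sequentially. The helper below is an illustrative sketch, not part of the package; each window could be passed to `fetch_usgs` and the results concatenated.

```python
from datetime import date

def year_windows(start, end):
    """Split a date range into per-year (start, end) ISO-date windows."""
    windows = []
    for year in range(start.year, end.year + 1):
        w_start = max(start, date(year, 1, 1))
        w_end = min(end, date(year, 12, 31))
        windows.append((w_start.isoformat(), w_end.isoformat()))
    return windows

print(year_windows(date(2010, 1, 1), date(2012, 6, 30)))
# → [('2010-01-01', '2010-12-31'), ('2011-01-01', '2011-12-31'), ('2012-01-01', '2012-06-30')]
```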

Issue: Import errors
→ Ensure the package is installed using:

pip install -e .

8 Summary

This tutorial demonstrates how to:

  • install and configure the project
  • generate the dataset from raw sources
  • run analysis functions
  • launch the interactive dashboard

Together, these steps reproduce the full data science workflow used in the project.