Tutorial
1 Overview
This tutorial walks through how to install the package, run the full data pipeline, and reproduce the core analysis used in this project.
The goal is to provide a clear, end-to-end workflow so that another user can generate the dataset and explore earthquake impact patterns independently.
2 Installation
First, clone the repository and install the package locally.
git clone https://github.com/amandazweller/Final-project.git
cd Final-project
pip install -e .

This installs the earthquake_analysis package in editable mode, allowing you to use its functions directly.
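To confirm the editable install worked, you can check that Python can find the package before moving on. This is a quick sanity check, not part of the project itself:

```python
import importlib.util

def is_installed(pkg="earthquake_analysis"):
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(pkg) is not None

print("installed" if is_installed() else "run `pip install -e .` first")
```

If this prints the reminder instead of "installed", make sure you ran the install command from inside the cloned Final-project directory.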
3 Running the Data Pipeline
The full data pipeline collects, merges, and cleans the dataset.
python scripts/run_pipeline.py

This script performs the following steps:
- Fetches earthquake event data from the USGS API
- Fetches significant earthquake impact data from NOAA/NCEI
- Matches events across datasets using time and geographic proximity
- Cleans and standardizes the merged dataset
- Outputs processed data files to the data/ directory
After running, you should see files such as:
data/cleaned.csv
data/analysis_subset.csv
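The matching step above pairs records by time and geographic proximity. The sketch below illustrates the idea with made-up thresholds (24 hours, 100 km) and assumed column names (`id`, `time`, `lat`, `lon`); the package's actual matching criteria may differ:

```python
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_events(usgs, ncei, max_hours=24, max_km=100):
    """Pair each NCEI impact record with USGS events that fall
    within both the time window and the distance window."""
    matches = []
    for _, imp in ncei.iterrows():
        # First filter by time, which is cheap, then compute distances.
        near = usgs[(usgs["time"] - imp["time"]).abs() <= pd.Timedelta(hours=max_hours)]
        for _, ev in near.iterrows():
            d = haversine_km(imp["lat"], imp["lon"], ev["lat"], ev["lon"])
            if d <= max_km:
                matches.append({"usgs_id": ev["id"], "ncei_id": imp["id"], "dist_km": d})
    return pd.DataFrame(matches)
```

Tightening or loosening `max_hours` and `max_km` trades false matches against missed ones, which is why the pipeline exposes matching as a separate step.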
4 Using the Package
You can also run individual steps manually using the package functions.
from earthquake_analysis import (
fetch_ncei,
fetch_usgs,
merge_usgs_ncei,
clean_merged,
magnitude_vs_impact,
)
# Fetch raw data
df_ncei = fetch_ncei(min_year=2010, max_year=2024)
df_usgs = fetch_usgs("2010-01-01", "2024-12-31")
# Merge datasets
merged = merge_usgs_ncei(df_usgs, df_ncei)
# Clean data
cleaned = clean_merged(merged)
# Run an example analysis
result = magnitude_vs_impact(cleaned)
print(result.head())

This modular structure allows you to:
- inspect intermediate data
- modify matching criteria
- run custom analyses
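As an example of a custom analysis on the cleaned data, the sketch below bins events by magnitude and summarizes reported deaths. The column names (`magnitude`, `deaths`) and bin edges are illustrative assumptions about the cleaned dataset, not guaranteed by the package:

```python
import pandas as pd

def deaths_by_magnitude_bin(cleaned):
    """Group events into magnitude bins and summarize reported deaths.
    Assumes the cleaned dataset has 'magnitude' and 'deaths' columns."""
    bins = [0, 5, 6, 7, 8, 10]
    labels = ["<5", "5-6", "6-7", "7-8", "8+"]
    binned = cleaned.assign(mag_bin=pd.cut(cleaned["magnitude"], bins=bins, labels=labels))
    return binned.groupby("mag_bin", observed=True)["deaths"].agg(["count", "sum", "median"])
```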
5 Running the Dashboard
To explore the results interactively, launch the Streamlit app:
streamlit run app.py

The dashboard provides:
- magnitude vs. impact comparisons
- depth-based summaries
- regional vulnerability visualizations
- time-series trends for deaths and damage
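The time-series view can also be reproduced outside the dashboard with a few lines of pandas. The column names here (`time`, `deaths`, `damage_millions`) are assumptions about the cleaned dataset:

```python
import pandas as pd

def yearly_trends(cleaned):
    """Sum deaths and damage per year, as the dashboard's trend charts do."""
    df = cleaned.copy()
    df["year"] = pd.to_datetime(df["time"]).dt.year
    return df.groupby("year")[["deaths", "damage_millions"]].sum()
```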
6 Reproducibility Notes
- The pipeline depends on external APIs (USGS and NOAA/NCEI)
- Running the pipeline may take several minutes due to data retrieval
- Results may vary slightly over time as source data updates
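One way to limit drift from updating sources is to keep dated snapshots of each raw download before cleaning. This helper is an illustrative suggestion, not part of the pipeline; the directory and naming scheme are assumptions:

```python
from datetime import date
from pathlib import Path

import pandas as pd

def snapshot_raw(df, name, out_dir="data/raw"):
    """Write a dated CSV copy of a raw download so a run can be repeated
    against the exact data it used, even after the source updates."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{name}_{date.today().isoformat()}.csv"
    df.to_csv(target, index=False)
    return target
```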
7 Troubleshooting
Issue: Missing dataset when launching the app
→ Run the pipeline first:
python scripts/run_pipeline.py

Issue: Slow data retrieval
→ This is expected when fetching large time ranges from the USGS API
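For the missing-dataset case above, the app could fail fast with a helpful message instead of a stack trace. A minimal sketch, assuming data/cleaned.csv is the file the pipeline writes:

```python
from pathlib import Path

def require_dataset(path="data/cleaned.csv"):
    """Check that the pipeline output exists before the app tries to load it."""
    if Path(path).exists():
        return True
    print("Dataset not found. Run: python scripts/run_pipeline.py")
    return False
```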
Issue: Import errors
→ Ensure the package is installed using:
pip install -e .

8 Summary
This tutorial demonstrates how to:
- install and configure the project
- generate the dataset from raw sources
- run analysis functions
- launch the interactive dashboard
Together, these steps reproduce the full data science workflow used in the project.