
This project is a real-world data extraction, cleaning and exploratory analysis initiative conducted in collaboration with MuckRock, a nonprofit newsroom and transparency platform that helps the public request, analyze, and share government records.
The objective was to transform unstructured, multi-year PDF reports from the U.S. Army Recreation Machine Program into a clean, analyzable dataset and an interactive exploratory interface that surfaces revenue patterns across military bases worldwide.
Due to confidentiality and client restrictions, the raw dataset, internal documents, and certain implementation details cannot be publicly shared.
Project Scope
- Extract tabular data from highly inconsistent PDF reports (FY 2020–2024)
- Clean and standardize revenue, machine, and base-level fields
- Consolidate multi-format tables into a unified SQLite database
- Enable public-facing exploration via Datasette
- Surface early insights on revenue distribution, growth trends, and structural data gaps
My Key Contributions
1. Extraction Pipeline Research & Prototyping
- Evaluated multiple PDF extraction approaches:
  - Adobe Acrobat / Adobe Express exports
  - MuckRock extraction tools
  - Python-based solutions (`tabula`, `camelot`)
- Diagnosed critical failure modes:
  - Page-based exports causing misaligned columns
  - Broken text spans and missing rows
- Helped the team pivot to a table-format–specific extraction strategy, which became the backbone of the final pipeline.
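The routing idea behind a table-format-specific strategy can be sketched as follows. This is a minimal illustration, not the project's actual code: the format names and header keywords below are invented placeholders.

```python
# Hypothetical sketch: each known report format gets its own parser, selected
# by matching signature columns in an extracted header row. Format names and
# signature keywords here are illustrative only.
FORMAT_SIGNATURES = {
    "asset_report_format_4": ("BASE", "MACHINE TYPE", "REVENUE"),
    "regional_summary": ("REGION", "TOTAL"),
}

def detect_format(header_cells):
    """Return the first format whose signature columns all appear in the header."""
    cells = {c.strip().upper() for c in header_cells}
    for name, required in FORMAT_SIGNATURES.items():
        if all(col in cells for col in required):
            return name
    return None  # unknown layout: flag for manual review
```

Dispatching on detected format lets each parser assume a stable column layout, which avoids the misaligned-column failures of one-size-fits-all page exports.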
2. Cleaning Pipeline for Asset Report – Format 4 Tables
For each fiscal year (2020–2024), I:
- Split PDFs into precise table segments
- Converted them into structured CSV files
- Wrote Python scripts to standardize:
  - Column names
  - Revenue fields
  - Machine-type categories
  - Base identifiers
- Validated data integrity and resolved residual alignment issues
- Ensured cross-year consistency to support downstream merging and analysis
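A minimal sketch of what such a standardization step can look like, using invented column names and toy values (the real field names are not public):

```python
import pandas as pd

def clean_asset_table(df):
    """Standardize an extracted Format 4 table (illustrative field names)."""
    # Normalize column names: trim, lowercase, underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Strip currency formatting so revenue parses as a number
    df["revenue"] = (
        df["revenue"].astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    # Collapse whitespace and case variants of base names
    df["base"] = df["base"].str.strip().str.title()
    return df

raw = pd.DataFrame({"Base ": [" camp alpha "], "Revenue": ["$1,234.50"]})
tidy = clean_asset_table(raw)
```

Applying the same function to every fiscal year is what makes the cross-year merge safe: identical column names and dtypes come out regardless of how each PDF was formatted.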
3. Data Quality Review & Consolidation
After individual format-level cleaning:
- Assisted with merging yearly datasets
- Resolved conflicting column types and schemas
- Corrected repeated, missing, or inconsistent base names
- Identified systemic OCR-related errors inherited from early extraction attempts
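The consolidation step can be sketched like this, on toy data. The alias table and column names are hypothetical; the real base-name variants came from OCR and formatting inconsistencies in the source PDFs.

```python
import pandas as pd

# Hypothetical canonical-name mapping for known base-name variants
BASE_ALIASES = {"Ft. Example": "Fort Example", "FORT EXAMPLE": "Fort Example"}

def consolidate(yearly_frames):
    """Merge per-year tables: tag the year, coerce dtypes, unify base names."""
    frames = []
    for year, df in yearly_frames.items():
        df = df.copy()
        df["fiscal_year"] = year
        # Resolve conflicting column types (e.g. revenue as string in one year)
        df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
        df["base"] = df["base"].replace(BASE_ALIASES)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

fy20 = pd.DataFrame({"base": ["Ft. Example"], "revenue": ["100"]})
fy21 = pd.DataFrame({"base": ["FORT EXAMPLE"], "revenue": [120]})
merged = consolidate({2020: fy20, 2021: fy21})
```

Coercing types before `pd.concat` avoids mixed-dtype columns, and mapping aliases before the merge means per-base aggregations no longer split one installation across several spellings.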
4. Weekly Project Coordination & Client Alignment
- Participated in weekly internal team meetings and bi-weekly client check-ins
- Helped decompose large tasks (PDF parsing, cleaning, merging, deployment) into weekly sprints
- Coordinated communication around blockers such as:
  - Inconsistent table layouts
  - Missing fields
  - Ambiguous metadata definitions
- Tracked deliverables across GitHub and shared documentation to maintain reproducibility and clarity
5. Exploratory Analysis & Visualization Preparation
Contributed to early analysis and visualization design, including:
- Per-branch revenue breakdowns
- Per-base revenue rankings
- Year-over-year revenue trends
- Machine-type distribution over time
These analyses informed both the Datasette interface and the final report structure.
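Two of the analyses above, per-base revenue rankings and year-over-year trends, reduce to short groupby operations once the data is consolidated. The frame below is toy data for illustration only:

```python
import pandas as pd

# Toy stand-in for the consolidated dataset
df = pd.DataFrame({
    "base": ["A", "A", "B", "B"],
    "fiscal_year": [2020, 2021, 2020, 2021],
    "revenue": [100.0, 150.0, 80.0, 60.0],
})

# Per-base revenue ranking (highest first)
base_rank = df.groupby("base")["revenue"].sum().sort_values(ascending=False)

# Year-over-year growth of total revenue
yoy = df.groupby("fiscal_year")["revenue"].sum().pct_change()
```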
6. Datasette Prototype
Helped prepare a user-friendly Datasette deployment by ensuring:
- Clear table relationships
- Filters for base, branch, region, and year
- Easy exploration of key metrics such as:
  - Total revenue
  - Machine counts
  - Temporal trends
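Loading the cleaned tables into SQLite for Datasette can be sketched with the standard library alone; the table and column names below are placeholders, not the deployed schema. Once written to a file, the database can be served with `datasette revenue.db`.

```python
import sqlite3

# Toy rows standing in for the cleaned, consolidated data
rows = [
    ("Fort Example", "amusement", 2020, 100.0),
    ("Fort Example", "amusement", 2021, 150.0),
]

conn = sqlite3.connect(":memory:")  # use a .db file for a real Datasette deployment
conn.execute(
    """CREATE TABLE revenue
       (base TEXT, machine_type TEXT, fiscal_year INTEGER, revenue REAL)"""
)
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?)", rows)

# The kind of metric Datasette's SQL interface exposes to end users
total = conn.execute("SELECT SUM(revenue) FROM revenue").fetchone()[0]
```

Declaring explicit column types in the schema is what makes Datasette's facet and filter UI behave sensibly for base, machine type, and year.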
7. Final Presentation & Client Deliverables
- Co-authored final slides, particularly:
  - Data extraction challenges
  - Cleaning methodology
  - Reproducible workflow design
- Contributed speaker notes explaining:
  - Why PDF extraction was uniquely difficult
  - How misaligned tables were corrected
  - Limitations of the source data
- Helped articulate next-step recommendations, including:
  - Deeper trend analysis
  - Data completeness improvements
  - Follow-up FOIA opportunities
Skills & Impact Demonstrated
- Real-world PDF data extraction under ambiguity
- Data cleaning and schema standardization
- Collaborative data engineering workflows
- Client-facing communication and expectation management
- Reproducible analytics for investigative journalism
- Translating messy government records into public insight