
This project is a real-world data extraction, cleaning and exploratory analysis initiative conducted in collaboration with MuckRock, a nonprofit newsroom and transparency platform that helps the public request, analyze, and share government records.
The objective was to transform unstructured, multi-year PDF reports from the U.S. Army Recreation Machine Program into a clean, analyzable dataset and an interactive exploratory interface that surfaces revenue patterns across military bases worldwide.
Due to confidentiality and client restrictions, the raw dataset, internal documents, and certain implementation details cannot be publicly shared.
Project Scope
- Extract tabular data from highly inconsistent PDF reports (FY 2020–2024)
- Clean and standardize revenue, machine, and base-level fields
- Consolidate multi-format tables into a unified SQLite database
- Enable public-facing exploration via Datasette
- Surface early insights on revenue distribution, growth trends, and structural data gaps
My Key Contributions
1. Extraction Pipeline Research & Prototyping
- Evaluated multiple PDF extraction approaches:
  - Adobe Acrobat / Adobe Express exports
  - MuckRock extraction tools
  - Python-based solutions (`tabula`, `camelot`)
- Diagnosed critical failure modes:
  - Page-based exports causing misaligned columns
  - Broken text spans and missing rows
- Helped the team pivot to a table-format–specific extraction strategy, which became the backbone of the final pipeline.
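The routing idea behind a table-format-specific strategy can be sketched as follows. This is a minimal illustration, not the project's actual code: the format names and header keywords below are invented placeholders.

```python
# Hypothetical sketch: each known report format gets its own parser, selected
# by matching signature columns in an extracted header row. Format names and
# signature keywords here are illustrative only.
FORMAT_SIGNATURES = {
    "asset_report_format_4": ("BASE", "MACHINE TYPE", "REVENUE"),
    "regional_summary": ("REGION", "TOTAL"),
}

def detect_format(header_cells):
    """Return the first format whose signature columns all appear in the header."""
    cells = {c.strip().upper() for c in header_cells}
    for name, required in FORMAT_SIGNATURES.items():
        if all(col in cells for col in required):
            return name
    return None  # unknown layout: flag for manual review
```

Dispatching on detected format lets each parser assume a stable column layout, which avoids the misaligned-column failures of one-size-fits-all page exports.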
2. Cleaning Pipeline for Asset Report – Format 4 Tables
For each fiscal year (2020–2024), I:
- Split PDFs into precise table segments
- Converted them into structured CSV files
- Wrote Python scripts to standardize:
  - Column names
  - Revenue fields
  - Machine-type categories
  - Base identifiers
- Validated data integrity and resolved residual alignment issues
- Ensured cross-year consistency to support downstream merging and analysis
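A minimal sketch of what such a standardization step can look like, using invented column names and toy values (the real field names are not public):

```python
import pandas as pd

def clean_asset_table(df):
    """Standardize an extracted Format 4 table (illustrative field names)."""
    # Normalize column names: trim, lowercase, underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Strip currency formatting so revenue parses as a number
    df["revenue"] = (
        df["revenue"].astype(str)
        .str.replace(r"[$,]", "", regex=True)
        .astype(float)
    )
    # Collapse whitespace and case variants of base names
    df["base"] = df["base"].str.strip().str.title()
    return df

raw = pd.DataFrame({"Base ": [" camp alpha "], "Revenue": ["$1,234.50"]})
tidy = clean_asset_table(raw)
```

Applying the same function to every fiscal year is what makes the cross-year merge safe: identical column names and dtypes come out regardless of how each PDF was formatted.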
3. Data Quality Review & Consolidation
After individual format-level cleaning:
- Assisted with merging yearly datasets
- Resolved conflicting column types and schemas
- Corrected repeated, missing, or inconsistent base names
- Identified systemic OCR-related errors inherited from early extraction attempts
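The consolidation step can be sketched like this, on toy data. The alias table and column names are hypothetical; the real base-name variants came from OCR and formatting inconsistencies in the source PDFs.

```python
import pandas as pd

# Hypothetical canonical-name mapping for known base-name variants
BASE_ALIASES = {"Ft. Example": "Fort Example", "FORT EXAMPLE": "Fort Example"}

def consolidate(yearly_frames):
    """Merge per-year tables: tag the year, coerce dtypes, unify base names."""
    frames = []
    for year, df in yearly_frames.items():
        df = df.copy()
        df["fiscal_year"] = year
        # Resolve conflicting column types (e.g. revenue as string in one year)
        df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
        df["base"] = df["base"].replace(BASE_ALIASES)
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

fy20 = pd.DataFrame({"base": ["Ft. Example"], "revenue": ["100"]})
fy21 = pd.DataFrame({"base": ["FORT EXAMPLE"], "revenue": [120]})
merged = consolidate({2020: fy20, 2021: fy21})
```

Coercing types before `pd.concat` avoids mixed-dtype columns, and mapping aliases before the merge means per-base aggregations no longer split one installation across several spellings.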
4. Weekly Project Coordination & Client Alignment
- Participated in weekly internal team meetings and bi-weekly client check-ins
- Helped decompose large tasks (PDF parsing, cleaning, merging, deployment) into weekly sprints
- Coordinated communication around blockers such as:
  - Inconsistent table layouts
  - Missing fields
  - Ambiguous metadata definitions
- Tracked deliverables across GitHub and shared documentation to maintain reproducibility and clarity
5. Exploratory Analysis & Visualization Preparation
Contributed to early analysis and visualization design, including:
- Per-branch revenue breakdowns
- Per-base revenue rankings
- Year-over-year revenue trends
- Machine-type distribution over time
These analyses informed both the Datasette interface and the final report structure.
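Two of the analyses above, per-base revenue rankings and year-over-year trends, reduce to short groupby operations once the data is consolidated. The frame below is toy data for illustration only:

```python
import pandas as pd

# Toy stand-in for the consolidated dataset
df = pd.DataFrame({
    "base": ["A", "A", "B", "B"],
    "fiscal_year": [2020, 2021, 2020, 2021],
    "revenue": [100.0, 150.0, 80.0, 60.0],
})

# Per-base revenue ranking (highest first)
base_rank = df.groupby("base")["revenue"].sum().sort_values(ascending=False)

# Year-over-year growth of total revenue
yoy = df.groupby("fiscal_year")["revenue"].sum().pct_change()
```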
6. Datasette Prototype
Helped prepare a user-friendly Datasette deployment by ensuring:
- Clear table relationships
- Filters for base, branch, region, and year
- Easy exploration of key metrics such as:
  - Total revenue
  - Machine counts
  - Temporal trends
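Loading the cleaned tables into SQLite for Datasette can be sketched with the standard library alone; the table and column names below are placeholders, not the deployed schema. Once written to a file, the database can be served with `datasette revenue.db`.

```python
import sqlite3

# Toy rows standing in for the cleaned, consolidated data
rows = [
    ("Fort Example", "amusement", 2020, 100.0),
    ("Fort Example", "amusement", 2021, 150.0),
]

conn = sqlite3.connect(":memory:")  # use a .db file for a real Datasette deployment
conn.execute(
    """CREATE TABLE revenue
       (base TEXT, machine_type TEXT, fiscal_year INTEGER, revenue REAL)"""
)
conn.executemany("INSERT INTO revenue VALUES (?, ?, ?, ?)", rows)

# The kind of metric Datasette's SQL interface exposes to end users
total = conn.execute("SELECT SUM(revenue) FROM revenue").fetchone()[0]
```

Declaring explicit column types in the schema is what makes Datasette's facet and filter UI behave sensibly for base, machine type, and year.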
7. Final Presentation & Client Deliverables
- Co-authored final slides, particularly:
  - Data extraction challenges
  - Cleaning methodology
  - Reproducible workflow design
- Contributed speaker notes explaining:
  - Why PDF extraction was uniquely difficult
  - How misaligned tables were corrected
  - Limitations of the source data
- Helped articulate next-step recommendations, including:
  - Deeper trend analysis
  - Data completeness improvements
  - Follow-up FOIA opportunities
Skills & Impact Demonstrated
- Real-world PDF data extraction under ambiguity
- Data cleaning and schema standardization
- Collaborative data engineering workflows
- Client-facing communication and expectation management
- Reproducible analytics for investigative journalism
- Translating messy government records into public insight