IDI Logistics
Generative AI · Advanced Web Scraping

Automating government data workflows with AI

Replacing hundreds of hours of manual data parsing with AI-powered web scraping and automated intelligence delivery.

100+ government websites scraped weekly

6 parallel scraping batches

Monday 8 AM automated delivery

Hundreds of hours of manual work eliminated

About IDI Logistics

IDI Logistics is a logistics and real estate firm that tracks government data across municipalities and counties to identify potential deals. The company was spending hundreds of hours manually parsing data from public websites — time that could be redirected to higher-value deal analysis and relationship building.

Situation

IDI was spending hundreds of hours manually parsing data on potential deals from over 100 unique government entity websites. Their team needed to monitor municipalities and counties for newly published content — meeting minutes, zoning changes, and other public records — that could indicate upcoming development opportunities.

The manual process was not only time-intensive but also error-prone, with relevant documents frequently missed due to the sheer volume of content across disparate government portals.

Solution

Echelon developed a custom application that scrapes 100+ unique government entity websites and sends AI-generated summaries of all relevant newly published content. The system was designed for scale, robustness, and full automation.

The infrastructure begins with targeted web crawlers built on Beautiful Soup, Requests, and Selenium 4 (for JavaScript-rendered portals), running on AWS EC2 with S3 storage. The scraper compares download history against new files to avoid redundant retrieval, and entities are batched into parallel groups — a design that supports national-scale buildout without increasing runtimes.
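A minimal sketch of that batching and deduplication pattern is below. The history file, bucket layout, CSS selector, and function names are illustrative assumptions, not the production code.

```python
import concurrent.futures
import hashlib
import requests
from bs4 import BeautifulSoup

SEEN_KEYS_PATH = "download_history.txt"  # hypothetical history file synced from S3


def load_history(path: str) -> set[str]:
    """Load hashes of previously downloaded documents so they are not re-fetched."""
    try:
        with open(path) as f:
            return {line.strip() for line in f}
    except FileNotFoundError:
        return set()


def scrape_entity(url: str, history: set[str]) -> list[tuple[str, bytes]]:
    """Fetch one government portal and return only documents not seen before."""
    new_docs = []
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for link in soup.select("a[href$='.pdf']"):              # example: PDF links only
        doc_url = requests.compat.urljoin(url, link["href"])
        key = hashlib.sha256(doc_url.encode()).hexdigest()
        if key in history:
            continue                                          # skip already-downloaded files
        doc = requests.get(doc_url, timeout=60)
        new_docs.append((doc_url, doc.content))
        history.add(key)
    return new_docs


def run_batch(entity_urls: list[str]) -> list[tuple[str, bytes]]:
    """Scrape one batch of entities across parallel threads."""
    history = load_history(SEEN_KEYS_PATH)
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for docs in pool.map(lambda u: scrape_entity(u, history), entity_urls):
            results.extend(docs)
    return results
```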

A weekly cron job triggers scraper runs with built-in random delays to avoid rate limiting. Error monitoring via a custom Slack bot tracks runtime statistics and catches issues before summaries reach the client. The scraper continues through errors on individual pages to maintain robustness.
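The delay, error-tolerance, and Slack-reporting pattern might look roughly like the sketch below, assuming a Slack incoming webhook; the webhook URL, delay range, and per-entity scraper are placeholders.

```python
import random
import time
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook URL


def notify_slack(text: str) -> None:
    """Post runtime statistics and errors to a monitoring channel via an incoming webhook."""
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)


def run_weekly_scrape(entity_urls: list[str], scrape_one) -> None:
    """Scrape every entity, pausing randomly between requests and continuing past failures."""
    errors, start = [], time.time()
    for url in entity_urls:
        time.sleep(random.uniform(2, 10))        # random delay to avoid tripping rate limits
        try:
            scrape_one(url)                       # per-entity scraper passed in by the caller
        except Exception as exc:                  # one broken portal must not stop the run
            errors.append(f"{url}: {exc}")
    summary = (
        f"Scrape finished in {time.time() - start:.0f}s; "
        f"{len(errors)}/{len(entity_urls)} entities failed."
    )
    if errors:
        summary += "\n" + "\n".join(errors[:10])  # surface the first few errors for triage
    notify_slack(summary)
```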

New documents (PDFs, text, HTML, DOCs) are processed into text and assessed for relevancy by GPT-4o via the OpenAI API, using client-specific keywords and zoning codes. Relevant documents are summarized and packaged with metadata — source URL, county name, document type, and government entity — then formatted into HTML reports clustered by entity for readability.
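A hedged sketch of the relevance check and summarization step using the OpenAI Python client follows; the prompt wording, keyword list, and metadata field names are illustrative stand-ins for the client-specific configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CLIENT_KEYWORDS = ["rezoning", "industrial", "warehouse", "site plan"]  # illustrative only


def assess_and_summarize(text, source_url, county, entity, doc_type):
    """Return a summary record with metadata if the document is relevant, else None."""
    prompt = (
        "You review municipal documents for a logistics real estate developer.\n"
        f"Keywords and zoning codes of interest: {', '.join(CLIENT_KEYWORDS)}.\n"
        "Answer RELEVANT or IRRELEVANT on the first line, then a three-sentence summary.\n\n"
        f"Document:\n{text[:12000]}"                          # truncate very long documents
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    verdict, _, summary = resp.choices[0].message.content.partition("\n")
    if not verdict.strip().upper().startswith("RELEVANT"):
        return None
    return {
        "source_url": source_url,          # metadata packaged with each summary
        "county": county,
        "government_entity": entity,
        "document_type": doc_type,
        "summary": summary.strip(),
    }
```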

The final output is delivered via automated email: a test email goes out internally on Sundays, and the production version reaches the client Monday at 8 AM.
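One way to express that Sunday-test, Monday-production routing in the delivery script (the recipient lists are placeholders, and the exact mechanism in the real system is not specified):

```python
import datetime

INTERNAL_RECIPIENTS = ["data-team@example.com"]   # placeholder internal test list
CLIENT_RECIPIENTS = ["client@example.com"]        # placeholder production list


def pick_recipients(today: datetime.date | None = None) -> list[str]:
    """Route Sunday runs to the internal test list and Monday runs to the client."""
    today = today or datetime.date.today()
    if today.weekday() == 6:       # Sunday: internal dry run
        return INTERNAL_RECIPIENTS
    if today.weekday() == 0:       # Monday: production delivery (cron fires at 8 AM)
        return CLIENT_RECIPIENTS
    return []                      # other days: no email
```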

Technical Architecture

The system is built as an embarrassingly parallel pipeline: entities are batched into groups with scraping parallelized across them. Storage is handled via Boto3 with direct memory-to-S3 streaming to minimize unnecessary reads and writes.
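A sketch of the batching helper and the direct memory-to-S3 write, assuming a bucket name and key scheme (nothing is written to a local temporary file):

```python
import io
import boto3

s3 = boto3.client("s3")
BUCKET = "idi-gov-scraper"   # placeholder bucket name


def make_batches(entities: list[str], n_batches: int = 6) -> list[list[str]]:
    """Split the entity list into roughly equal groups for parallel scraping."""
    return [entities[i::n_batches] for i in range(n_batches)]


def store_document(entity: str, filename: str, content: bytes) -> str:
    """Stream a downloaded document from memory straight into S3 and return its key."""
    key = f"raw/{entity}/{filename}"
    s3.upload_fileobj(io.BytesIO(content), BUCKET, key)    # no temporary file on disk
    return key
```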

LLM orchestration uses LangChain to pipe text through GPT-4o with engineered prompts for relevance assessment and summarization. Output is structured as dated JSON files with full metadata, then converted to formatted HTML via Markdown for email delivery through the Gmail API.
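The orchestration and delivery layer could look roughly like the sketch below: a LangChain prompt piped through GPT-4o, a dated JSON report, and Gmail API delivery of the Markdown-rendered HTML. The prompt text, file paths, subject line, and credential handling are assumptions.

```python
import base64
import datetime
import json
import markdown
from email.mime.text import MIMEText
from pathlib import Path
from googleapiclient.discovery import build
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Engineered prompt piped through GPT-4o (prompt wording is illustrative)
summarize = (
    ChatPromptTemplate.from_template(
        "Summarize this municipal document for a logistics developer in three sentences:\n\n{text}"
    )
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)


def write_dated_report(records: list[dict]) -> str:
    """Persist the week's summaries as a dated JSON file with full metadata."""
    path = Path("reports") / f"{datetime.date.today():%Y-%m-%d}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(records, indent=2))
    return str(path)


def send_report(creds, recipients: list[str], md_report: str) -> None:
    """Convert the Markdown report to HTML and send it through the Gmail API."""
    msg = MIMEText(markdown.markdown(md_report), "html")
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "Weekly government data summary"
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    build("gmail", "v1", credentials=creds).users().messages().send(
        userId="me", body={"raw": raw}
    ).execute()
```

Usage would be along the lines of `summary = summarize.invoke({"text": doc_text})`, with `creds` obtained separately through the google-auth OAuth flow.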

All development was managed in a shared GitHub repository with EC2 deployment via SSH. The cron job configuration is maintained separately from the application code for operational flexibility.
