The Daily Los Angeles

L.A.'s Digital Archives Are Riddled With Duplicate Images — and the Cleanup Bill Is Adding Up

City agencies and cultural institutions across Los Angeles are confronting a sprawling redundancy problem in their digital collections, with storage costs and staff hours piling up fast.

By Los Angeles News Desk · Published 4 July 2026, 11:58 am

Photo: Ant Armada on Pexels

Los Angeles city departments and major cultural institutions are sitting on tens of millions of duplicate digital image files, a problem that researchers and archivists say has quietly ballooned into a six-figure annual drain on public and nonprofit budgets alike. The scope became clearer this spring when the Los Angeles County Digital Collections project completed an internal audit of its holdings.

The audit, completed in March 2026, found that roughly 23 percent of image files stored across shared county servers were exact or near-exact duplicates — the result of a decade of department-by-department digitization drives that never talked to each other. That is a data management problem with real dollar figures attached to it.

Digital storage is not free. Enterprise-grade cloud storage for large institutions typically runs between $0.02 and $0.05 per gigabyte per month. Multiply that across a collection running into multiple petabytes, and redundant files represent thousands of dollars in unnecessary monthly overhead. The individual collections are large: the Los Angeles Public Library's photo archive alone exceeded 800,000 digitized images as of its 2025 annual report. Over a full calendar year, those monthly charges accumulate.
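To make the storage arithmetic concrete, the short Python sketch below applies the quoted price range to a hypothetical collection. The 1-petabyte size is an assumption chosen for illustration, and applying the county audit's 23 percent duplicate rate to it is also an assumption; neither figure describes any specific institution named in this article.

```python
def redundant_storage_cost(total_gb, duplicate_share, price_per_gb_month):
    """Monthly cost of storing the duplicate share of a collection."""
    return total_gb * duplicate_share * price_per_gb_month


# Assumptions for illustration only: a 1-petabyte collection (1,000,000 GB)
# and the county audit's 23 percent duplicate rate applied to it.
TOTAL_GB = 1_000_000
DUPLICATE_SHARE = 0.23

# Price range quoted in the article, in dollars per gigabyte per month.
for price in (0.02, 0.05):
    monthly = redundant_storage_cost(TOTAL_GB, DUPLICATE_SHARE, price)
    print(f"${price:.2f}/GB: ${monthly:,.0f} per month, ${monthly * 12:,.0f} per year")
```

Under those assumptions, the redundant files would cost roughly $4,600 to $11,500 a month to store, or about $55,200 to $138,000 a year.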

Where the Redundancy Lives

The Getty Research Institute in Brentwood and the Autry Museum of the American West in Griffith Park are among the institutions that have launched formal deduplication programs in the past 18 months. Both declined to provide specific figures for this article, but the broader sector trend is documented. A 2024 survey by the Digital Preservation Coalition found that 61 percent of collecting institutions globally had identified significant duplicate-image problems in collections digitized before 2018 — the era before hash-based verification tools became standard practice in archival workflows.

The Los Angeles County Metropolitan Transportation Authority ran into its own version of this problem during preparations for the 2028 Olympics infrastructure documentation push. Staff archiving construction progress photos of the Crenshaw/LAX Line extension and the new East San Fernando Valley light rail corridor discovered that field photographers had uploaded the same job-site images to multiple project folders, sometimes three or four times over. The MTA's digital asset management overhaul, budgeted at $1.4 million in the agency's fiscal year 2025-26 capital plan, includes automated deduplication as a core deliverable.
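For readers curious what hash-based verification looks like in practice, here is a minimal Python sketch of one common approach: compute a SHA-256 digest of each file's contents and group files that share a digest. It is an illustration only, not the method used by the MTA or any institution named here, and the function names are hypothetical. It finds byte-for-byte copies, such as the same photo uploaded to several project folders; it does not catch near-duplicates like resized or re-encoded versions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by content hash; return groups of two or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("some/folder")` on a hypothetical folder returns a list of groups, each a list of paths with identical contents. Deciding which copy in each group is the authoritative record remains a human judgment.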

The city's Bureau of Engineering faces a parallel challenge. Engineers documenting seismic retrofit work on the 6th Street Viaduct replacement and along the Sepulveda Pass corridor have generated enormous image volumes, and project managers have flagged duplication as a complicating factor in version control. When field staff cannot quickly tell which photo is the authoritative record, decisions slow down.

The Numbers Driving Urgency

Data from the nonprofit Internet Archive, which partners with several Los Angeles institutions, shows that a single poorly managed digitization project can generate a duplicate rate above 30 percent when upload protocols are not enforced from the start. Cleaning those files after the fact is not simply a matter of deleting extras. Staff must verify which version of a near-duplicate is the highest-resolution original, check metadata integrity, and update catalog references. Archivists at institutions like the UCLA Film & Television Archive estimate that the process takes between four and eight hours per thousand files when done properly.

At a loaded labor rate of roughly $45 per hour for trained archival staff in Los Angeles County, processing 100,000 duplicate images could carry a labor cost between $18,000 and $36,000 before any software licensing fees. Purpose-built deduplication platforms marketed to cultural institutions start at around $8,000 per year for mid-sized collections, with enterprise tiers running $40,000 and above.
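The labor estimate follows directly from the figures above: files divided by 1,000, times hours per thousand files, times the hourly rate. The Python sketch below reproduces that calculation; the function name is hypothetical, and the inputs are the article's own numbers.

```python
def cleanup_labor_cost(file_count, hours_per_thousand, hourly_rate):
    """Labor cost of reviewing file_count duplicates at a given pace and rate."""
    return file_count / 1000 * hours_per_thousand * hourly_rate


# Figures from the article: 100,000 files, 4 to 8 hours per thousand files,
# and a loaded labor rate of roughly $45 per hour.
for hours in (4, 8):
    cost = cleanup_labor_cost(100_000, hours, 45)
    print(f"{hours} hours per thousand files: ${cost:,.0f}")
```

That works out to 400 to 800 staff hours, or $18,000 to $36,000, before software costs are counted.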

For institutions already stretched by inflation and reduced grant funding, those numbers demand a decision: pay now for cleanup, or keep paying monthly for redundant storage while the backlog grows. Several L.A.-area archivists interviewed for background said the Olympic documentation mandate is forcing institutions to resolve this before 2027, when the volume of incoming construction and event imagery is expected to spike sharply. The pressure has made July 2026 something of a deadline in its own right — not imposed by law, but by arithmetic.

About this article

Published by The Daily Los Angeles

This article was produced by The Daily Los Angeles editorial desk and covers news in Los Angeles. See our editorial standards for how we use AI.
