SpeechVerifier: Robust Acoustic Fingerprint against Tampering Attacks via Watermarking

Anonymous Authors

SpeechVerifier system overview

Abstract

Advances in audio editing have made public speeches vulnerable to malicious tampering, raising concerns about social trust. Existing detection methods remain insufficient: they either rely on external references or fail to balance sensitivity to attacks with robustness against benign operations such as compression. To address these challenges, we propose SpeechVerifier, the first learning-based, self-contained speech integrity verification framework. SpeechVerifier employs a decoupled fingerprint-watermark architecture: a multiscale feature extractor captures speech characteristics across different temporal resolutions, and contrastive learning generates fingerprints that remain stable under benign operations yet change significantly under malicious tampering. These fingerprints are embedded into the audio via robust watermarking, enabling direct verification without external references. Extensive experiments demonstrate that SpeechVerifier reliably detects tampering while maintaining robustness against common benign operations. Real-world evaluations further confirm its effectiveness in verifying speech integrity.

Real-world dataset examples

Original Audio (Audio1)

Transcript: "The board has decided they can not approve the new budget."

Hamming Distance: 12
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio1, Deletion)

Transcript: "The board has decided they can not approve the new budget."

Hamming Distance: 131
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio2)

Transcript: "Our analysis shows this investment is not a secure option."

Hamming Distance: 6
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio2, Silencing)

Transcript: "Our analysis shows this investment is not a secure option."

Hamming Distance: 107
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio3)

Transcript: "Based on the evidence, the suspect is innocent."

Hamming Distance: 1
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio3, Substitution)

Transcript: "Based on the evidence, the suspect is guilty."

Hamming Distance: 123
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio4)

Transcript: "Based on the evidence, the suspect is guilty."

Hamming Distance: 22
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio4, Substitution)

Transcript: "Based on the evidence, the suspect is innocent."

Hamming Distance: 118
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio5)

Transcript: "I never said she stole the company's data."

Hamming Distance: 11
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio5, Reordering)

Transcript: "She stole the company's data, I never said."

Hamming Distance: 128
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio6)

Transcript: "We will begin the product launch immediately."

Hamming Distance: 15
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio6, Substitution)

Transcript: "We will delay the product launch immediately."

Hamming Distance: 100
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio7)

Transcript: "We will delay the product launch immediately."

Hamming Distance: 9
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio7, Substitution)

Transcript: "We will begin the product launch immediately."

Hamming Distance: 121
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio8)

Transcript: "I believe it's a good idea, but we need more time."

Hamming Distance: 2
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio8, Splicing)

Transcript: "I never said I believe it's a good idea, but we need more time."

Hamming Distance: 125
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio9)

Transcript: "I never said she stole the company's data."

Hamming Distance: 15
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio9, Text-to-Speech)

Transcript: "This is authentic audio, not deepfake."

Hamming Distance: 135
(Threshold: 42 → Verdict: TAMPERED)

Original Audio (Audio10)

Transcript: "I never said she stole the company's data."

Hamming Distance: 21
(Threshold: 42 → Verdict: LEGIT)

Tampered Audio (Audio10, Voice Conversion)

Transcript: "I never said she stole the company's data."

Note: Voice Timbre Changed

Hamming Distance: 126
(Threshold: 42 → Verdict: TAMPERED)
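
Decision rule sketch

The verdicts listed above are consistent with a simple threshold rule on the Hamming distance. The following Python is a minimal sketch, not the authors' implementation. It assumes that the fingerprint recomputed from the audio and the fingerprint recovered from the watermark are equal-length bit sequences, and that a distance at or below the threshold counts as legitimate (the page does not say how a distance of exactly 42 is treated). The function names and the 8-bit toy fingerprints are hypothetical; only the threshold of 42 and the reported distances come from the examples.

```python
from typing import Sequence

# Threshold reported in the examples on this page.
THRESHOLD = 42


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))


def verdict(distance: int, threshold: int = THRESHOLD) -> str:
    """Map a Hamming distance to a verdict.

    Assumption: a distance equal to the threshold is treated as LEGIT.
    """
    return "LEGIT" if distance <= threshold else "TAMPERED"


# Toy 8-bit fingerprints (hypothetical; the real fingerprint length is not
# stated on this page).
recomputed = [1, 0, 1, 1, 0, 0, 1, 0]
extracted = [1, 0, 0, 1, 0, 1, 1, 0]
toy_distance = hamming_distance(recomputed, extracted)

# A few of the distances reported in the examples above.
reported = {
    "Audio1 original": 12,
    "Audio1 deletion": 131,
    "Audio3 original": 1,
    "Audio3 substitution": 123,
    "Audio10 voice conversion": 126,
}
verdicts = {name: verdict(d) for name, d in reported.items()}
```

In the reported examples the original recordings have distances between 1 and 22 and the tampered ones between 100 and 135, so the outcome for these cases does not depend on how the boundary value is handled.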