Opened 12 years ago
Closed 11 years ago
#2502 closed defect (fixed)
ffprobe Produces Invalid JSON
Reported by: | dnicolson | Owned by: | |
---|---|---|---|
Priority: | normal | Component: | ffprobe |
Version: | unspecified | Keywords: | utf8 |
Cc: | eml+ffmpeg@tupil.com | Blocked By: | |
Blocking: | | Reproduced by developer: | yes |
Analyzed by developer: | yes | | |
Description
When running ffprobe with the -print_format switch as json, it can produce invalid Unicode escaping. Snippet of JSON code attached to the ticket.
Environment:
ffmpeg 1.2 installed via Homebrew.
Python code used to call ffsnoop:
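import json
import subprocess
from StringIO import StringIO  # Python 2 module, matching the file() call used below
# path is the media file to inspect
cmnd = ['ffprobe', '-v', 'quiet', '-print_format', 'json', '-show_format', '-show_streams', path]
p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
io = StringIO(out)
info = json.load(io)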
Python code to reproduce with attached JSON:
json.load(StringIO(file("output.json").read()))
Attachments (3)
Change History (20)
by , 12 years ago
Attachment: | output.json added |
---|
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Component: | undetermined → FFprobe |
---|
Could you provide a sample command line that produces the invalid output?
Is a specific sample necessary?
comment:3 by , 12 years ago
ffprobe -v quiet -print_format json -show_format -show_streams "/path/to/media"
The invalid characters are displayed as ' æ' in the artist field when viewing the media information in VLC.
comment:4 by , 12 years ago
Sorry, I forgot to ask for the complete, uncut console output: It is needed for all tickets to make them valid.
If I understand correctly that a specific sample is needed to reproduce the problem, please provide it.
by , 12 years ago
Attachment: | output2.json added |
---|
comment:5 by , 12 years ago
The entire stdout has been attached. This was a different video, but the same character 'æ' was found in the metadata.
comment:6 by , 12 years ago
Unless you provide the input file, or at least a sufficient part of it, it is not possible to determine if the problem comes from an invalid file or a bug in ffmpeg.
comment:7 by , 12 years ago
You already provided the relevant part of the JSON output, please provide ffprobe's console output to make this a valid ticket.
comment:8 by , 12 years ago
Making a reduced case by reducing the size of the video could help, but surely it's still a bug if ffprobe is outputting invalid JSON no matter what kind of file is passed to it?
In any case, I just researched potential invalid JSON cases and can confirm that this can be fixed by outputting JSON in valid UTF-8. The attached ffprobe output was encoded in ISO-8859-1.
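The encoding mismatch described above can be illustrated with a minimal Python 3 sketch. The `raw` bytes are a made-up stand-in for ffprobe output (not taken from the attachments), and `parse_probe_output` and its ISO-8859-1 fallback are hypothetical names and choices used only for illustration; this is a client-side workaround, not the fix that was applied to ffprobe.

```python
import json

# Hypothetical ffprobe output: the artist tag holds a raw ISO-8859-1 byte (0xE6),
# which is not a valid UTF-8 sequence on its own.
raw = b'{"format": {"tags": {"artist": "\xe6"}}}'

def parse_probe_output(data, fallback="iso-8859-1"):
    """Parse JSON bytes as UTF-8; if that fails, retry with an assumed fallback encoding."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value maps to a character in ISO-8859-1, so this decode cannot fail,
        # but the result is only correct if the metadata really was ISO-8859-1.
        text = data.decode(fallback)
    return json.loads(text)

info = parse_probe_output(raw)
artist = info["format"]["tags"]["artist"]
```

The fallback encoding is a guess: nothing in the output itself says which encoding the metadata was written in.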
comment:9 by , 12 years ago
The way of fixing the bug is completely different whether the file is invalid or not. Please provide a file.
comment:10 by , 12 years ago
It's possible the input file has some metadata in ISO-8859-1 which is not converted to UTF-8 by FFmpeg (metadata is assumed to be UTF-8) and thus causes invalid output in ffprobe. Maybe invalid UTF-8 characters could be removed or replaced in ffprobe, but that won't solve the real problem (it would "just" avoid invalid JSON output). OTOH, metadata re-encoding could be supported in FFmpeg. Anyway, a sample would be welcome.
comment:11 by , 12 years ago
I'm not able to provide a reference file unfortunately. I'm not sure how avi files store their metadata, but all three instances of the problem were AVI files.
Perhaps a byte range check would be sufficient, not too dissimilar to how the Terminal on OS X handles raw stdout.
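One way such a check could look from the caller's side is sketched below in Python 3. It relies on the standard decoder rather than hand-written byte ranges, and `first_invalid_utf8_offset` is a hypothetical helper name, not something ffprobe provides.

```python
def first_invalid_utf8_offset(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None

valid_offset = first_invalid_utf8_offset(b"B\xc3\xa6")      # UTF-8 encoded text
invalid_offset = first_invalid_utf8_offset(b"Bl\xb3mchen")  # stray ISO-8859-1 byte
```

A check like this only detects the problem; deciding what to do with the offending bytes is a separate step.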
comment:12 by , 12 years ago
May be related to ticket #1163.
The problem looks like the file contains invalid non UTF-8 data.
comment:13 by , 11 years ago
This file seems to have the same problem outputting '³' (\u00b3) in the author field ("Bl³mchen").
http://samples.mplayerhq.hu/vqf/handinha.vqf
Also does ffprobe set the console output page to UTF-8 when it starts up (via SetConsoleOutputCP)?
by , 11 years ago
Attachment: | test-pattern.avi added |
---|
follow-up: 16 comment:14 by , 11 years ago
I have made a reduced case and attached a file (test-pattern.avi), as requested.
I created an AVI file with ffmpeg using the following command:
ffmpeg -i test-pattern-orig.avi -metadata title="æ" -metadata artist="echo -e \"\xe6\"" -vcodec copy -acodec copy test-pattern.avi
(backticks need to be added around the monospaced text).
This creates the file test-pattern.avi with the title as a UTF-8 encoded lowercase AE and the artist as an ISO-8859-1 encoded lowercase AE. VLC displays metadata in ISO-8859-1 so the artist is correctly displayed as "æ" but displays the title as "Ã¦".
Because ffprobe assumes that all metadata is valid UTF-8, the following command produces invalid JSON:
ffprobe -v quiet -print_format json -show_format -show_streams test-pattern.avi | python -c 'import json,sys; json.load(sys.stdin)'
A possible solution would be to strip invalid UTF-8 characters, or maybe provide an alternate switch to replace invalid characters?
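Both options can be sketched on the consuming side in Python 3, using the standard `errors` argument of `bytes.decode`. The `raw` bytes are a made-up stand-in for the output of the command above, and `sanitize` is a hypothetical helper; this only illustrates the idea and says nothing about how ffprobe itself would implement it.

```python
import json

def sanitize(data, mode="replace"):
    """Decode bytes as UTF-8, replacing or dropping invalid sequences."""
    if mode not in ("replace", "ignore"):
        raise ValueError("mode must be 'replace' or 'ignore'")
    return data.decode("utf-8", errors=mode)

# Title is valid UTF-8 (0xC3 0xA6); artist is a lone ISO-8859-1 byte (0xE6).
raw = b'{"title": "\xc3\xa6", "artist": "\xe6"}'

replaced = json.loads(sanitize(raw, "replace"))  # invalid byte becomes U+FFFD
stripped = json.loads(sanitize(raw, "ignore"))   # invalid byte is dropped
```

Replacing keeps a visible marker that something was lost, while stripping silently shortens the value; either way the original character is not recovered.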
comment:15 by , 11 years ago
Cc: | added |
---|
comment:16 by , 11 years ago
Analyzed by developer: | set |
---|---|
Keywords: | utf8 added |
Reproduced by developer: | set |
Status: | new → open |
Replying to dnicolson:
I have made a reduced case and attached a file (test-pattern.avi), as requested.
I created an AVI file with ffmpeg using the following command:
ffmpeg -i test-pattern-orig.avi -metadata title="æ" -metadata artist="echo -e \"\xe6\"" -vcodec copy -acodec copy test-pattern.avi
(backticks need to be added around the monospaced text).
This creates the file test-pattern.avi with the title as a UTF-8 encoded lowercase AE and the artist as an ISO-8859-1 encoded lowercase AE. VLC displays metadata in ISO-8859-1 so the artist is correctly displayed as "æ" but displays the title as "Ã¦".
AE in ISO8859-1 = 0xE6
AE in UTF-8 = 0xC3A6
As a consequence, AE encoded in UTF-8 will render in ISO8859-1 as two distinct characters, and ISO8859-1 AE will not correspond to a valid UTF-8 sequence.
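The byte values can be checked with a short Python 3 sketch (the variable names are only illustrative): lowercase æ is U+00E6, a single byte 0xE6 in ISO-8859-1 and the two bytes 0xC3 0xA6 in UTF-8.

```python
ae = "\u00e6"  # lowercase ae ligature

latin1_bytes = ae.encode("iso-8859-1")  # single byte 0xE6
utf8_bytes = ae.encode("utf-8")         # two bytes 0xC3 0xA6

# Reading the UTF-8 bytes as ISO-8859-1 yields two separate characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# Reading the ISO-8859-1 byte as UTF-8 fails: 0xE6 starts a multi-byte
# sequence that is never completed.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```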
Now the problem is to understand what's the reference encoding. FFmpeg always assumes UTF-8, so you should provide metadata encoded in UTF-8 format. Note that your command is broken since you're explicitly passing an invalid UTF-8 sequence to the metadata option (which expects UTF-8 data).
Currently there is no way to specify (nor autodetect) the assumed encoding.
Because ffprobe assumes that all metadata is valid UTF-8, the following command produces invalid JSON:
ffprobe -v quiet -print_format json -show_format -show_streams test-pattern.avi | python -c 'import json,sys; json.load(sys.stdin)'
A possible solution would be to strip invalid UTF-8 characters, or maybe provide an alternate switch to replace invalid characters?
Implemented in an experimental patchset, see ticket #1163.
comment:17 by , 11 years ago
Resolution: | → fixed |
---|---|
Status: | open → closed |
It should be fixed in:
commit cbba331aa02f29870581ff0b7ded7477b279ae2c
Author: Stefano Sabatini <stefasab@gmail.com>
Date: Wed Oct 2 16:22:17 2013 +0200
ffprobe: implement string validation setting
This should fix trac tickets #1163, #2502.
Feel free to test and reopen in case of issues.
ffsnoop was meant to be ffprobe.