Opened 12 years ago

Closed 11 years ago

#2502 closed defect (fixed)

ffprobe Produces Invalid JSON

Reported by: dnicolson
Owned by:
Priority: normal
Component: ffprobe
Version: unspecified
Keywords: utf8
Cc: eml+ffmpeg@tupil.com
Blocked By:
Blocking:
Reproduced by developer: yes
Analyzed by developer: yes

Description

When running ffprobe with the -print_format switch as json, it can produce invalid Unicode escaping. Snippet of JSON code attached to the ticket.

Environment:
ffmpeg 1.2 installed via Homebrew.

Python code used to call ffsnoop:
import json
import subprocess
from StringIO import StringIO  # Python 2 module, as used in this report

cmnd = ['ffprobe', '-v', 'quiet', '-print_format', 'json', '-show_format', '-show_streams', path]
p = subprocess.Popen(cmnd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
io = StringIO(out)
info = json.load(io)

Python code to reproduce with attached JSON:
json.load(StringIO(file("output.json").read()))

Attachments (3)

output.json (222 bytes ) - added by dnicolson 12 years ago.
output2.json (3.0 KB ) - added by dnicolson 12 years ago.
test-pattern.avi (280.8 KB ) - added by dnicolson 11 years ago.

Change History (20)

by dnicolson, 12 years ago

Attachment: output.json added

comment:1 by dnicolson, 12 years ago

ffsnoop was meant to be ffprobe.

comment:2 by Carl Eugen Hoyos, 12 years ago

Component: undetermined → FFprobe

Could you provide a sample command line that produces the invalid output?
Is a specific sample necessary?

comment:3 by dnicolson, 12 years ago

ffprobe -v quiet -print_format json -show_format -show_streams "/path/to/media"

The invalid characters are displayed as ' æ' in the artist field when viewing the media information in VLC.

comment:4 by Carl Eugen Hoyos, 12 years ago

Sorry, I forgot to ask for the complete, uncut console output: It is needed for all tickets to make them valid.

If I understand correctly that a specific sample is needed to reproduce the problem, please provide it.

by dnicolson, 12 years ago

Attachment: output2.json added

comment:5 by dnicolson, 12 years ago

The entire stdout has been attached. This was a different video, but the same character 'æ' was found in the metadata.

comment:6 by Cigaes, 12 years ago

Unless you provide the input file, or at least a sufficient part of it, it is not possible to determine if the problem comes from an invalid file or a bug in ffmpeg.

comment:7 by Carl Eugen Hoyos, 12 years ago

You already provided the relevant part of the JSON output, please provide ffprobe's console output to make this a valid ticket.

comment:8 by dnicolson, 12 years ago

Making a reduced case by reducing the size of the video could help, but surely it's still a bug if ffprobe is outputting invalid JSON no matter what kind of file is passed to it?

In any case, I just researched potential invalid JSON cases and can confirm that this can be fixed by outputting JSON in valid UTF-8. The attached ffprobe output was encoded in ISO-8859-1.

comment:9 by Cigaes, 12 years ago

The way of fixing the bug is completely different whether the file is invalid or not. Please provide a file.

comment:10 by Clément Bœsch, 12 years ago

It's possible the input file has some metadata in ISO-8859-1 which are not converted to UTF-8 by FFmpeg (metadata are assumed to be UTF-8) and thus cause an invalid output in ffprobe. Maybe non-valid UTF-8 character could be removed/replaced in ffprobe, but that won't solve the real problem ("just" avoid invalid json outputs). OTOH, metadata re-encoding could be supported in FFmpeg. Anyway, a sample would be welcome.
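
As an illustration of the re-encoding idea above, here is a minimal Python 3 sketch of a fallback decoder for a raw metadata value. The helper name `decode_metadata` and the choice of ISO-8859-1 as the fallback are assumptions made for this example; it does not describe how FFmpeg handles metadata.

```python
def decode_metadata(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode raw tag bytes as UTF-8, falling back to a legacy encoding.

    Hypothetical helper for illustration only; FFmpeg itself assumes UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # The default ISO-8859-1 fallback maps every byte to a code point,
        # so this second decode cannot fail with the default argument.
        return raw.decode(fallback)


title = decode_metadata(b"\xc3\xa6")  # valid UTF-8 for U+00E6
artist = decode_metadata(b"\xe6")     # lone ISO-8859-1 byte for U+00E6
```

A fallback like this only guesses: any byte string decodes as ISO-8859-1, so text in another legacy encoding would be silently misread rather than rejected.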

comment:11 by dnicolson, 12 years ago

I'm not able to provide a reference file unfortunately. I'm not sure how avi files store their metadata, but all three instances of the problem were AVI files.

Perhaps a byte range check would be sufficient enough, not too dissimilar to how the Terminal on OS X handles raw stdout.

comment:12 by Stefano Sabatini, 12 years ago

May be related to ticket #1163.

The problem looks like the file contains invalid non UTF-8 data.

comment:13 by Luke Quinane, 12 years ago

This file seems to have the same problem outputting '³' (\u00b3) in the author field ("Bl³mchen").

http://samples.mplayerhq.hu/vqf/handinha.vqf

Also does ffprobe set the console output page to UTF-8 when it starts up (via SetConsoleOutputCP)?

by dnicolson, 11 years ago

Attachment: test-pattern.avi added

comment:14 by dnicolson, 11 years ago

I have made a reduced case and attached a file (test-pattern.avi), as requested.

I created an AVI file with ffmpeg using the following command:

ffmpeg -i test-pattern-orig.avi -metadata title="æ" -metadata artist="echo -e \"\xe6\"" -vcodec copy -acodec copy test-pattern.avi
(backticks need to be added around the monospaced text).

This creates the file test-pattern.avi with the title as a UTF-8 encoded lowercase AE and the artist as an ISO-8859-1 encoded lowercase AE. VLC displays metadata in ISO-8859-1, so the artist is correctly displayed as "æ" but the title is displayed as "Ã¦".

Because ffprobe assumes that all metadata is valid UTF-8, the following command produces invalid JSON:

ffprobe -v quiet -print_format json -show_format -show_streams test-pattern.avi | python -c 'import json,sys; json.load(sys.stdin)'

A possible solution would be to strip invalid UTF-8 characters, or maybe provide an alternate switch to replace invalid characters?
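
Until ffprobe handles this itself, a caller could apply the same strip-or-replace idea on its side before parsing. The following Python 3 sketch is only an assumption about how that might look; the `raw` bytes are a hand-written stand-in for ffprobe's stdout, not real output from the attached file.

```python
import json

# Simulated ffprobe stdout: the "title" value is valid UTF-8, while the
# "artist" value is a lone 0xE6 byte (ISO-8859-1), which is invalid UTF-8.
raw = b'{"format": {"tags": {"title": "\xc3\xa6", "artist": "\xe6"}}}'

# errors="replace" substitutes U+FFFD for each undecodable byte;
# errors="ignore" drops the undecodable byte entirely.
replaced = json.loads(raw.decode("utf-8", errors="replace"))
stripped = json.loads(raw.decode("utf-8", errors="ignore"))

print(repr(replaced["format"]["tags"]["artist"]))
print(repr(stripped["format"]["tags"]["artist"]))
```

Replacing keeps a visible marker that something was lost, whereas stripping hides the problem; which is preferable depends on what the caller does with the tags.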

comment:15 by eelco, 11 years ago

Cc: eml+ffmpeg@tupil.com added

comment:16 by Stefano Sabatini, 11 years ago (in reply to: 14)

Analyzed by developer: set
Keywords: utf8 added
Reproduced by developer: set
Status: new → open

Replying to dnicolson:

I have made a reduced case and attached a file (test-pattern.avi), as requested.

I created an AVI file with ffmpeg using the following command:

ffmpeg -i test-pattern-orig.avi -metadata title="æ" -metadata artist="echo -e \"\xe6\"" -vcodec copy -acodec copy test-pattern.avi
(backticks need to be added around the monospaced text).

This creates the file test-pattern.avi with the title as a UTF-8 encoded lowercase AE and the artist as an ISO-8859-1 encoded lowercase AE. VLC displays metadata in ISO-8859-1, so the artist is correctly displayed as "æ" but the title is displayed as "Ã¦".

AE in ISO8859-1 = 0xE6
AE in UTF-8 = 0xC3A6

As a consequence, AE encoded in UTF-8 will render in ISO8859-1 as two distinct characters, and ISO8859-1 AE will not correspond to a valid UTF-8 sequence.
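
The byte values can be checked with a few lines of Python 3. This is a small standalone sketch, not part of the ticket's test case; the variable names are chosen for this example.

```python
ae = "\u00e6"  # lowercase ae ligature

latin1_bytes = ae.encode("iso-8859-1")  # one byte
utf8_bytes = ae.encode("utf-8")         # two bytes

# UTF-8 bytes read back as ISO-8859-1 yield two separate characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# The ISO-8859-1 byte on its own is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_bytes.hex(), utf8_bytes.hex(), repr(mojibake), latin1_is_valid_utf8)
```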

Now the problem is to understand what's the reference encoding. FFmpeg always assumes UTF-8, so you should provide metadata encoded in UTF-8 format. Note that your command is broken since you're explicitly passing an invalid UTF-8 sequence to the metadata option (which expects UTF-8 data).

Currently there is no way to specify (nor autodetect) the assumed encoding.

Because ffprobe assumes that all metadata is valid UTF-8, the following command produces invalid JSON:

ffprobe -v quiet -print_format json -show_format -show_streams test-pattern.avi | python -c 'import json,sys; json.load(sys.stdin)'

A possible solution would be to strip invalid UTF-8 characters, or maybe provide an alternate switch to replace invalid characters?

Implemented in an experimental patchset, see ticket #1163.

comment:17 by Stefano Sabatini, 11 years ago

Resolution: fixed
Status: open → closed

It should be fixed in:

commit cbba331aa02f29870581ff0b7ded7477b279ae2c
Author: Stefano Sabatini <stefasab@gmail.com>
Date:   Wed Oct 2 16:22:17 2013 +0200

    ffprobe: implement string validation setting
    
    This should fix trac tickets #1163, #2502.

Feel free to test and reopen in case of issues.
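
For anyone testing from Python, here is a hedged sketch of how the setting might be used. It assumes the writer option documented for ffprobe (`string_validation`, short form `sv`, with values `fail`, `replace` and `ignore`) is passed after the writer name in `-print_format`; the function names `build_command` and `probe` are made up for this example, and the exact behaviour should be checked against the ffprobe build in use.

```python
import json
import subprocess


def build_command(path, validation="replace"):
    """Build an ffprobe command line requesting JSON with string validation."""
    return [
        "ffprobe", "-v", "quiet",
        # Writer options follow the writer name as key=value pairs.
        "-print_format", "json=string_validation=" + validation,
        "-show_format", "-show_streams",
        path,
    ]


def probe(path, validation="replace"):
    """Run ffprobe and parse its JSON output (requires ffprobe on PATH)."""
    result = subprocess.run(
        build_command(path, validation),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True,
    )
    return json.loads(result.stdout.decode("utf-8"))
```

Calling `probe("test-pattern.avi")` on a build that includes the commit above is expected to parse without a decoding error, since invalid sequences are replaced before the JSON is written.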
