Opened 13 years ago

Closed 10 years ago

Last modified 10 years ago

#1163 closed defect (fixed)

ffprobe can produce invalid XML

Reported by: Ian Ashley Owned by: stefano
Priority: normal Component: ffprobe
Version: git-master Keywords: utf8
Cc: n@ndj7.com, eml+ffmpeg@tupil.com Blocked By:
Blocking: Reproduced by developer: yes
Analyzed by developer: yes

Description

ffprobe can output invalid XML as xml_escape_str only handles < > ' " and &. For example most escape characters below 32 are invalid UTF-8.

This replacement version of the function replaces any invalid UTF-8 characters by the inverted question mark.

static const char *xml_escape_str(char **dst, size_t *dst_size, const char *src,
                                  void *log_ctx)
{
    // the unknown character (inverted question mark)
    const unsigned char BAD_CHARACTER_1 = 194, BAD_CHARACTER_2 = 191;

    const char *p;
    char *q;
    int copyAll = 1;
    size_t size = 1;

    /* precompute size */
    for (p = src; *p;) {
        int badChar = 0;
        unsigned char byte;

        ESCAPE_CHECK_SIZE(src, size, SIZE_MAX-10);

        byte = (unsigned char)*p;
        if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
            badChar = 1;
            ++p;
        } else if (byte < 128) {
            switch (byte) {
                case '&' : size += 5; /* &amp; */  copyAll = 0; break;
                case '<' : size += 4; /* &lt; */   copyAll = 0; break;
                case '>' : size += 4; /* &gt; */   copyAll = 0; break;
                case '\"': size += 6; /* &quot; */ copyAll = 0; break;
                case '\'': size += 6; /* &apos; */ copyAll = 0; break;
                default: size++;
                }
            ++p;
            ++size;
            }
        else if (byte < 0xC0)
            {
            badChar = 1;
            ++p;
            }
        else
            {
            int extra;

            copyAll = 0;
            if (byte < 0xe0)
                extra = 1;
            else if (byte < 0xf0)
                extra = 2;
            else if (byte < 0xf8)
                extra = 3;
            else
                badChar = 1;

            if (badChar)
                ++p;
            else
                {
                ++p;
                for (int i = 0; i < extra && *p; ++i, ++p)
                    {
                    byte = (unsigned char)*p;
                    if ((byte & 0xc0) != 0x80)
                        badChar = 1;
                    }
                if (!badChar)
                    size += extra;
                }
            }
        if (badChar) {
            size += 2;
            copyAll = 0;
            }
        }

    ESCAPE_REALLOC_BUF(dst_size, dst, src, size);

#define COPY_STR(str) {      \
        const char *s = str; \
        while (*s)           \
            *q++ = *s++;     \
    }

    p = src;
    q = *dst;
    if (copyAll)
        COPY_STR(p)
    else {
        while (*p) {
            int badChar = 0;
            unsigned char byte;
    
            byte = (unsigned char)*p;
            if (byte < 32 && byte != 9 && byte != 10 && byte != 13) {
                badChar = 1;
                ++p;
            } else if (byte < 128) {
                switch (byte) {
                    case '&' : COPY_STR("&amp;");  break;
                    case '<' : COPY_STR("&lt;");   break;
                    case '>' : COPY_STR("&gt;");   break;
                    case '\"': COPY_STR("&quot;"); break;
                    case '\'': COPY_STR("&apos;"); break;
                    default: *q++ = *p;
                    }
                ++p;
                ++size;
                }
            else if (byte < 0xC0)
                {
                badChar = 1;
                ++p;
                }
            else
                {
                int extra;
    
                copyAll = 0;
                if (byte < 0xe0)
                    extra = 1;
                else if (byte < 0xf0)
                    extra = 2;
                else if (byte < 0xf8)
                    extra = 3;
                else
                    badChar = 1;
    
                if (badChar)
                    ++p;
                else
                    {
                    const char *startChar = p;
                    int i;
                    ++p;
                    for (i = 0; i < extra && *p; ++i, ++p)
                        {
                        byte = (unsigned char)*p;
                        if ((byte & 0xc0) != 0x80)
                            badChar = 1;
                        }
                    if (!badChar) {
                        for (i = 0; i < extra;)
                            *q++ = *startChar++;
                    }
                }
            }
            if (badChar) {
                *q++ = BAD_CHARACTER_1;
                *q++ = BAD_CHARACTER_2;
            }
        }
    }
    *q = 0;

    return *dst;
}

Attachments (7)

0001-ffprobe-check-for-errors-and-abort-immediately.patch (10.0 KB ) - added by Stefano Sabatini 11 years ago.
0002-lavu-avstring-add-av_get_utf8-function.patch (5.1 KB ) - added by Stefano Sabatini 11 years ago.
0003-ffprobe-implement-string-validation-policy-setting.patch (10.0 KB ) - added by Stefano Sabatini 11 years ago.
0001-lavu-avstring-add-av_utf8_decode-function.patch (7.2 KB ) - added by Stefano Sabatini 11 years ago.
0002-ffprobe-check-for-errors-and-abort-immediately.patch (9.9 KB ) - added by Stefano Sabatini 11 years ago.
0003-ffprobe-implement-string-validation-policy-setting.2.patch (12.4 KB ) - added by Stefano Sabatini 11 years ago.
1163-utf8bug.rm (100.0 KB ) - added by Carl Eugen Hoyos 11 years ago.

Download all attachments as: .zip

Change History (23)

comment:1 by Carl Eugen Hoyos, 13 years ago

Is the problem also reproducible with current git head?

Please send unified patches to ffmpeg-devel and please provide an example (including console) on how to produce the invalid xml.

comment:2 by Đonny, 12 years ago

Hi, I can reproduce it with ffprobe version N-42704-g85761ef built on Jul 20 2012 20:39:19
Test file can be downloaded here: http://pangea.rferl.org/videoroot/Pangeavideo/2012/08/2/28/28efb444-f6bc-430c-ae8c-50d613f7fde9.mp4

Last edited 12 years ago by Đonny (previous) (diff)

comment:3 by Nathanael Jones, 12 years ago

Cc: n@ndj7.com added

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

This is currently blocking our integration of ffmpeg thumbnailing within http://imageresizing.net. A large percentage (~40%) of videos cause ffprobe to produce invalid xml.

in reply to:  3 comment:4 by Carl Eugen Hoyos, 12 years ago

Replying to nathanaeljones:

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

First step would be to make this a reproducible ticket by providing a failing command line together with complete, uncut console output (current git head).

in reply to:  3 comment:5 by Stefano Sabatini, 12 years ago

Replying to nathanaeljones:

I would be willing to fund development of a patch if it can be completed/integrated soon. Anyone interested?

This is currently blocking our integration of ffmpeg thumbnailing within http://imageresizing.net. A large percentage (~40%) of videos cause ffprobe to produce invalid xml.

I'm interested in this, but I need more samples. Also the problem seems more related to invalid input data and/or encoding, so it looks more like a libavformat problem. Data should be probably validated and sanitized at the demuxing level, before rendering is performed at the application level.

in reply to:  description comment:6 by Stefano Sabatini, 11 years ago

Replying to Ian:

ffprobe can output invalid XML as xml_escape_str only handles < > ' " and &. For example most escape characters below 32 are invalid UTF-8.

Useful samples are in tickets #2955.

comment:7 by eelco, 11 years ago

Cc: eml+ffmpeg@tupil.com added

comment:8 by Stefano Sabatini, 11 years ago

Analyzed by developer: set
Keywords: utf8 added
Reproduced by developer: set
Status: newopen
Version: 0.10.2unspecified

Please test the patches in attachment and report.

comment:9 by Alex, 11 years ago

Applied on latest version: works like a charm. Thanks @saste!

comment:10 by Alex, 11 years ago

Hm, not so good as I thought. On corrupted files I got segmentation fault.

in reply to:  10 comment:11 by Stefano Sabatini, 11 years ago

Replying to mente:

Hm, not so good as I thought. On corrupted files I got segmentation fault.

Check new patchset, in case of error please provide the sample.

comment:12 by Stefano Sabatini, 11 years ago

Resolution: fixed
Status: openclosed

I believed this should be fixed in:

commit cbba331aa02f29870581ff0b7ded7477b279ae2c
Author: Stefano Sabatini <stefasab@gmail.com>
Date:   Wed Oct 2 16:22:17 2013 +0200

    ffprobe: implement string validation setting
    
    This should fix trac tickets #1163, #2502.

Feel free to test and reopen the ticket in case of issues.

comment:13 by eelco, 11 years ago

Analyzed by developer: unset
Reproduced by developer: unset
Resolution: fixed
Status: closedreopened
Version: unspecifiedgit-master

Uploaded 1163-utf8bug.rm to incoming.

ffprobe -print_format xml -show_format 1163-utf8bug.rm > 1163-utf8bug.xml
ffprobe version N-63665-g921d5ae Copyright (c) 2007-2014 the FFmpeg developers
  built on Jun  1 2014 12:14:19 with Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
  configuration: --prefix=/Users/eelco/Projects/Beamer/FFmpeg/build --disable-shared --enable-static --disable-ffserver --disable-doc --enable-gpl --extra-cflags='-DMACOSX_DEPLOYMENT_TARGET=10.7 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk -mmacosx-version-min=10.7 -I/Users/eelco/Projects/Beamer/FFmpeg/build/include' --extra-ldflags='-isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk -mmacosx-version-min=10.7 -L/Users/eelco/Projects/Beamer/FFmpeg/build/lib'
  libavutil      52. 88.100 / 52. 88.100
  libavcodec     55. 66.100 / 55. 66.100
  libavformat    55. 42.100 / 55. 42.100
  libavdevice    55. 13.101 / 55. 13.101
  libavfilter     4.  5.100 /  4.  5.100
  libswscale      2.  6.100 /  2.  6.100
  libswresample   0. 19.100 /  0. 19.100
  libpostproc    52.  3.100 / 52.  3.100
[rm @ 0x7f9ae1015800] Failed to fully read block
Input #0, rm, from '/Users/eelco/Desktop/1163-utf8bug.rm':
  Metadata:
    Audiences       : BatchRealProducer 's Audience;
    audioMode       : music
    Creation Date   : 2/14/2013 1:40:55
    Description     : This RealMedia File Compressed by Batch Real Producer
    Generated By    : Helix Producer SDK 10.0 for Windows, Build 10.0.0.335
    Modification Date: 2/14/2013 1:40:55
    videoMode       : sharp
    Keywords        : 
    File ID         : 31e4c7ec-29b7-4ffd-8ea5-40f11991751a
    title           : tsks
    author          : ????????TSKS??2014??
    copyright       : ??????ʹTSKS???????tskscn.com??
    comment         : 
  Duration: 00:59:01.08, start: 0.000000, bitrate: 0 kb/s
    Stream #0:0: Audio: cook (cook / 0x6B6F6F63), 44100 Hz, stereo, fltp, 44 kb/s
    Stream #0:1: Video: rv40 (RV40 / 0x30345652), yuv420p, 624x352, 536 kb/s, 29.97 fps, 29.97 tbr, 1k tbn, 1k tbc
[xml @ 0x7f9ae1012e00] 10 invalid UTF-8 sequence(s) found in string '????????TSKS??2014??', replaced with '�'
[xml @ 0x7f9ae1012e00] 18 invalid UTF-8 sequence(s) found in string '??????ʹTSKS???????tskscn.com??', replaced with '�'
xmllint 1163-utf8bug.xml 
1163-utf8bug.xml:14: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC0 0xA1 0xEF 0xBF
        <tag key="author" value="�����??�TSKS��2014��"/>
                                                ^

by Carl Eugen Hoyos, 11 years ago

Attachment: 1163-utf8bug.rm added

in reply to:  13 ; comment:14 by Stefano Sabatini, 10 years ago

Replying to eelco:

Uploaded 1163-utf8bug.rm to incoming.

[...]

xmllint 1163-utf8bug.xml 
1163-utf8bug.xml:14: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xC0 0xA1 0xEF 0xBF
        <tag key="author" value="�����??�TSKS��2014��"/>
                                                ^

C0A1 is an overlong encoding for character ! (decimal number 33), using two bytes in place of one. Since this is not allowed we should probably fix the decoder to disallow it.

Last edited 10 years ago by Stefano Sabatini (previous) (diff)

in reply to:  14 comment:15 by Stefano Sabatini, 10 years ago

Analyzed by developer: set
Component: ffprobeavutil
Reproduced by developer: set
Resolution: fixed
Status: reopenedclosed

Replying to saste:
[...]

C0A1 is an overlong encoding for character ! (decimal number 33), using two bytes in place of one. Since this is not allowed we should probably fix the decoder to disallow it.

Should be fixed in:

commit d4ec07dfe7dbc86e8f6403781c511b9463a526d2
Author: Stefano Sabatini <stefasab@gmail.com>
Date:   Thu Aug 28 17:37:27 2014 +0200

    lavu/avstring: check for overlong encodings in av_utf8_decode()
    
    Fix reopened trac ticket #1163.

comment:16 by Stefano Sabatini, 10 years ago

Component: avutilffprobe
Note: See TracTickets for help on using tickets.