Video Processor
A Video Processor is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch tensors.
The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named video_preprocessing_config.json. Don't worry if you haven't updated your VLM; the processor will try to load video-related configurations from a file named preprocessing_config.json.
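For example, a minimal sketch of this behavior, assuming a model that has already been updated with a dedicated video processor (the local directory name below is arbitrary):

from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
# Saving writes the video-related arguments to video_preprocessing_config.json
# inside the target directory (the directory name here is illustrative).
processor.save_pretrained("./my_model_directory")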
Usage Example
Here’s an example of how to load a video processor with the llava-hf/llava-onevision-qwen2-0.5b-ov-hf model:
from transformers import AutoVideoProcessor
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
Currently, if using the base image processor for videos, it processes video data by treating each frame as an individual image and applying transformations frame-by-frame. While functional, this approach is not highly efficient. Using AutoVideoProcessor allows us to take advantage of fast video processors, leveraging the torchvision library. Fast processors handle the whole batch of videos at once, without iterating over each video or frame. These updates introduce GPU acceleration and significantly enhance processing speed, especially for tasks requiring high throughput.
Fast video processors are available for all models and are loaded by default when an AutoVideoProcessor is initialized. When using a fast video processor, you can also set the device argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. For an even greater speed improvement, we can compile the processor when using "cuda" as the device.
import torch
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor
video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
processor = torch.compile(processor)
processed_video = processor(video, return_tensors="pt")
BaseVideoProcessor
class transformers.BaseVideoProcessor
( **kwargs: typing_extensions.Unpack[transformers.processing_utils.VideosKwargs] )
Parameters
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the video's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
- size (`dict`, optional, defaults to `self.size`) — Size of the output video after resizing. Can be overridden by the `size` parameter in the `preprocess` method.
- size_divisor (`int`, optional, defaults to `self.size_divisor`) — The size by which to make sure both the height and width can be divided.
- default_to_square (`bool`, optional, defaults to `self.default_to_square`) — Whether to default to a square video when resizing, if `size` is an int.
- resample (`PILImageResampling`, optional, defaults to `self.resample`) — Resampling filter to use if resizing the video. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method.
- do_center_crop (`bool`, optional, defaults to `self.do_center_crop`) — Whether to center crop the video to the specified `crop_size`. Can be overridden by `do_center_crop` in the `preprocess` method.
- do_pad (`bool`, optional) — Whether to pad the video to the `(max_height, max_width)` of the videos in the batch.
- crop_size (`Dict[str, int]`, optional, defaults to `self.crop_size`) — Size of the output video after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the video by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `self.rescale_factor`) — Scale factor to use if rescaling the video. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the video. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Mean to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Standard deviation to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_std` parameter in the `preprocess` method.
- do_convert_rgb (`bool`, optional, defaults to `self.do_convert_rgb`) — Whether to convert the video to RGB.
- return_tensors (`str` or `TensorType`, optional) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input video.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input video. If unset, the channel dimension format is inferred from the input video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: video in (height, width) format.
- device (`torch.device`, optional) — The device to process the videos on. If unset, the device is inferred from the input videos.
Constructs a base VideoProcessor.
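As an illustration, these keyword arguments can be passed when loading a derived class. A minimal sketch, assuming LlavaOnevisionVideoProcessor and an arbitrary target size:

from transformers import LlavaOnevisionVideoProcessor

# Stored defaults can be overridden with the keyword arguments above;
# the size values here are illustrative, not the model's defaults.
processor = LlavaOnevisionVideoProcessor.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    do_resize=True,
    size={"height": 224, "width": 224},
)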
convert_to_rgb
( video: torch.Tensor ) → torch.Tensor
Converts a video to RGB format.
fetch_videos

Convert a single url or a list of urls into the corresponding np.array objects.

If a single url is passed, the return value will be a single object. If a list is passed, a list of objects is returned.
from_dict
( video_processor_dict: typing.Dict[str, typing.Any] **kwargs ) → ~video_processing_utils.VideoProcessorBase
Parameters
- video_processor_dict (`Dict[str, Any]`) — Dictionary that will be used to instantiate the video processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `~video_processing_utils.VideoProcessorBase.to_dict` method.
- kwargs (`Dict[str, Any]`) — Additional parameters from which to initialize the video processor object.
Returns
~video_processing_utils.VideoProcessorBase
The video processor object instantiated from those parameters.
Instantiates a type of `~video_processing_utils.VideoProcessorBase` from a Python dictionary of parameters.
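A round trip through to_dict and from_dict might look like the following sketch; the do_normalize override is only an example:

from transformers import LlavaOnevisionVideoProcessor

processor = LlavaOnevisionVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Serialize the processor to a plain dictionary, then rebuild it,
# overriding one attribute through kwargs.
video_processor_dict = processor.to_dict()
restored = LlavaOnevisionVideoProcessor.from_dict(video_processor_dict, do_normalize=False)
assert restored.do_normalize is False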
from_json_file
( json_file: typing.Union[str, os.PathLike] ) → A video processor of type `~video_processing_utils.VideoProcessorBase`

Instantiates a video processor of type `~video_processing_utils.VideoProcessorBase` from the path to a JSON file of parameters.
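For instance, assuming the processor was previously saved with save_pretrained (the path and file name below are illustrative):

from transformers import LlavaOnevisionVideoProcessor

# Load directly from the JSON configuration file written by save_pretrained.
processor = LlavaOnevisionVideoProcessor.from_json_file(
    "./saved_processor/video_preprocessing_config.json"
)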
from_pretrained
( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[str, bool, NoneType] = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (`str` or `os.PathLike`) — This can be either:
  - a string, the model id of a pretrained video processor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a video processor file saved using the `~video_processing_utils.VideoProcessorBase.save_pretrained` method, e.g., `./my_model_directory/`.
  - a path or url to a saved video processor JSON file, e.g., `./my_model_directory/preprocessor_config.json`.
- cache_dir (`str` or `os.PathLike`, optional) — Path to a directory in which a downloaded pretrained model video processor should be cached if the standard cache should not be used.
- force_download (`bool`, optional, defaults to `False`) — Whether or not to force (re-)downloading the video processor files and override the cached versions if they exist.
- resume_download — Deprecated and ignored. All downloads are now resumed by default when possible. Will be removed in v5 of Transformers.
- proxies (`Dict[str, str]`, optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
- token (`str` or `bool`, optional) — The token to use as HTTP bearer authorization for remote files. If `True`, or not specified, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`).
- revision (`str`, optional, defaults to `"main"`) — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git.
Instantiate a type of `~video_processing_utils.VideoProcessorBase` from a video processor.
Examples:
# We can't instantiate directly the base class *VideoProcessorBase* so let's show the examples on a
# derived class: *LlavaOnevisionVideoProcessor*
from transformers import LlavaOnevisionVideoProcessor

video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
) # Download video_processing_config from huggingface.co and cache.
video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"./test/saved_model/"
) # E.g. video processor (or model) was saved using *save_pretrained('./test/saved_model/')*
video_processor = LlavaOnevisionVideoProcessor.from_pretrained("./test/saved_model/preprocessor_config.json")
video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf", do_normalize=False, foo=False
)
assert video_processor.do_normalize is False
video_processor, unused_kwargs = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf", do_normalize=False, foo=False, return_unused_kwargs=True
)
assert video_processor.do_normalize is False
assert unused_kwargs == {"foo": False}
get_video_processor_dict
( pretrained_model_name_or_path: typing.Union[str, os.PathLike] **kwargs ) → Tuple[Dict, Dict]
Parameters
- pretrained_model_name_or_path (`str` or `os.PathLike`) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.
- subfolder (`str`, optional, defaults to `""`) — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
Returns
Tuple[Dict, Dict]
The dictionary(ies) that will be used to instantiate the video processor object.
From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters to be used for instantiating a video processor of type `~video_processing_utils.VideoProcessorBase` using `from_dict`.
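A sketch of resolving the configuration dictionary by hand and then instantiating the processor from it, assuming the same checkpoint as above:

from transformers import LlavaOnevisionVideoProcessor

# Resolve the raw parameter dictionary (and any remaining kwargs) without
# instantiating the processor, then build it explicitly with from_dict.
video_processor_dict, kwargs = LlavaOnevisionVideoProcessor.get_video_processor_dict(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
)
processor = LlavaOnevisionVideoProcessor.from_dict(video_processor_dict, **kwargs)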
preprocess
( videos: typing.Union[typing.List[ForwardRef('PIL.Image.Image')], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), typing.List[ForwardRef('np.ndarray')], typing.List[ForwardRef('torch.Tensor')], typing.List[typing.List[ForwardRef('PIL.Image.Image')]], typing.List[typing.List[ForwardRef('np.ndarray')]], typing.List[typing.List[ForwardRef('torch.Tensor')]]] **kwargs: typing_extensions.Unpack[transformers.processing_utils.VideosKwargs] )
Parameters
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the video's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
- size (`dict`, optional, defaults to `self.size`) — Size of the output video after resizing. Can be overridden by the `size` parameter in the `preprocess` method.
- size_divisor (`int`, optional, defaults to `self.size_divisor`) — The size by which to make sure both the height and width can be divided.
- default_to_square (`bool`, optional, defaults to `self.default_to_square`) — Whether to default to a square video when resizing, if `size` is an int.
- resample (`PILImageResampling`, optional, defaults to `self.resample`) — Resampling filter to use if resizing the video. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method.
- do_center_crop (`bool`, optional, defaults to `self.do_center_crop`) — Whether to center crop the video to the specified `crop_size`. Can be overridden by `do_center_crop` in the `preprocess` method.
- do_pad (`bool`, optional) — Whether to pad the video to the `(max_height, max_width)` of the videos in the batch.
- crop_size (`Dict[str, int]`, optional, defaults to `self.crop_size`) — Size of the output video after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the video by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `self.rescale_factor`) — Scale factor to use if rescaling the video. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the video. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Mean to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Standard deviation to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_std` parameter in the `preprocess` method.
- do_convert_rgb (`bool`, optional, defaults to `self.do_convert_rgb`) — Whether to convert the video to RGB.
- return_tensors (`str` or `TensorType`, optional) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input video.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input video. If unset, the channel dimension format is inferred from the input video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: video in (height, width) format.
- device (`torch.device`, optional) — The device to process the videos on. If unset, the device is inferred from the input videos.
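Per-call overrides take precedence over the processor's stored defaults. A minimal sketch, assuming a local clip named video.mp4:

from transformers import AutoVideoProcessor
from transformers.video_utils import load_video

video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Disable normalization for this call only and return stacked tensors;
# the exact output keys depend on the model's video processor.
outputs = processor.preprocess(video, do_normalize=False, return_tensors="pt")
print(outputs.keys())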
register_for_auto_class
( auto_class = 'AutoVideoProcessor' )
Register this class with a given auto class. This should only be used for custom video processors, as the ones in the library are already mapped with AutoVideoProcessor.
This API is experimental and may have some slight breaking changes in the next releases.
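A sketch of registering a custom processor, where MyVideoProcessor is a hypothetical subclass:

from transformers import BaseVideoProcessor

# Hypothetical custom video processor; a real implementation would
# override the relevant preprocessing defaults and methods.
class MyVideoProcessor(BaseVideoProcessor):
    pass

MyVideoProcessor.register_for_auto_class("AutoVideoProcessor")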
save_pretrained
( save_directory: typing.Union[str, os.PathLike] push_to_hub: bool = False **kwargs )
Parameters
- save_directory (`str` or `os.PathLike`) — Directory where the video processor JSON file will be saved (will be created if it does not exist).
- push_to_hub (`bool`, optional, defaults to `False`) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace).
- kwargs (`Dict[str, Any]`, optional) — Additional keyword arguments passed along to the push_to_hub() method.
Save a video processor object to the directory `save_directory`, so that it can be re-loaded using the `~video_processing_utils.VideoProcessorBase.from_pretrained` class method.
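For example, saving locally and reloading; pass push_to_hub=True to also upload the configuration to the Hub (the directory name below is arbitrary):

from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor.save_pretrained("./my_video_processor")

# The saved configuration can be reloaded with from_pretrained.
reloaded = AutoVideoProcessor.from_pretrained("./my_video_processor")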
to_dict
( ) → Dict[str, Any]
Returns
Dict[str, Any]
Dictionary of all the attributes that make up this video processor instance.
Serializes this instance to a Python dictionary.