Video Processor
A Video Processor is a utility responsible for preparing input features for video models, as well as handling the post-processing of their outputs. It provides transformations such as resizing, normalization, and conversion into PyTorch tensors.
The video processor extends the functionality of image processors by allowing Vision Large Language Models (VLMs) to handle videos with a distinct set of arguments compared to images. It serves as the bridge between raw video data and the model, ensuring that input features are optimized for the VLM.
When adding a new VLM or updating an existing one to enable distinct video preprocessing, saving and reloading the processor configuration will store the video-related arguments in a dedicated file named video_preprocessing_config.json. Don't worry if you haven't updated your VLM; the processor will try to load video-related configurations from a file named preprocessing_config.json.
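For example, a minimal sketch of this behavior, assuming a model that has already been updated with a dedicated video processor (the local directory name below is arbitrary):

from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
# Saving writes the video-related arguments to video_preprocessing_config.json
# inside the target directory (the directory name here is illustrative).
processor.save_pretrained("./my_model_directory")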
Usage Example
Here’s an example of how to load a video processor with the llava-hf/llava-onevision-qwen2-0.5b-ov-hf model:
from transformers import AutoVideoProcessor
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
Currently, if using the base image processor for videos, it processes video data by treating each frame as an individual image and applying transformations frame-by-frame. While functional, this approach is not highly efficient. Using AutoVideoProcessor allows us to take advantage of fast video processors, leveraging the torchvision library. Fast processors handle the whole batch of videos at once, without iterating over each video or frame. These updates introduce GPU acceleration and significantly enhance processing speed, especially for tasks requiring high throughput.
Fast video processors are available for all models and are loaded by default when an AutoVideoProcessor is initialized. When using a fast video processor, you can also set the device argument to specify the device on which the processing should be done. By default, the processing is done on the same device as the inputs if the inputs are tensors, or on the CPU otherwise. For an even greater speed improvement, we can compile the processor when using "cuda" as the device.
import torch
from transformers.video_utils import load_video
from transformers import AutoVideoProcessor
video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda")
processor = torch.compile(processor)
processed_video = processor(video, return_tensors="pt")
BaseVideoProcessor
class transformers.BaseVideoProcessor
( **kwargs: typing_extensions.Unpack[transformers.processing_utils.VideosKwargs] )
Parameters
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the video's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
- size (`dict`, optional, defaults to `self.size`) — Size of the output video after resizing. Can be overridden by the `size` parameter in the `preprocess` method.
- size_divisor (`int`, optional, defaults to `self.size_divisor`) — The size by which to make sure both the height and width can be divided.
- default_to_square (`bool`, optional, defaults to `self.default_to_square`) — Whether to default to a square video when resizing, if `size` is an int.
- resample (`PILImageResampling`, optional, defaults to `self.resample`) — Resampling filter to use if resizing the video. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method.
- do_center_crop (`bool`, optional, defaults to `self.do_center_crop`) — Whether to center crop the video to the specified `crop_size`. Can be overridden by `do_center_crop` in the `preprocess` method.
- do_pad (`bool`, optional) — Whether to pad the video to the `(max_height, max_width)` of the videos in the batch.
- crop_size (`Dict[str, int]`, optional, defaults to `self.crop_size`) — Size of the output video after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the video by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `self.rescale_factor`) — Scale factor to use if rescaling the video. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the video. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Mean to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Standard deviation to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_std` parameter in the `preprocess` method.
- do_convert_rgb (`bool`, optional, defaults to `self.do_convert_rgb`) — Whether to convert the video to RGB.
- return_tensors (`str` or `TensorType`, optional) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input video.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input video. If unset, the channel dimension format is inferred from the input video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: video in (height, width) format.
- device (`torch.device`, optional) — The device to process the videos on. If unset, the device is inferred from the input videos.
Constructs a base VideoProcessor.
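As an illustration, these keyword arguments can be passed when loading a derived class. A minimal sketch, assuming LlavaOnevisionVideoProcessor and an arbitrary target size:

from transformers import LlavaOnevisionVideoProcessor

# Stored defaults can be overridden with the keyword arguments above;
# the size values here are illustrative, not the model's defaults.
processor = LlavaOnevisionVideoProcessor.from_pretrained(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    do_resize=True,
    size={"height": 224, "width": 224},
)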
convert_to_rgb
( video: torch.Tensor ) → torch.Tensor
Converts a video to RGB format.
fetch_videos

Convert a single url or a list of urls into the corresponding np.array objects.

If a single url is passed, the return value will be a single object. If a list is passed, a list of objects is returned.
from_dict
( video_processor_dict: typing.Dict[str, typing.Any] **kwargs ) → ~video_processing_utils.VideoProcessorBase
Parameters
- video_processor_dict (`Dict[str, Any]`) — Dictionary that will be used to instantiate the video processor object. Such a dictionary can be retrieved from a pretrained checkpoint by leveraging the `~video_processing_utils.VideoProcessorBase.to_dict` method.
- kwargs (`Dict[str, Any]`) — Additional parameters from which to initialize the video processor object.
Returns
~video_processing_utils.VideoProcessorBase
The video processor object instantiated from those parameters.
Instantiates a type of `~video_processing_utils.VideoProcessorBase` from a Python dictionary of parameters.
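A round trip through to_dict and from_dict might look like the following sketch; the do_normalize override is only an example:

from transformers import LlavaOnevisionVideoProcessor

processor = LlavaOnevisionVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Serialize the processor to a plain dictionary, then rebuild it,
# overriding one attribute through kwargs.
video_processor_dict = processor.to_dict()
restored = LlavaOnevisionVideoProcessor.from_dict(video_processor_dict, do_normalize=False)
assert restored.do_normalize is False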
from_json_file
( json_file: typing.Union[str, os.PathLike] ) → A video processor of type `~video_processing_utils.VideoProcessorBase`

Instantiates a video processor of type `~video_processing_utils.VideoProcessorBase` from the path to a JSON file of parameters.
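For instance, assuming the processor was previously saved with save_pretrained (the path and file name below are illustrative):

from transformers import LlavaOnevisionVideoProcessor

# Load directly from the JSON configuration file written by save_pretrained.
processor = LlavaOnevisionVideoProcessor.from_json_file(
    "./saved_processor/video_preprocessing_config.json"
)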
from_pretrained
( pretrained_model_name_or_path: typing.Union[str, os.PathLike] cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[str, bool, NoneType] = None revision: str = 'main' **kwargs )
Parameters
- pretrained_model_name_or_path (`str` or `os.PathLike`) — This can be either:
  - a string, the model id of a pretrained video processor hosted inside a model repo on huggingface.co.
  - a path to a directory containing a video processor file saved using the `~video_processing_utils.VideoProcessorBase.save_pretrained` method, e.g., `./my_model_directory/`.
  - a path or url to a saved video processor JSON file, e.g., `./my_model_directory/preprocessor_config.json`.
- cache_dir (`str` or `os.PathLike`, optional) — Path to a directory in which a downloaded pretrained model video processor should be cached if the standard cache should not be used.
- force_download (`bool`, optional, defaults to `False`) — Whether or not to force (re-)downloading the video processor files and override the cached versions if they exist.
- resume_download — Deprecated and ignored. All downloads are now resumed by default when possible. Will be removed in v5 of Transformers.
- proxies (`Dict[str, str]`, optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., `{'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request.
- token (`str` or `bool`, optional) — The token to use as HTTP bearer authorization for remote files. If `True`, or not specified, will use the token generated when running `huggingface-cli login` (stored in `~/.huggingface`).
- revision (`str`, optional, defaults to `"main"`) — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git.
Instantiate a type of `~video_processing_utils.VideoProcessorBase` from a video processor.
Examples:
# We can't instantiate directly the base class *VideoProcessorBase* so let's show the examples on a
# derived class: *LlavaOnevisionVideoProcessor*
from transformers import LlavaOnevisionVideoProcessor

video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
) # Download video_processing_config from huggingface.co and cache.
video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"./test/saved_model/"
) # E.g. video processor (or model) was saved using *save_pretrained('./test/saved_model/')*
video_processor = LlavaOnevisionVideoProcessor.from_pretrained("./test/saved_model/preprocessor_config.json")
video_processor = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf", do_normalize=False, foo=False
)
assert video_processor.do_normalize is False
video_processor, unused_kwargs = LlavaOnevisionVideoProcessor.from_pretrained(
"llava-hf/llava-onevision-qwen2-0.5b-ov-hf", do_normalize=False, foo=False, return_unused_kwargs=True
)
assert video_processor.do_normalize is False
assert unused_kwargs == {"foo": False}
get_video_processor_dict
( pretrained_model_name_or_path: typing.Union[str, os.PathLike] **kwargs ) → Tuple[Dict, Dict]
Parameters
- pretrained_model_name_or_path (`str` or `os.PathLike`) — The identifier of the pre-trained checkpoint from which we want the dictionary of parameters.
- subfolder (`str`, optional, defaults to `""`) — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
Returns
Tuple[Dict, Dict]
The dictionary(ies) that will be used to instantiate the video processor object.
From a `pretrained_model_name_or_path`, resolve to a dictionary of parameters to be used for instantiating a video processor of type `~video_processing_utils.VideoProcessorBase` using `from_dict`.
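A sketch of resolving the configuration dictionary by hand and then instantiating the processor from it, assuming the same checkpoint as above:

from transformers import LlavaOnevisionVideoProcessor

# Resolve the raw parameter dictionary (and any remaining kwargs) without
# instantiating the processor, then build it explicitly with from_dict.
video_processor_dict, kwargs = LlavaOnevisionVideoProcessor.get_video_processor_dict(
    "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
)
processor = LlavaOnevisionVideoProcessor.from_dict(video_processor_dict, **kwargs)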
preprocess
( videos: typing.Union[typing.List[ForwardRef('PIL.Image.Image')], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), typing.List[ForwardRef('np.ndarray')], typing.List[ForwardRef('torch.Tensor')], typing.List[typing.List[ForwardRef('PIL.Image.Image')]], typing.List[typing.List[ForwardRef('np.ndarray')]], typing.List[typing.List[ForwardRef('torch.Tensor')]]] **kwargs: typing_extensions.Unpack[transformers.processing_utils.VideosKwargs] )
Parameters
- do_resize (`bool`, optional, defaults to `self.do_resize`) — Whether to resize the video's (height, width) dimensions to the specified `size`. Can be overridden by the `do_resize` parameter in the `preprocess` method.
- size (`dict`, optional, defaults to `self.size`) — Size of the output video after resizing. Can be overridden by the `size` parameter in the `preprocess` method.
- size_divisor (`int`, optional, defaults to `self.size_divisor`) — The size by which to make sure both the height and width can be divided.
- default_to_square (`bool`, optional, defaults to `self.default_to_square`) — Whether to default to a square video when resizing, if `size` is an int.
- resample (`PILImageResampling`, optional, defaults to `self.resample`) — Resampling filter to use if resizing the video. Only has an effect if `do_resize` is set to `True`. Can be overridden by the `resample` parameter in the `preprocess` method.
- do_center_crop (`bool`, optional, defaults to `self.do_center_crop`) — Whether to center crop the video to the specified `crop_size`. Can be overridden by `do_center_crop` in the `preprocess` method.
- do_pad (`bool`, optional) — Whether to pad the video to the `(max_height, max_width)` of the videos in the batch.
- crop_size (`Dict[str, int]`, optional, defaults to `self.crop_size`) — Size of the output video after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess` method.
- do_rescale (`bool`, optional, defaults to `self.do_rescale`) — Whether to rescale the video by the specified scale `rescale_factor`. Can be overridden by the `do_rescale` parameter in the `preprocess` method.
- rescale_factor (`int` or `float`, optional, defaults to `self.rescale_factor`) — Scale factor to use if rescaling the video. Only has an effect if `do_rescale` is set to `True`. Can be overridden by the `rescale_factor` parameter in the `preprocess` method.
- do_normalize (`bool`, optional, defaults to `self.do_normalize`) — Whether to normalize the video. Can be overridden by the `do_normalize` parameter in the `preprocess` method.
- image_mean (`float` or `List[float]`, optional, defaults to `self.image_mean`) — Mean to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_mean` parameter in the `preprocess` method.
- image_std (`float` or `List[float]`, optional, defaults to `self.image_std`) — Standard deviation to use if normalizing the video. This is a float or list of floats the length of the number of channels in the video. Can be overridden by the `image_std` parameter in the `preprocess` method.
- do_convert_rgb (`bool`, optional, defaults to `self.do_convert_rgb`) — Whether to convert the video to RGB.
- return_tensors (`str` or `TensorType`, optional) — Returns stacked tensors if set to `"pt"`, otherwise returns a list of tensors.
- data_format (`ChannelDimension` or `str`, optional, defaults to `ChannelDimension.FIRST`) — The channel dimension format for the output video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - Unset: Use the channel dimension format of the input video.
- input_data_format (`ChannelDimension` or `str`, optional) — The channel dimension format for the input video. If unset, the channel dimension format is inferred from the input video. Can be one of:
  - `"channels_first"` or `ChannelDimension.FIRST`: video in (num_channels, height, width) format.
  - `"channels_last"` or `ChannelDimension.LAST`: video in (height, width, num_channels) format.
  - `"none"` or `ChannelDimension.NONE`: video in (height, width) format.
- device (`torch.device`, optional) — The device to process the videos on. If unset, the device is inferred from the input videos.
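Per-call overrides take precedence over the processor's stored defaults. A minimal sketch, assuming a local clip named video.mp4:

from transformers import AutoVideoProcessor
from transformers.video_utils import load_video

video = load_video("video.mp4")
processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Disable normalization for this call only and return stacked tensors;
# the exact output keys depend on the model's video processor.
outputs = processor.preprocess(video, do_normalize=False, return_tensors="pt")
print(outputs.keys())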
register_for_auto_class
( auto_class = 'AutoVideoProcessor' )
Register this class with a given auto class. This should only be used for custom video processors, as the ones in the library are already mapped with AutoVideoProcessor.
This API is experimental and may have some slight breaking changes in the next releases.
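A sketch of registering a custom processor, where MyVideoProcessor is a hypothetical subclass:

from transformers import BaseVideoProcessor

# Hypothetical custom video processor; a real implementation would
# override the relevant preprocessing defaults and methods.
class MyVideoProcessor(BaseVideoProcessor):
    pass

MyVideoProcessor.register_for_auto_class("AutoVideoProcessor")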
save_pretrained
( save_directory: typing.Union[str, os.PathLike] push_to_hub: bool = False **kwargs )
Parameters
- save_directory (`str` or `os.PathLike`) — Directory where the video processor JSON file will be saved (will be created if it does not exist).
- push_to_hub (`bool`, optional, defaults to `False`) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with `repo_id` (will default to the name of `save_directory` in your namespace).
- kwargs (`Dict[str, Any]`, optional) — Additional keyword arguments passed along to the push_to_hub() method.
Save a video processor object to the directory `save_directory`, so that it can be re-loaded using the `~video_processing_utils.VideoProcessorBase.from_pretrained` class method.
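For example, saving locally and reloading; pass push_to_hub=True to also upload the configuration to the Hub (the directory name below is arbitrary):

from transformers import AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor.save_pretrained("./my_video_processor")

# The saved configuration can be reloaded with from_pretrained.
reloaded = AutoVideoProcessor.from_pretrained("./my_video_processor")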
to_dict
( ) → Dict[str, Any]
Returns
Dict[str, Any]
Dictionary of all the attributes that make up this video processor instance.
Serializes this instance to a Python dictionary.