diff --git a/LICENSE b/LICENSE
index 8ddfc99..3fd0044 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,4 +1,4 @@
-Copyright 2020-2021 by EPFL/VITA. All rights reserved.
+Copyright 2018-2021 by EPFL/VITA. All rights reserved.
 This project and all its files are licensed under GNU AGPLv3 or later version.
diff --git a/README.md b/README.md
index 018f92e..f939a29 100644
--- a/README.md
+++ b/README.md
@@ -355,7 +355,7 @@ When using this library in your research, we will be happy if you cite us!
 @InProceedings{bertoni_2021_icra,
 author = {Bertoni, Lorenzo and Kreiss, Sven and Mordan, Taylor and Alahi, Alexandre},
 title = {MonStereo: When Monocular and Stereo Meet at the Tail of 3D Human Localization},
-booktitle = {International Conference on Robotics and Automation},
+booktitle = {International Conference on Robotics and Automation (ICRA)},
 year = {2021}
 }
 ```
diff --git a/docs/MonStereo.md b/docs/MonStereo.md
deleted file mode 100644
index 48f1ddf..0000000
--- a/docs/MonStereo.md
+++ /dev/null
@@ -1,151 +0,0 @@

# MonStereo

> Monocular and stereo vision are cost-effective solutions for 3D human localization in the context of self-driving cars or social robots. However, they are usually developed independently and have their respective strengths and limitations. We propose a novel unified learning framework that leverages the strengths of both monocular and stereo cues for 3D human localization. Our method jointly (i) associates humans in left-right images, (ii) deals with occluded and distant cases in stereo settings by relying on the robustness of monocular cues, and (iii) tackles the intrinsic ambiguity of monocular perspective projection by exploiting prior knowledge of the human height distribution. We achieve state-of-the-art quantitative results for the 3D localization task on the KITTI dataset and estimate confidence intervals that account for challenging instances. We show qualitative examples for long-tail challenges such as occluded, far-away, and child instances.

```
@InProceedings{bertoni_monstereo,
author = {Bertoni, Lorenzo and Kreiss, Sven and Mordan, Taylor and Alahi, Alexandre},
title = {MonStereo: When Monocular and Stereo Meet at the Tail of 3D Human Localization},
booktitle = {arXiv:2008.10913},
month = {August},
year = {2020}
}
```

# Prediction
The predict script receives an image (or an entire folder, using glob expressions), calls PifPaf for 2D human pose detection over the image, and runs MonStereo to localize the detected poses in 3D.

Output options include json files and/or visualization of the predictions on the image in *frontal mode*, *birds-eye-view mode* or *multi mode*, and can be specified with `--output_types`.

### Pre-trained Models
* Download the MonStereo pre-trained model from [Google Drive](https://drive.google.com/drive/folders/1jZToVMBEZQMdLB5BAIq2CdCLP5kzNo9t?usp=sharing) and save it in `data/models` (default), or save it in any folder and point to it with the command-line option `--model`.
* The PifPaf pre-trained model is downloaded automatically at the first run. Three standard pre-trained models are available through the command-line options `--checkpoint resnet50`, `--checkpoint resnet101` and `--checkpoint resnet152`. Alternatively, you can download a PifPaf pre-trained model from [openpifpaf](https://github.com/vita-epfl/openpifpaf) and call it with `--checkpoint`. All experiments have been run with v0.8 of pifpaf. If you'd like to use an updated version, we suggest re-training the MonStereo model as well.
* The model used for the experiments is provided in *data/models/ms-200710-1511.pkl*.

### Ground truth matching
* If you provide a ground-truth json file to compare against the predictions of MonStereo, the script matches every detection to the ground truth with the Intersection over Union (IoU) metric (a minimal sketch of the idea is given below). The ground-truth file can be generated with the subparser `prep` and passed with the option `--path_gt`. As this step requires running the pose detector over all the training images and saving the annotations, we provide the resulting json file for the category *pedestrian* on [Google Drive](https://drive.google.com/file/d/1e-wXTO460ip_Je2NdXojxrOrJ-Oirlgh/view?usp=sharing); save it into `data/arrays`.
* If the ground-truth json file is not available, the option `--show_all` shows all the predictions for the image.
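For reference, the matching step boils down to standard bounding-box IoU. The snippet below is only a minimal sketch of the idea: the `[x1, y1, x2, y2]` box format, the greedy assignment and the `min_iou` threshold are illustrative assumptions, not the exact logic of the repository.

```
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, gt_boxes, min_iou=0.3):
    """Greedily assign each detection to the best-overlapping, unused ground-truth box."""
    matches, used = [], set()
    for i, det in enumerate(detections):
        candidates = [(box_iou(det, gt), j) for j, gt in enumerate(gt_boxes) if j not in used]
        if not candidates:
            continue
        best_iou, best_j = max(candidates)
        if best_iou >= min_iou:
            matches.append((i, best_j, best_iou))
            used.add(best_j)
    return matches
```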
After downloading the model and the ground-truth file, a demo can be tested with the following commands:

`python3 -m monstereo.run predict --glob docs/000840*.png --output_types multi --scale 2 --model data/models/ms-200710-1511.pkl --z_max 30 --checkpoint resnet152 --path_gt data/arrays/names-kitti-200615-1022.json -o data/output`

![Crowded scene](out_000840.jpg)

`python3 -m monstereo.run predict --glob docs/005523*.png --output_types multi --scale 2 --model data/models/ms-200710-1511.pkl --z_max 30 --checkpoint resnet152 --path_gt data/arrays/names-kitti-200615-1022.json -o data/output`

![Occluded hard example](out_005523.jpg)

# Preprocessing
Preprocessing and training are fully supported by the provided code, but they first require running a pose detector over all the training images and collecting the annotations. The code supports this option (run the predict script with `--mode pifpaf`).

### Data structure

    Data
    ├── arrays
    ├── models
    ├── kitti
    ├── logs
    ├── output

Run the following to create the folders:
```
mkdir data
cd data
mkdir arrays models kitti logs output
```

### Datasets
Download the KITTI ground-truth files and camera calibration matrices for training from [here](http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d) and save them into `data/kitti/gt` and `data/kitti/calib`, respectively. To extract the pifpaf joints, you also need to download the training images and soft-link the folder to `data/kitti/images`.

### Annotations to preprocess
MonStereo is trained using 2D human pose joints. To obtain them, the first step is to run pifpaf over the KITTI training images, either by running the predict script with `--mode pifpaf` or by using the pifpaf code directly. The MonStereo preprocessing script expects annotations from the left and right images in two different folders with the same path, apart from the suffix `_right` for the "right" folder, for example `data/annotations` and `data/annotations_right`. Do not change the names of the json files created by pifpaf: for each left annotation, the code looks for the corresponding right annotation (the sketch below shows the expected pairing).
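As a minimal sketch of that folder convention (the json file name below is only a placeholder; keep whatever names pifpaf produces):

```
from pathlib import Path

def right_annotation_path(left_path):
    """Derive the expected right-image annotation from a left-image one,
    assuming the '<folder>' / '<folder>_right' convention described above."""
    left_path = Path(left_path)
    right_dir = left_path.parent.with_name(left_path.parent.name + "_right")
    return right_dir / left_path.name

# e.g. data/annotations/000840.png.pifpaf.json
#  ->  data/annotations_right/000840.png.pifpaf.json   (placeholder file name)
```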
### Input joints for training
MonStereo is trained using 2D human pose joints matched with the ground-truth locations provided by the KITTI dataset. To create the joints, run `python3 -m monstereo.run prep`, specifying:

`--dir_ann`, the annotation directory containing the pifpaf joints of KITTI for the left images.

### Ground truth file for evaluation
The preprocessing script also outputs a second json file, called **names-.json**, which provides a dictionary indexed by the image name, to easily access the ground-truth files for evaluation and prediction purposes.

# Training
Provide the json file containing the preprocessed joints as an argument. It is as simple as `python3 -m monstereo.run train --joints `. All the hyperparameter options can be checked with `python3 -m monstereo.run train --help`.

# Evaluation (KITTI Dataset)
### Average Localization Metric (ALE)
We provide evaluation on KITTI in the eval section. Txt files for MonStereo are generated with the command:

`python -m monstereo.run eval --dir_ann --model data/models/ms-200710-1511.pkl --generate`

### Relative Average Precision Localization (RALP-5%)
We modified the original C++ evaluation of KITTI to make it relative to distance; we use **cmake**. To run the evaluation, first generate the txt files with:

`python -m monstereo.run eval --dir_ann --model data/models/ms-200710-1511.pkl --generate`

Then follow the instructions of this [repository](https://github.com/cguindel/eval_kitti) to prepare the folders accordingly (or follow the KITTI guidelines) and run the evaluation. The modified file is called *evaluate_object.cpp* and runs exactly as the original KITTI evaluation.
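For intuition only, the sketch below assumes that the localization error is measured on the predicted distance of matched instances and that the RALP 5% tolerance scales with the ground-truth distance; the authoritative implementation remains the modified *evaluate_object.cpp* mentioned above.

```
def average_localization_error(pred_dists, gt_dists):
    """Mean absolute distance error (in metres) over matched instances."""
    errors = [abs(p - g) for p, g in zip(pred_dists, gt_dists)]
    return sum(errors) / len(errors)

def within_relative_tolerance(pred_dist, gt_dist, ratio=0.05):
    """A 5% tolerance that grows with distance: 0.5 m at 10 m, 2 m at 40 m."""
    return abs(pred_dist - gt_dist) <= ratio * gt_dist
```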
diff --git a/docs/MonoLoco++.md b/docs/MonoLoco++.md
deleted file mode 100644
index fd52cec..0000000
--- a/docs/MonoLoco++.md
+++ /dev/null
@@ -1,230 +0,0 @@

# Perceiving Humans: from Monocular 3D Localization to Social Distancing

> Perceiving humans in the context of Intelligent Transportation Systems (ITS) often relies on multiple cameras or expensive LiDAR sensors. In this work, we present a new cost-effective vision-based method that perceives humans' locations in 3D and their body orientation from a single image. We address the challenges related to the ill-posed monocular 3D tasks by proposing a deep learning method that predicts confidence intervals in contrast to point estimates. Our neural network architecture estimates humans' 3D body locations and their orientation with a measure of uncertainty. Our vision-based system (i) is privacy-safe, (ii) works with any fixed or moving cameras, and (iii) does not rely on ground plane estimation. We demonstrate the performance of our method with respect to three applications: locating humans in 3D, detecting social interactions, and verifying the compliance of recent safety measures due to the COVID-19 outbreak. Indeed, we show that we can rethink the concept of "social distancing" as a form of social interaction in contrast to a simple location-based rule. We publicly share the source code towards an open science mission.

```
@InProceedings{bertoni_social,
author = {Bertoni, Lorenzo and Kreiss, Sven and Alahi, Alexandre},
title = {Perceiving Humans: from Monocular 3D Localization to Social Distancing},
booktitle = {arXiv:2009.00984},
month = {September},
year = {2020}
}
```
![social distancing](social_distancing.jpg)

## Predictions
For a quick setup, download a PifPaf and a MonoLoco++ model from [here](https://drive.google.com/drive/folders/1jZToVMBEZQMdLB5BAIq2CdCLP5kzNo9t?usp=sharing) and save them into `data/models`.

### 3D Localization
The predict script receives an image (or an entire folder, using glob expressions), calls PifPaf for 2D human pose detection over the image, and runs MonoLoco++ to localize the detected poses in 3D. The option `--net` selects whether to save the PifPaf outputs, the MonoLoco++ outputs or the MonStereo ones. You can check all the PifPaf commands at [openpifpaf](https://github.com/vita-epfl/openpifpaf).

Output options include json files and/or visualization of the predictions on the image in *frontal mode*, *birds-eye-view mode* or *combined mode*, and can be specified with `--output_types`.

Ground-truth KITTI files for comparing results can be downloaded from [here](https://drive.google.com/drive/folders/1jZToVMBEZQMdLB5BAIq2CdCLP5kzNo9t?usp=sharing) (the file called *names-kitti*) and should be saved into `data/arrays`. Ground-truth files can also be generated; more info in the preprocessing section.

For an example image, run the following command:

```
python -m monstereo.run predict \
docs/002282.png \
--net monoloco_pp \
--output_types multi \
--model data/models/monoloco_pp-201203-1424.pkl \
--path_gt data/arrays/names-kitti-200615-1022.json \
-o \
--long-edge \
--n_dropout <50 to include epistemic uncertainty, 0 otherwise>
```

![predict](out_002282.png.multi.jpg)

To show all the instances estimated by MonoLoco++, add the argument `--show_all` to the above command.

![predict_all](out_002282.png.multi_all.jpg)

It is also possible to run [openpifpaf](https://github.com/vita-epfl/openpifpaf) directly by specifying the network with the argument `--net pifpaf`. All the other pifpaf arguments are also supported and can be checked with `python -m monstereo.run predict --help`.

![predict_all](out_002282_pifpaf.jpg)

### Focal Length and Camera Parameters
Absolute distances are affected by the camera intrinsic parameters. When processing KITTI images, the network uses the intrinsic matrix provided with the dataset. In all the other cases, we use the parameters of the nuScenes cameras, with 1/1.8'' CMOS sensors of size 7.2 x 5.4 mm. The default focal length is 5.7 mm and this parameter can be modified with the argument `--focal`.
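As a back-of-the-envelope reference, a focal length in millimetres converts to pixels through the sensor width and the image resolution; the 1600-pixel image width below is only an illustrative assumption, not a value taken from the code.

```
def focal_length_px(focal_mm=5.7, sensor_width_mm=7.2, image_width_px=1600):
    """Pinhole conversion: f_px = f_mm * image_width_px / sensor_width_mm."""
    return focal_mm * image_width_px / sensor_width_mm

# 5.7 mm on a 7.2 mm-wide sensor at 1600 px width gives roughly 1267 px.
print(round(focal_length_px()))
```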
### Social Distancing
To visualize social distancing compliance, simply add the argument `--social_distance` to the predict command.

An example from the Collective Activity Dataset is provided below.

To visualize social distancing, run the command below:
```
python -m monstereo.run predict \
docs/frame0038.jpg \
--net monoloco_pp \
--social_distance \
--output_types front bird --show_all \
--model data/models/monoloco_pp-201203-1424.pkl -o
```

Threshold distance and radii (for F-formations) can be set using `--threshold-dist` and `--radii`, respectively.

For more info, run:

`python -m monstereo.run predict --help`

### Orientation and Bounding Box dimensions
MonoLoco++ also estimates orientation and box dimensions. Results are saved in a json file when using the command `--output_types json`. At the moment, the only visualization that includes orientation is the social distancing one.

## Preprocessing

### KITTI
Annotations from a pose detector need to be stored in a folder, for example by using [openpifpaf](https://github.com/vita-epfl/openpifpaf):
```
python -m openpifpaf.predict \
--glob "/*.png" \
--json-output \
--checkpoint=shufflenetv2k30 \
--instance-threshold=0.05 --seed-threshold 0.05 --force-complete-pose
```
Once this step is complete, run:

`python -m monstereo.run prep --dir_ann --monocular`

### Collective Activity Dataset
To evaluate on the [Collective Activity Dataset](http://vhosts.eecs.umich.edu/vision//activity-dataset.html) (without any training), we selected 6 scenes that contain people talking to each other. This allows for a balanced dataset, but any other configuration will work.

The expected structure for the dataset is the following:

    collective_activity
    ├── images
    ├── annotations

where the images and annotations inside follow the naming convention:

IMAGES: seq_frame.jpg
ANNOTATIONS: seq_annotations.txt

With respect to the original dataset, the images and annotations are moved to a single folder and the sequence is added to their names. One command to do this is:

`rename -v -n 's/frame/seq14_frame/' f*.jpg`

which, for example, renames all the jpg images in that folder by adding the sequence number (remove `-n` after checking that it works). A Python alternative is sketched at the end of this section.

Pifpaf annotations should also be saved in a single folder and can be created with:

```
python -m openpifpaf.predict \
--glob "data/collective_activity/images/*.jpg" \
--checkpoint=shufflenetv2k30 \
--instance-threshold=0.05 --seed-threshold 0.05 --force-complete-pose \
--json-output /data/lorenzo-data/annotations/collective_activity/v012
```

Finally, to evaluate activity using a MonoLoco++ model pre-trained on either nuScenes or KITTI:
```
python -m monstereo.run eval --activity \
--net monoloco_pp --dataset collective \
--model --dir_ann
```
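A minimal Python equivalent of the renaming step referenced above (the sequence id, the `frame` prefix and the folder path are taken from the example and are adjustable per sequence):

```
import os
from pathlib import Path

def prepend_sequence(folder, seq="seq14"):
    """Rename frame*.jpg to <seq>_frame*.jpg inside `folder`."""
    for path in sorted(Path(folder).glob("frame*.jpg")):
        target = path.with_name(f"{seq}_{path.name}")
        print(f"{path.name} -> {target.name}")
        os.rename(path, target)

# prepend_sequence("data/collective_activity/images", seq="seq14")
```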
## Training
We train on the KITTI or nuScenes dataset, specifying the path of the input joints.

Our results are obtained with:

`python -m monstereo.run train --lr 0.001 --joints data/arrays/joints-kitti-201202-1743.json --save --monocular`

For a more extensive list of available parameters, run:

`python -m monstereo.run train --help`

## Evaluation

### 3D Localization
We provide evaluation on KITTI for models trained on nuScenes or KITTI. We compare them with other monocular and stereo baselines:
[MonoLoco](https://github.com/vita-epfl/monoloco), [Mono3D](https://www.cs.toronto.edu/~urtasun/publications/chen_etal_cvpr16.pdf), [3DOP](https://xiaozhichen.github.io/papers/nips15chen.pdf), [MonoDepth](https://arxiv.org/abs/1609.03677), [MonoPSR](https://github.com/kujason/monopsr), [MonoDIS](https://research.mapillary.com/img/publications/MonoDIS.pdf) and our [Geometrical Baseline](monoloco/eval/geom_baseline.py).

* **Mono3D**: download the validation files from [here](http://3dimage.ee.tsinghua.edu.cn/cxz/mono3d) and save them into `data/kitti/m3d`
* **3DOP**: download the validation files from [here](https://xiaozhichen.github.io/) and save them into `data/kitti/3dop`
* **MonoDepth**: compute an average depth for every instance using the script [here](https://github.com/Parrotlife/pedestrianDepth-baseline/tree/master/MonoDepth-PyTorch) and save the results into `data/kitti/monodepth`
* **Geometrical Baseline**: a geometrical baseline comparison is provided

The average geometrical value for comparison can be obtained by running:
```
python -m monstereo.run eval \
--dir_ann \
--model \
--net monoloco_pp \
--generate
```

To also include the geometric baselines and MonoLoco, add the flag `--baselines`.

Adding the argument `save`, a few plots are also produced, including the 3D localization error as a function of distance.

### Activity Estimation (Talking)
Please follow the preprocessing steps for the Collective Activity Dataset and run pifpaf over the dataset images. Evaluation on this dataset is done with models trained on either KITTI or nuScenes. For optimal performance, we suggest the model trained on the nuScenes teaser (TODO add link).
```
python -m monstereo.run eval \
--activity \
--dataset collective \
--net monoloco_pp \
--model \
--dir_ann
```
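To make the idea of interaction-based (rather than purely distance-based) social distancing concrete, here is a toy pairwise rule built on the 3D positions and orientations that MonoLoco++ outputs. It is only an illustration under simple assumptions (a distance cut-off plus mutual facing, with an assumed yaw convention); it is not the F-formation criterion used in the paper or in the repository.

```
import math

def _wrap(angle):
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def maybe_interacting(p1, p2, max_dist=2.0, max_angle_deg=45.0):
    """Toy check: flag two people as a candidate interaction (e.g. talking)
    if they are within `max_dist` metres and roughly face each other.

    p1, p2: (x, z, yaw) with ground-plane positions in metres and yaw in
    radians (assumed convention: yaw 0 points along the z axis).
    """
    (x1, z1, yaw1), (x2, z2, yaw2) = p1, p2
    if math.hypot(x2 - x1, z2 - z1) > max_dist:
        return False
    bearing_12 = math.atan2(x2 - x1, z2 - z1)  # direction from person 1 to person 2
    bearing_21 = math.atan2(x1 - x2, z1 - z2)  # direction from person 2 to person 1
    facing_1 = abs(math.degrees(_wrap(bearing_12 - yaw1))) < max_angle_deg
    facing_2 = abs(math.degrees(_wrap(bearing_21 - yaw2))) < max_angle_deg
    return facing_1 and facing_2

# Two people 1.5 m apart and turned towards each other:
print(maybe_interacting((0.0, 5.0, 0.0), (0.0, 6.5, math.pi)))  # True
```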