Publications
Publications by category in reverse chronological order, generated by jekyll-scholar.
2025
- TMLR: Thoughts and Lessons on Using Visual Foundation Models for Manipulation. Chen, Ryan, Pang, Ziteng, and Stadie, Bradly C. Transactions on Machine Learning Research, 2025.
Training vision-based robotic systems from scratch is both computationally expensive and memory intensive. To mitigate these challenges, recent approaches forgo end-to-end training in favor of adopting visual representations from visual foundation models – large-scale models designed for broad task transferability. Recent years have seen numerous vision foundation models emerge, including several designed specifically for manipulation tasks. However, we still lack clear principles for what makes these models effective for robotics applications. To address this gap, we systematically evaluate vision foundation models to understand what makes them effective for offline robotic learning. We find that across eleven diverse vision encoders, a representation’s ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates. These findings appear to challenge conventional wisdom about holistic reconstruction-based pretraining and offer a new lens for understanding what makes vision representations effective for robotics.
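The correlation analysis described above can be illustrated with a short sketch: compute a rank correlation between a per-encoder probe score (e.g., edge-reconstruction quality) and that encoder's average manipulation success rate. All names and numbers below are placeholders, not the paper's actual data or code.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-encoder scores: one probe score and one average
# manipulation success rate per vision encoder (placeholder values).
edge_probe_score = np.array([0.61, 0.74, 0.55, 0.82, 0.69, 0.77])
success_rate = np.array([0.32, 0.48, 0.25, 0.57, 0.41, 0.52])

# Rank correlation between the probe and downstream task performance.
rho, pval = spearmanr(edge_probe_score, success_rate)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```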
- NPJ Digit. Med.: Expert-of-Experts Verification and Alignment (EVAL) framework for large language model safety in gastroenterology. Giuffrè, Mauro, You, Kisung, Pang, Ziteng, Kresevic, Simone, Chung, Sunny, Chen, Ryan, Ko, Youngmin, Chan, Colleen, Saarinen, Theo, Ajcevic, Milos, Crocè, Lory S., Garcia-Tsao, Guadalupe, Gralnek, Ian, Sung, Joseph J. Y., Barkun, Alan, Laine, Loren, Sekhon, Jasjeet, Stadie, Bradly, and Shung, Dennis L. NPJ Digit. Med., May 2025.
Large language models generate plausible text responses to medical questions, but inaccurate responses pose significant risks in medical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification and Alignment) to streamline this process and enhance LLM safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI’s GPT-3.5/4/4o/o1-preview, Anthropic’s Claude-3-Opus, Meta’s LLaMA-2 (7B/13B/70B), and Mistral AI’s Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81–0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and, through rejection sampling, significantly improved accuracy by 8.36% overall. EVAL offers scalable potential to assess accuracy for high-stakes medical decision-making.
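The rejection-sampling step the abstract mentions is, at its core, best-of-N selection with a reward model: sample several candidate responses and keep the one the reward model scores highest. The sketch below is a minimal illustration under that reading; `generate` and `reward_model` are hypothetical toy stand-ins, not the EVAL implementation.

```python
import random

def best_of_n(prompt, generate, reward_model, n=8):
    """Sample n candidate responses; keep the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy stand-ins so the sketch runs end to end.
responses = ["short answer", "a somewhat longer answer", "the longest, most detailed answer"]
generate = lambda prompt: random.choice(responses)  # placeholder for an LLM call
reward_model = lambda prompt, resp: len(resp)       # placeholder scorer
print(best_of_n("When is urgent endoscopy indicated for UGIB?", generate, reward_model))
```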
2022
- ICML: Scalable Bayesian Inference for Detection and Deblending in Astronomical Images. Hansen, Derek, Mendoza, Ismael, Liu, Runjing, Pang, Ziteng, Zhao, Zhe, Avestruz, Camille, and Regier, Jeffrey. ICML, May 2022.
We present a new probabilistic method for detecting, deblending, and cataloging astronomical objects called the Bayesian Light Source Separator (BLISS). BLISS is based on deep generative models, which embed neural networks within a Bayesian model. For posterior inference, BLISS uses a new form of variational inference known as Forward Amortized Variational Inference (FAVI). FAVI has scaling advantages over Markov chain Monte Carlo and achieves improved fidelity of the posterior approximation compared with traditional variational inference in our application. The BLISS inference routine is fast, requiring a single forward pass of the encoder networks on a GPU once the networks are trained. BLISS can perform fully Bayesian inference on megapixel images in seconds and produces more accurate catalogs than traditional methods do. BLISS is highly extensible and has the potential to directly answer downstream scientific questions in addition to producing probabilistic catalogs.
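The amortized-inference idea behind FAVI can be sketched in a few lines: train an encoder on (latent, image) pairs simulated from the generative model so that it maximizes log q(latent | image); afterward, posterior inference on a new image is a single forward pass. The toy model below (a scalar "flux" latent with additive noise) is an illustrative assumption, not the BLISS codebase.

```python
import torch
import torch.nn as nn

# Toy generative model: latent "flux" z ~ N(0, 1); image x = z + noise.
prior = torch.distributions.Normal(0.0, 1.0)

def render(z):
    return z + 0.1 * torch.randn_like(z)

# Encoder maps an image to the parameters of the variational posterior q(z | x).
class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(1, 2)  # outputs (mean, log_std)

    def forward(self, x):
        mean, log_std = self.net(x.unsqueeze(-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean.squeeze(-1), log_std.exp().squeeze(-1))

encoder = Encoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-2)

# FAVI-style training: the expectation is taken under the generative model,
# i.e., simulate (z, x) pairs and maximize E[log q(z | x)].
for step in range(1000):
    z = prior.sample((64,))                # latents drawn from the prior
    x = render(z)                          # simulated images
    loss = -encoder(x).log_prob(z).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Once trained, posterior inference on a new image is one forward pass.
q = encoder(render(prior.sample((1,))))
print(q.mean.item(), q.stddev.item())
```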