💬 About Me

Hi! I am Xinyu Yang (杨心妤), a first-year Information Science Ph.D. student at Cornell University, advised by Prof. Yian Yin. I am broadly interested in Applied Machine Learning, Data Science, and AI for Common Good. Previously, I received my bachelor's degree in Computer Science and Technology from Zhejiang University, China. During my undergraduate studies, I was fortunate to be a research intern at the Stanford AI Lab, advised by Prof. James Zou. I also had a wonderful research internship at the Zhejiang University Digital Media Computing and Design (DCD) Lab, advised by Prof. Fei Wu.

📖 Education

  • 2023.08 - Present Ph.D. in Information Science, Cornell University, Ithaca, NY, USA
  • 2019.09 - 2023.06 B.E. in Computer Science and Technology, Zhejiang University, Hangzhou, China

📝 Publications and Preprints

* denotes equal contribution


Navigating Dataset Documentation in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Xinyu Yang*, Weixin Liang*, James Zou

ICLR 2024

[PDF], [Code]

  • We present a comprehensive large-scale analysis of the documentation of 7,433 ML datasets on Hugging Face.
  • Based on our findings, we emphasize the importance of comprehensive dataset documentation and offer practitioners suggestions for writing documentation that promotes the reproducibility, transparency, and accessibility of their datasets, which can help improve the overall quality and usability of datasets in the community.

What’s documented in AI? Systematic Analysis of 32K AI Model Cards

Weixin Liang*, Nazneen Rajani*, Xinyu Yang*, Ezinwanne Ozoani, Eric Wu, Yiqun Chen, Daniel Scott Smith, James Zou

Nature Machine Intelligence 2024

[PDF], [Code]

  • We conduct a comprehensive analysis of 32,111 AI model cards on Hugging Face, providing a systematic assessment of community norms and practices around model documentation through large-scale data science and linguistic analysis.
  • Our findings reveal that while most popular models have model cards, they often vary in detail. Sections on environmental impact, limitations, and evaluation are frequently incomplete, whereas training details are more consistently provided. We also analyze the content of each section to characterize practitioners’ priorities.

Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Weixin Liang*, Yuhui Zhang*, Hancheng Cao*, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel A. McFarland, James Zou

NEJM AI 2024

[PDF], [Twitter], [Code]

  • We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
  • Our results suggest that LLM and human feedback can complement each other. While human expert review is, and should continue to be, the foundation of the rigorous scientific process, LLM feedback could benefit researchers, especially when timely expert feedback is unavailable and in the earlier stages of manuscript preparation, before peer review.

Reconnecting the Broken Civilization: Patchwork Integration of Fragments from Ancient Manuscripts

Yuqing Zhang*, Zhou Fang*, Xinyu Yang*, Shengyu Zhang, Baoyi He, Huaiyong Dou, Junchi Yan, Yongquan Zhang, Fei Wu

ACM MM 2023 (Oral)

[PDF]

  • We developed a multimodal pipeline for reconstructing Dunhuang manuscript fragments, leveraging text-based localization and a self-supervised contour-matching framework, followed by a global reconstruction process. Our empirical evaluations show that this pipeline achieves a high success rate in fragment assembly.

Accuracy on the Curve: On the Nonlinear Correlation of ML Performance Between Data Subpopulations

Weixin Liang*, Yining Mao*, Yongchan Kwon*, Xinyu Yang, James Zou

ICML 2023

[PDF], [Website], [Video], [Code]

  • We show that there is a “moon shape” correlation (a parabolic uptrend curve) between test performance on the majority subpopulation and on the minority subpopulation. This nonlinear correlation holds across model architectures, training settings, datasets, and the degree of imbalance between subpopulations.

MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts

Weixin Liang*, Xinyu Yang*, James Zou

Contributed Talk at the ICML 2022 Workshop on Shift Happens: Crowdsourcing metrics and test datasets beyond ImageNet

[PDF], [Website], [Video], [Code]

  • MetaShift introduces a collection of more than 10K sets of images with annotated contexts, enabling evaluation of how ML models perform in different contexts (e.g., indoor cats vs. outdoor cats).
  • We provided a distance score that measures the amount of distribution shift between any two of the sets.
  • We presented methods to match labels to the ImageNet hierarchy via WordNet IDs and to construct classification tasks over MetaShift, enabling evaluation of off-the-shelf ImageNet models.

🎖 Honors and Awards

  • 2022 - 2023 Outstanding Graduate of Zhejiang Province
  • 2022 - 2023 Outstanding Graduate of Zhejiang University
  • 2021 - 2022 First-Class Scholarship for Outstanding Students of Zhejiang University (Top 3%)
  • 2020 - 2021 Second-Class Scholarship for Outstanding Students of Zhejiang University (Top 8%)
  • 2019 - 2020 National Scholarship (Top 1%)
  • 2019 - 2020 First-Class Scholarship for Outstanding Students of Zhejiang University (Top 3%)
  • 2019 - 2020 Outstanding Student Award of Yunfeng College (15 out of 800 students)

🗒 Teaching

  • Spring 2023 Python Programming, Zhejiang University, Teaching Assistant

🎉 Misc

  • I started learning Chinese Calligraphy (the traditional art of writing Chinese characters with ink and a brush) and sketching when I was nine. [Gallery]
  • I love art and sports. I enjoy painting, photography, table tennis, piano, and classical music.