Daniel Acuña

ORI: Methods and tools for scalable figure reuse detection with statistical certainty reporting

2018-07-01T00:00:00-06:00

Investigators

Daniel E. Acuña (PI)

Abstract

Fraudulent reuse of scientific figures is an increasingly common problem that damages the public perception of science. The Office of Research Integrity (ORI) reviews whistleblowers’ accusations carefully, taking a reactive approach to investigate this type of misconduct. Recently, Acuna, Brookes, and Kording (2018) used machine learning to detect figure reuse in PubMed Open Access articles that share the same junior or senior scientist. They estimated that around 0.6% of the papers were very likely fraudulent. Current image manipulation investigations, however, are reactive, originate from whistleblowers, do not scale across authors, and lack statistically supported verdicts.

In this project, we propose to dramatically scale automated detection of figure reuse across articles and to collaborate with ORIs and active researchers in the area. We also propose to develop statistical methods to support conclusions regarding figure reuse. Once this project is completed, we hope that the tools and techniques we develop will become standard practice. We expect our research to significantly reduce the acceptance of publications with image manipulation and, in turn, the incidence of one of the most damaging forms of scientific misconduct.

Acuna lab is looking for students to optimize science using machine learning (new Fall 2019)

2018-06-25T00:00:00-06:00

About the lab

Dr. Acuna is an Assistant Professor in the School of Information Studies at Syracuse University. He currently works on mathematical and computational models of scientific discovery, predictability, and integrity. Please take a moment to look at his background, research, and recent grants.

Professor Acuna teaches courses for the Applied Data Science and Information Management graduate degrees. He is currently the teacher and Professor of Record for the course IST 718: Big Data Analytics.

Past Master’s students have done internships in Silicon Valley (e.g., Airbnb, Google), are working in major consulting companies (e.g., Ernst & Young, Goldman Sachs), and are broadly working as data scientists. Please see the People section.

About the position

Assistant Professor Daniel Acuna from the School of Information Studies (https://acuna.io), leader of the newly formed Science of Science and Computational Discovery (SOS+CD) Lab, is looking for Master’s students to work on quantitative analysis of big data. Broadly speaking, the SOS+CD Lab works on understanding how science works and on semi-automatically generating scientific discoveries from vast, unstructured datasets of full-text publications, citations, and images. The SOS+CD Lab uses a variety of computational techniques, including deep learning, natural language processing, graph analytics, image processing, and causal inference. The ideal candidate should have an undergraduate major in Computer Science, Engineering, Applied Statistics, Mathematics, or a similar quantitative field.

Requirements

Develop reproducible software and tools to optimally match reviewers and manuscripts based on mathematical objective functions
Write method and result sections for scientific manuscripts
Have advanced computer programming skills in languages such as Python and R; SQL is also desirable
Understand linear algebra, calculus, probability and statistics
Understand machine learning software tools and pipelines in scikit-learn, R, or Spark ML
Understand basic concepts of software engineering
Have good communication skills

Qualifications

Undergraduate (for MS students) or graduate degree in Computer Science, Engineering, Applied Statistics, Applied Mathematics, or similar quantitative fields
Minimum of 2 years of experience with coding in a major programming language such as Python, R, C, C++, or Java. Experience with handling big data with Apache Spark is a plus.
Demonstrable knowledge of linear algebra, calculus, probability, and statistics

Apply

To apply, send an email to deacuna AT syr DOT edu and include:

A short introduction of yourself and why you want to work with me
A short CV or a 1-page resume
Your Github repository, preferably with code from a personal project rather than a “class project”.
Your transcripts
Your GRE, GMAT, or equivalent scores

If you have any questions, do not hesitate to contact me. If you are thinking of applying to the Ph.D. program, we have a very competitive, fully funded program, and you should contact me first. Otherwise, apply to the Ph.D. program and mention my name in your materials.

Part of the funding for these positions has been generously provided by the National Science Foundation awards #1646763 and #1800956.

NSF: Optimizing scientific peer review

2018-06-22T00:00:00-06:00

See in NSF

Investigators

Abstract

Scientific peer review is a central process when deciding who gets published, promoted, or awarded a prize or grant. Consequently, it may have tremendous impact on the career of scientists and the direction of science. Several researchers, however, have shown that scientific peer review can be slow and low-quality. Moreover, some studies have quantified peer review biases - e.g., prejudices against certain ideas - and inconsistencies - e.g., the same work receiving widely different opinions from different groups of peers. These problems delay or sometimes truncate the dissemination of important research, affecting technological development and ultimately the economy. This project analyzes factors that affect the outcomes of peer review, uses these to improve reviewer selection, develops software that optimizes reviewer assignments, and evaluates the resulting models in the real-world context of a scientific journal, major scientific conferences, and massive open, online courses (MOOCs). By the end of this project, the scientific community will have a better understanding of the factors that affect peer review and actionable insights to make peer review better.

The first component of this project quantifies problems in bias, variance, timing, and quality of reviews. This includes direct effects (e.g., whether author and reviewer collaborate or cite one another) and indirect effects (e.g., whether they contribute to, and hopefully self-identify with, the same community). The project also identifies bias as a function of personal characteristics of author and reviewer. These aspects include age, gender, and minority status, and their visibility and centrality within the field. The same general approach is used to predict the timing of reviews, including the choice to accept the review task. Lastly, the research uses this feature set to predict the quality of reviews. The result, for a given manuscript, includes predictions of each possible reviewer’s biases and decision variance, likelihood and timing of participation in the review process, and ultimate review quality.

The second component of this project researches and develops techniques to estimate the characteristics of potential reviewers and uses those inferred characteristics to propose, for any given manuscript, a review panel. The techniques optimize the expected value of a cost function that balances the three objectives of reviewer choice variance (bias and covariance), review timing, and review quality. Presumably, this involves suggesting panels composed of reviewers with complementary expertise and potentially career stage, who understand the topic and are interested in the manuscript’s contents. The project allows the option of making these recommendations conditional on the background, characteristics, and position of the editor under consideration.

Lastly, the project tests the techniques that automatically assign reviewers and analyzes the output of the process in real-world applications. In particular, the project collaborates with a large journal, scientific conferences, and massive open online course (MOOC) organizations. Through random assignments (current methods versus the project’s algorithm), the project evaluates the degree to which the assignment approach produces less reviewer choice variance, faster reviews, and reviews of higher quality. The project creates software and results that can be used by other venues.
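
To make the idea of a cost function over review panels concrete, here is a minimal Python sketch. It is an illustration only, not the project's software: the `Reviewer` fields, the weights, the specific cost formula (mean squared bias as a stand-in for the variance objective, the slowest reviewer's delay for timing, and average predicted quality), and the brute-force search are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from statistics import mean


@dataclass(frozen=True)
class Reviewer:
    """Hypothetical per-reviewer predictions for one manuscript."""
    name: str
    bias: float     # assumed predicted score offset (0 means unbiased)
    days: float     # assumed predicted days until the review is returned
    quality: float  # assumed predicted review quality in [0, 1]


def panel_cost(panel, w_var=1.0, w_time=0.02, w_quality=1.0):
    """Illustrative cost: lower is better.

    Penalizes biased panels and slow panels, and rewards predicted quality.
    The weights are arbitrary values chosen for the example.
    """
    mean_sq_bias = mean(r.bias ** 2 for r in panel)
    slowest = max(r.days for r in panel)  # a panel waits for its slowest member
    avg_quality = mean(r.quality for r in panel)
    return w_var * mean_sq_bias + w_time * slowest - w_quality * avg_quality


def best_panel(candidates, size=2, **weights):
    """Exhaustively search all panels of the given size (fine for small pools)."""
    return min(combinations(candidates, size),
               key=lambda panel: panel_cost(panel, **weights))


candidates = [
    Reviewer("A", bias=0.5, days=10, quality=0.90),
    Reviewer("B", bias=0.1, days=30, quality=0.80),
    Reviewer("C", bias=-0.1, days=14, quality=0.70),
    Reviewer("D", bias=0.0, days=60, quality=0.95),
]

chosen = best_panel(candidates, size=2)
print([r.name for r in chosen], round(panel_cost(chosen), 3))
```

Changing the weights shifts the trade-off: with `w_time=0`, for instance, the search ignores delays and favors the least biased, highest-quality pair. Exhaustive enumeration grows combinatorially with the pool size, so a real system would need a smarter optimizer.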

Bioscience-scale automated detection of figure element reuse

2018-02-23T00:00:00-07:00

Daniel E Acuna, School of Information Studies, Syracuse University
Paul S Brookes, University of Rochester Medical Center
Konrad P Kording, University of Pennsylvania

bioRxiv

Abstract
Scientists reuse figure elements sometimes appropriately, e.g. when comparing methods, and sometimes inappropriately, e.g. when presenting an old experiment as a new control. To understand such reuse, it is important to detect it automatically. Here we present an analysis of figure element reuse on a large dataset comprising 760 thousand open access articles and 2 million figures. Our algorithm detects figure region reuse, while being robust to rotation, cropping, resizing, and contrast changes, and estimates which of the reuses have biological meaning. Then a three-person panel analyzes how problematic these biological reuses are using contextual information such as captions and full texts. Based on the panel reviews, we estimate that 9% of the biological reuses would be unanimously perceived as at least suspicious. We further estimate that 0.6% of all articles would be unanimously perceived as fraudulent, with 43% of inappropriate reuses occurring across articles, 28% within an article, and 29% within a figure. Our tool rapidly detects image reuse at scale, promising to be useful to a broad range of people who campaign for scientific integrity. We suggest that a great deal of scientific fraud will be, sooner or later, detectable by automatic methods.
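
The abstract does not describe the matching procedure, so the following Python sketch is not the authors' algorithm. It only illustrates, with a standard swapped-in technique (zero-mean normalized cross-correlation), what robustness to contrast changes can mean: rescaling and shifting pixel intensities leaves the similarity score essentially unchanged. The function name `zncc` and the toy patches are assumptions for this example, and the sketch does not handle rotation, cropping, or resizing.

```python
from math import sqrt


def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equal-size grayscale patches.

    Returns a value in [-1, 1]; values near 1 mean the patches match up to a
    linear change in brightness and contrast.
    """
    a = [p for row in patch_a for p in row]
    b = [p for row in patch_b for p in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [x - mean_a for x in a]
    dev_b = [x - mean_b for x in b]
    denom = sqrt(sum(x * x for x in dev_a) * sum(x * x for x in dev_b))
    if denom == 0:
        return 0.0  # a flat patch has no structure to correlate; treat as no match
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denom


# Toy 3x3 patches (made-up pixel values).
original = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
adjusted = [[2 * p + 15 for p in row] for row in original]  # contrast and brightness change
unrelated = [[90, 10, 50], [20, 80, 30], [60, 40, 70]]

print(zncc(original, adjusted))   # close to 1: same content, different contrast
print(zncc(original, unrelated))  # much lower: different content
```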

IST 718: Big Data Analytics

2018-02-18T00:00:00-07:00

This is an advanced course: There seem to be no official prerequisites in Syracuse University’s catalog system for taking this class. Most students have already taken IST 687 - Introduction to Data Science, which is a nice introduction to the field. However, students will be expected to know programming in Python or R and have some background in linear algebra, calculus, probability, and statistics as well. This means that even if you register for the class, you might not have the necessary background to take full advantage of what this class has to offer.
If you are in doubt, take the following test, which you should be able to solve relatively easily:
Preliminary test

In the past, I have suggested students go through the following courses to grasp the basic math required to be a good data scientist:

Linear algebra: MIT OCW’s Linear Algebra course, which is free. The first couple of lectures cover most of what you need.
Calculus: MIT OCW’s free Calculus course. I would recommend Parts A and B for IST 718.
Probability and statistics: I would recommend the first chapters of DeGroot and Schervish’s book “Probability and Statistics”.
Programming: There are plenty of resources online about programming. For programming in Python, I would recommend Jake VanderPlas’s “Python Data Science Handbook”.

Goal

This course is a broad introduction to modern techniques in data science including elastic net regularized regression, random forest, gradient boosting, and deep learning. It emphasizes a statistical learning point of view, and a careful examination of generalization error, model interpretability, feature engineering, and bias-variance tradeoff.

Tools

The tool of choice is Apache Spark on Hadoop’s HDFS. The environment we use is Databricks Community Edition, which runs a highly customized version of the Jupyter Notebook.

Prerequisites

The prerequisites for this course are a basic knowledge of discrete mathematics, calculus, probability, and Python.

We use the following books:

Python for Data Analysis (PFDA), 2nd Edition, by Wes McKinney
An Introduction to Statistical Learning with Applications in R (ISLR) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
Spark: The Definitive Guide (STDG) by B. Chambers and M. Zaharia, upcoming (expected 2018)
Deep Learning (DL) by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Syllabus

NSF: EAGER: Improving scientific innovation by linking funding and scholarly literature

2016-09-01T00:00:00-06:00

See in NSF

Investigator:

Daniel E. Acuña

Abstract

This project identifies scientists and organizations and their topical interests, enabling the tracking of past productivity and impact. By linking scholarly literature and grants, this project creates a unified dataset that captures diverse scientific disciplines and federal grant award types. A web-based tool levels the playing field for scientists lacking knowledge about research and funding programs. Users are expected to spend less time searching the literature and more time evaluating significance and impact.

This project consolidates disparate repositories of publications and grants, disambiguates and enriches information about scientists and organizations, and builds a web-based tool to help navigate this information. This project solves many of these issues by modeling the relationship between approximately 2.6 million grants from the Federal RePORTER and a consolidated, multi-source dataset of millions of articles from Microsoft Academic Graph (83 M), MEDLINE (25 M), PubMed Open Access Subset (1 M), arXiv (0.6 M), and the National Bureau of Economic Research [NBER] (14 K). The project creates a web-based tool that generates instantaneous reports about publications, grants, scientists, and organizations related to users’ interests. The unified dataset and web tool could revolutionize how Program Officers evaluate proposals and how researchers find fundable ideas, making science faster, more accurate, and less biased.

Articles (7)

Achakulvisut, T, Acuna, DE, Bassett, DS, Kording, KP, Unique subfields of neuroscience exhibit more diverse language Link
Liénard, JF, Achakulvisut, T, Acuna, DE, David, SV, Intellectual Synthesis in Mentorship Determines Success in Academic Careers Link
Harandi, M, Acuna, DE, Differences in productivity patterns for junior and senior NSF grantees
Teplitskiy, M, Acuna, DE, Elamrani-Raoult, A, Körding, K, Evans, J, The Social Structure of Consensus in Scientific Review Link
Acuna, DE, Brookes, PS, Kording, KP (2018) Bioscience-scale automated detection of figure element reuse, bioRxiv, Link
Shema, A, Acuna, DE (2017) Show Me Your App Usage and I Will Tell Who Your Close Friends Are: Predicting User’s Context from Simple Cellphone Activity, CHI 2017, Pages 2929-2935, Denver, Colorado Link
Achakulvisut T, Acuna DE, Ruangrong T, Kording K (2016) Science Concierge: A Fast Content-Based Recommendation System for Scientific Publications. PLoS ONE 11(7): e0158423. doi:10.1371/journal.pone.0158423 Link

Web service and software

eileen.io
- Data ingestion pipeline
- Front end (soon)
- Back end (soon)