The 1st SciNLP workshop will be at AKBC 2020!
Join the mailing list to receive announcements.
We are taking abstract submissions through June 7th! Instructions here
Registration (through the AKBC conference) is not open yet. More info here.
The primary goal of this half-day workshop is to bring together researchers from diverse fields who are interested in extracting and representing knowledge from scientific text, and/or applications or methods for improving access to and understanding of such knowledge. Such research includes, but is not limited to:
We welcome research relevant to processing text in any domain of science (e.g. Biology, Medicine, Computer Science, Physics, Economics, Sociology) drawn from a variety of text sources (e.g. scholarly papers, surveys and technical reports, patents, tweets by scholars, blogs/tutorials).
We welcome submissions of short abstracts (1 page max) related to the above research areas. Submissions may include previously published results, late-breaking results, and work in progress. Relevant submissions will be accepted for video presentation in the virtual poster session. The workshop is non-archival, so participants are free to also submit their work for publication elsewhere.
To submit an abstract, please send an email to email@example.com with the subject line “SCINLP submission: [TITLE]”. Please include:
Writing guidelines: These abstracts can be longer than the typical abstracts for a full research paper. Figures and tables are allowed, but will count toward the length limit. References will not count toward the length limit. Abstracts do not have to be about a single paper; we allow abstracts that summarize a collection of works under a unified theme (e.g., a series of closely-related papers that build on each other or tackle a common problem). For writing examples, see the accepted abstracts from last year’s SLKB workshop at AKBC 2019.
All accepted abstracts (and their videos) will be made available online prior to the workshop and remain accessible afterwards.
If you have a disability and require accommodation in order to fully participate in the workshop, please let us know and we’ll be in touch to discuss how we can best address your needs.
Can’t make it? Check out these other upcoming workshops related to NLP and text mining over scientific text:
The tentative schedule is:
Total estimated 285 minutes (4.75 hrs). All times are in PST:
Each invited talk is roughly 25 min (which includes a few minutes of Q&A and buffer time for transitions).
In light of the activity from the computing community to help with the current virus epidemic, we felt it important to hold a panel discussion on the role of NLP and text mining over scientific text (in particular biomedical literature).
We’ve invited a few of our speakers who regularly collaborate with biomedical domain experts to share their thoughts and answer questions.
We will be curating questions from the audience beforehand (as well as live during the discussion).
Insights from the Organization of International Challenges on Artificial Intelligence in Medical Question Answering
Artificial intelligence (AI) is playing an increasingly important role in our access to information. However, a one-size-fits-all approach is suboptimal, especially in the medical domain where health-related information is more sensitive due to its potential impact on public health, and where domain-specific aspects such as technical language and case or context-based interpretation have to be taken into account. Bridging the gap between several research areas such as AI, NLP, medical informatics, and computer vision is a promising way to achieve reliable and efficient access to medical information. In recent years, I have organized several international challenges to promote research efforts in medical question answering. The organization of these competitions raised key questions in data design, evaluation metrics, and problem formulation. It also offered valuable insights on the critical subtasks that need to be solved, and on the most promising solutions in challenging problems such as restricted training data and multidisciplinary tasks. In this talk, I will share all these insights and the promising perspectives in the addressed tasks, including textual question answering and visual question answering.
Dr. Asma Ben Abacha is a staff scientist at the U.S. National Institutes of Health (NIH), National Library of Medicine (NLM), Lister Hill National Center for Biomedical Communications. Prior to joining the NLM in 2015, she was a researcher at the Luxembourg Institute of Science and Technology and lecturer at the University of Lorraine, France. Dr. Ben Abacha received a Ph.D. in computer science from Paris 11 University, France, a research master’s degree from Paris 13 University, and a software engineering degree from the National School of Computer Sciences (ENSI), Tunisia. She is currently working on medical question answering, visual question answering, and NLP-related projects in the medical domain.
Mining the Citation Graph for Representation Learning and Concept Extraction
The exploding pace of scientific publication has led to a pressing need for tools that automatically make sense of the scientific literature. In this talk, I will describe two recent, simple methods for mining the citation graph to extract meaning from scientific documents. First, representation learning forms the foundation of today’s natural language processing systems, and large pretrained language models (LMs) like BERT learn powerful representations for short texts like words and sentences. However, naively applying the models to produce representations for entire scientific documents, which are necessary for many applications, is ineffective. I will introduce SPECTER, a method for producing scientific document representations using a pretrained LM that is able to achieve state-of-the-art performance by fine-tuning the LM on the citation graph as a signal of document relatedness. Second, I will describe a new concept extraction technique called ForeCite that uses the intuition that new concepts tend to be introduced or popularized by a single paper. By mining this signal from the citation graph, ForeCite achieves much higher precision than previous techniques.
Doug Downey is a research scientist at the Allen Institute for AI, where he works on the Semantic Scholar team, and also an associate professor of Computer Science at Northwestern University. His research interests involve information extraction, natural language processing, and machine learning, with a particular focus on automatically extracting knowledge from large corpora to power new search and browsing experiences. He has won a best paper award at IJCAI, along with an NSF CAREER award, election to the DARPA Computer Science Study Group, and a Microsoft New Faculty Fellowship.
Putting a Face on Science: Analyzing Author Mentions in Science Journalism Reveals Wide-Spread Ethnic Bias
Media outlets play a key role in spreading scientific knowledge to the general public and raising the profile of researchers among their peers. Yet, given social biases and attention constraints, not all scholars receive equal media coverage. In this talk, I will describe a large-scale study across hundreds of thousands of news stories that uncovers systematic ethnic bias in which authors journalists mention by name when covering science. Using NLP techniques to analyze these stories and controlling for confounds, I will show that this ethnic bias is consistent across multiple types of news media, with even larger disparities for long-form journalism focused on science.
David Jurgens is an Assistant Professor at the University of Michigan in the School of Information and by courtesy in the Department of Computer Science. He received his PhD in Computer Science from UCLA. His research in computational social science combines new methods from natural language processing and data science to discover, explain, and predict human behavior in large social systems.
End-to-end Neural Models for Evidence Retrieval from Biomedical Literature
In this talk I will highlight some ongoing efforts at Google on improving discovery from biomedical literature. Most of the talk will focus on document and evidence retrieval, which is the most common entry point for literature tools. I will discuss three specific technical contributions: 1) synthetic question generation to train biomedical-targeted first-stage retrieval models; 2) retrieval models that encode sparse and dense representations in intuitive and flexible ways, which is critical for the domain; and 3) a joint model for document and evidence retrieval that significantly improves the system’s ability to select relevant pieces of evidence from returned documents. All topics in the presentation are key technologies in https://covid19-research-explorer.appspot.com/, http://cslab241.cs.aueb.gr:5000/, and Google’s or AUEB’s submissions to the annual BioASQ challenge. Joint work with many colleagues at Google Research and Athens University of Economics and Business.
Ryan McDonald has been a research scientist at Google since 2006 and an associate researcher at the Athens University of Economics and Business since 2017. In that time he has been involved in various research efforts that have had user impact on a number of Google products, including Search, Assistant, Translate and Cloud. Prior to Google he completed a PhD at the University of Pennsylvania, which focused on new models for multilingual dependency parsing. This work continued at Google and culminated in the creation of the Universal Dependencies project, co-founded by his team at Google and numerous external collaborators. He currently works on discovery from biomedical literature and other productivity-driven NLP challenges.
Looking for the dark matter within knowledge graphs
Knowledge graphs contain much useful information directly available, but also hidden information that could be leveraged in a variety of ways. Some of this dark matter includes negative instances, missing links, missing nodes and obfuscated patterns. Uncovering and using this hidden information can lead to bigger and more complete graphs, and also to a better understanding of the interaction between structured knowledge in knowledge graphs and unstructured knowledge in text collections. In this talk I will show our exploration of this dark matter in some of the most commonly used KGs – Freebase, NELL and WordNet – and discuss how the different nature of each of these graphs influenced our search and what we found.
Vivi Nastase is a research associate in the Institute for Natural Language Processing at the University of Stuttgart. She obtained a PhD from the University of Ottawa, Canada on the topic of semantic relations. She works mainly on lexical semantics, semantic relations, knowledge acquisition and language evolution, and has published about 100 articles on these topics, including a book on “Semantic Relations between Nominals” in the series Synthesis Lectures on Human Language Technologies.
Machine Reading for Precision Medicine
The advent of big data promises to revolutionize medicine by making it more personalized and effective, but big data also presents a grand challenge of information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day. Electronic medical records contain valuable information to speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency. However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale. In this talk, I’ll present Project Hanover, where we overcome the annotation bottleneck by combining deep learning with probabilistic logic, and by exploiting self-supervision from readily available resources such as ontologies and databases. This enables us to extract knowledge from millions of publications, reason efficiently with the resulting knowledge graph by learning neural embeddings of biomedical entities and relations, and apply the extracted knowledge and learned embeddings to supporting precision oncology.
Hoifung Poon is the Senior Director of Biomedical NLP at Microsoft Research and an affiliated professor at the University of Washington Medical School. He leads Project Hanover, with the overarching goal of structuring medical data for precision medicine. He has given tutorials on this topic at top conferences such as the Association for Computational Linguistics (ACL) and the Association for the Advancement of Artificial Intelligence (AAAI). His research spans a wide range of problems in machine learning and natural language processing (NLP), and his prior work has been recognized with Best Paper Awards from premier venues such as the North American Chapter of the Association for Computational Linguistics (NAACL), Empirical Methods in Natural Language Processing (EMNLP), and Uncertainty in AI (UAI). He received his PhD in Computer Science and Engineering from the University of Washington, specializing in machine learning and NLP.
Enriching a Web-scale Scientific Taxonomy by Combining Textual and Structural Information
Scientific knowledge is evolving at an unprecedented rate, with new concepts and relationships constantly being discovered from the millions of academic articles being published every month. The Microsoft Academic Graph (MAG) provides a comprehensive, cross-domain scientific taxonomy covering more than 550k concepts. This fast-growing volume of scientific literature accentuates a pressing need for automated capture of emerging knowledge with an updated web-scale taxonomy. In this talk, we introduce two major efforts currently underway to enable MAG to achieve this automated capture with minimal supervision. First, we leverage a BERT-based pre-trained language model (LM) and a web search API to identify candidate concept phrases from textual information in the latest publications. Second, we apply a self-supervised position-enhanced graph neural network (GNN) that encodes local structural information to expand our taxonomy with newly discovered concepts. These two approaches achieve highly accurate concept identification results, and indicate significant improvement of our taxonomy expansion compared with previous approaches. We also discuss the challenges and lessons learned while integrating these state-of-the-art LM and GNN models into the MAG system.
Iris Shen is a principal data scientist at Microsoft Research and holds a Ph.D. in Operations Research from the University of Southern California. She is the data science manager for the Microsoft Academic project, which uses state-of-the-art AI research to assist humans in scientific exploration. Her current research interests are leveraging techniques in data mining, natural language processing, and recommender systems to explore and understand large-scale document corpora with associated networked systems.
An Introduction to Papers with Code
This talk is an introduction to Papers with Code, a free resource for researchers and practitioners to find and follow the latest state-of-the-art ML papers and code. I will go deeper into the open dataset underlying Papers with Code: the collection of ML papers, code, tasks and results, with links between them. I will talk about the challenges of augmenting this resource and keeping it up to date using NLP techniques.
Robert is the co-creator of Papers with Code and a software engineer at Facebook AI. Robert started his career as one of the early developers of Wikipedia where he built the internal search engine. He went on to do a PhD in Applied ML in Computational Biology at the University of Cambridge. He co-founded a couple of start-ups, and co-created Papers with Code with Ross Taylor. Currently he is at Facebook AI in London where he is working on Papers with Code, and is passionate about open science and open access.
What does the evidence say? Models to help make sense of the biomedical literature
How do we know if a particular medical intervention actually works better than the alternatives for a given condition and outcome? Ideally one would consult all available evidence from relevant trials that have been conducted to answer this question. Unfortunately, such results are primarily disseminated in natural language articles that describe the conduct and results of clinical trials. This imposes a substantial burden on physicians and other domain experts trying to make sense of the evidence. In this talk I will discuss work on designing tasks, corpora, and models that aim to realize natural language technologies that can extract key attributes of clinical trials from articles describing them, and infer the reported findings regarding these. The hope is to use such methods to help domain experts (such as physicians) better access and make sense of unstructured biomedical evidence.
Byron Wallace is an assistant professor in the Khoury College of Computer Sciences at Northeastern University. He holds a PhD in Computer Science from Tufts University, where he was advised by Carla Brodley. He has previously held faculty positions at the University of Texas at Austin and at Brown University. His research is in machine learning and natural language processing, with an emphasis on their application in health informatics.
Registration for SciNLP will be through AKBC 2020, which should be opening soon. There will be a minimal fee for registration, but due to the virtual nature of the conference, the financial burden should be low. We will update this page with specific details once they’re available.
Contact us at firstname.lastname@example.org or on Twitter via #SciNLP!