Programming with Millions of Examples

What we do


The vast amount of code available on the web is increasing on a daily basis. Open-source hosting sites such as GitHub contain billions of lines of code. Community question-answering sites provide millions of code snippets with corresponding text and metadata. The amount of code available in executable binaries is even greater. In this project, we develop techniques for learning from such "big code" and leveraging the learned models for program analysis, program synthesis and reverse engineering. Along the way, we explore a range of semantic program representations (e.g., symbolic automata, tracelets, and numerical abstractions), different statistical models capturing regularities in a code base, as well as different models for similarity. To put the techniques to the test, we explore their applications to semantic code search, code completion and reverse engineering.

[Supported by an ERC grant]

Publications


Code2Seq: Generating Sequences from Structured Representations of Code
Uri Alon, Omer Levy, and Eran Yahav.
ICLR'19: International Conference on Learning Representations
[pdf] [online demo] [code and models]
Code2Vec: Learning Distributed Representations of Code
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav.
POPL'19: Principles of Programming Languages
[pdf] [online demo] [code and models]
Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples
Gail Weiss, Yoav Goldberg, and Eran Yahav.
ICML'18: The International Conference on Machine Learning
[pdf] [TL;DR]
On the Practical Computational Power of Finite Precision RNNs for Language Recognition
Gail Weiss, Yoav Goldberg, and Eran Yahav.
ACL'18: Annual Meeting of the Association for Computational Linguistics
[pdf] [TL;DR]
A General Path-Based Representation for Predicting Program Properties
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav.
PLDI'18: Programming Language Design and Implementation
[pdf]
Programming Not Only by Example
Hila Peleg, Sharon Shoham, and Eran Yahav.
ICSE'18: International Conference on Software Engineering
[pdf]
Generating Tests by Example
Hila Peleg, Dany Rasin, and Eran Yahav.
VMCAI'18: International Conference on Verification, Model Checking, and Abstract Interpretation
Synthesis with Abstract Examples
Dana Drachsler Cohen, Sharon Shoham, and Eran Yahav.
CAV'17: Computer Aided Verification
[pdf]
Learning Disjunctions of Predicates
Nader Bshouty, Dana Drachsler Cohen, Martin Vechev, and Eran Yahav.
COLT'17: Conference On Learning Theory
Synthesis of Forgiving Data Extractors
Adi Omari, Sharon Shoham, and Eran Yahav.
WSDM'17: ACM Conference on Web Search and Data Mining
Similarity of Binaries through Re-optimization
Yaniv David, Nimrod Partush, and Eran Yahav.
PLDI'17: Programming Language Design and Implementation
Leveraging a Corpus of Natural Language Descriptions for Program Similarity
Meital Zilberstein and Eran Yahav.
ONWARD'16: Symposium on New Ideas in Programming and Reflections on Software
[PDF] [like2drops]
Extracting Code from Programming Tutorial Videos
Shir Yadid and Eran Yahav.
ONWARD'16: Symposium on New Ideas in Programming and Reflections on Software
[PDF] [video]
Lossless Separation of Web Pages into Layout Code and Data
Adi Omari, Benny Kimelfeld, Sharon Shoham, and Eran Yahav.
KDD'16: ACM SIGKDD Conference on Knowledge Discovery and Data Mining
[PDF]
Cross-Supervised Synthesis of Web-Crawlers
Adi Omari, Sharon Shoham, and Eran Yahav.
ICSE'16: the 38th International Conference on Software Engineering
[PDF]
Statistical Similarity of Binaries
Yaniv David, Nimrod Partush, and Eran Yahav.
PLDI'16: Programming Language Design and Implementation
[pdf] [TL;DR] [Esh]
D3: Data-Driven Disjunctive Abstraction
Hila Peleg, Sharon Shoham, and Eran Yahav.
VMCAI'16: International Conference on Verification, Model Checking, and Abstract Interpretation
[pdf] [TL;DR]
Estimating Types in Binaries using Predictive Modeling
Omer Katz, Ran El-Yaniv, and Eran Yahav.
POPL'16: ACM SIGPLAN Conference on Principles of Programming Languages
[pdf] [TL;DR]
Abstract Semantic Differencing via Speculative Correlation
Nimrod Partush and Eran Yahav.
OOPSLA'14: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [TL;DR]
Tracelet-Based Code Search in Executables
Yaniv David and Eran Yahav.
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [code] [TL;DR]
Code Completion with Statistical Language Models
Veselin Raychev, Martin Vechev, and Eran Yahav.
PLDI'14: ACM Conference on Programming Language Design and Implementation
[pdf] [slides] [TL;DR]
Symbolic Automata for Specification Mining
Peleg H., Shoham S., Yahav E., Yang H.
SAS'13: The 20th International Static Analysis Symposium
[pdf] [slides] [TL;DR]
Typestate-Based Semantic Code Search over Partial Programs
Mishne A., Shoham S., Yahav E.
OOPSLA'12: ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages and Applications
[pdf] [slides] [code] [TL;DR]

Brewing


Talks


Code2Vec

Informal talk at the "Machine Learning for Programming" workshop

Code2Vec

Uri Alon's POPL'19 Talk

Code2Seq

EPFL Seminar

Extraction from RNNs

Gail Weiss at ICML'18

RNN Computational Power

Gail Weiss at ACL'18

Symbolic Automata

Hila Peleg at MSR

Programming with Millions of Examples

A relatively old talk at a Zurich workshop that covers some of the ideas at a high level.

Programming with Millions of Examples

Talk at ETH Distinguished Colloquium, December 2014

Analysis and Synthesis with "Big Code"

Talk at Marktoberdorf Summer School 2015

Opportunities and Challenges in Program Similarity

Talk at ML4PL workshop

Analysis and Synthesis with "Big Code"

Talk at ECOOP Summer School 2015

Abstract Semantic Differencing for Numerical Programs

Talk at VSSE'13

Software


Code2Seq

Code to NL Sequences

Code2Vec

Code to vector

DFA Extraction

RNN to DFA

PRIME

Basic Java Implementation of PRIME

DIZY

Program Differencing

TRACY

Code Search in Binaries

Esh

Statistical Similarity of Binaries

Like2Drops

Cross-Language Similarity

Commercial


Codota

Codota AI Completions

Codota Index

Codota code search

Contact

Computer Science Department
Technion, Israel

+972 48294318
yahave@cs.technion.ac.il
