About ============== Project Scope ___________________ In this project we aim to develop a probabilistic programming language that supports a distributed computational model like MapReduce. This language will be implemented as a Scala API, so that the user can mix probabilistic and non-probabilistic code in the application, and will be integrated into Spark and Flink, two platforms for big data processing. This programming language will be developed as an open software project freely available to the scientific and professional community interested in the use of machine learning techniques (ML) for big data problems. | .. image:: _static/img/ML.jpg :scale: 30 % :align: center | ML is a field of research that is widely established providing many solutions to major problems of society such as web search engines, genetic data analysis, self-driven cars, etc. The current availability by governments, institutions and corporations of large volumes of data is accelerating the development of such applications. However, developing a ML algorithm that fits a specific problem requires of programmers with expertise in various fields such as statistics, probabilistic modeling, optimization algorithms, etc. This translates, in most of the cases, to the need of highly qualified people who either are not available in the labor market or represent a high cost for the project. In the case of dealing with large volumes of data, these problems are exacerbated since it also requires that the designed ML algorithm to be parallelizable and scalable. All these factors are weighing on the process of ML applications development. So, usually, only large corporations have the technical and financial capacity to carry them out. For these reasons, the prestigious US Defense Advanced Research Projects Agency (DARPA) is currently funding a major programme in the area of probabilistic programming languages (PPLs). According to this agency, the PPLs could offer a solution to all these problems, since they allow to separate the model specification from the learning algorithm. For many experts, the PPLs could revolutionize the field of ML and scientific modeling in the same way that the appearance of the general-purpose programming languages revolutionized the field of software development fifty years ago, by freeing developers from the need to know the details of the hardware on which the program was running. | .. image:: _static/img/ML2.png :scale: 70 % :align: center | One of the main problems still unresolved in this field is the definition of PPLs with an inference engine that is scalable and has the capacity to process large volumes of data. In this project we aim to extend and adapt the developments of the European research project AMIDST (FP7-ICT-619 209) to achieve this goal. The AMIDST project, where both the main researcher and the supervisor are involved, is based on the development of scalable inference and learning algorithms for probabilistic graphical models, in the context of massive data streams. The project INFER.scala aims to define a PPL whose programs can be compiled to a probabilistic graphical model and to build a scalable inference engine for this PPL on top of the AMIDST's algorithms. Why PPLs? ___________________ Machine learning (ML) [1,2] is, at present, a fundamental part of many artificial intelligence techniques [3]. According to this field of research, it is much more effective to develop methods that allow machines to learn to perform tasks on their own, than to program them specifically for each one. This idea has been a real revolution in many other fields of research such as natural language processing [4], predictive analysis [2], cybersecurity [5], bioinformatics [6], and so on. In addition, it has allowed the development of highly innovative applications such as self-employed cars [7], image search [8], junk e-mail filters [9], recommendation systems [10], etc. One of the main catalysts of this technology has been the current availability by governments, institutions and corporations of large data volumes [11]. This has allowed ML algorithms to have a greater predictive and modeling capacity [10]. Another reason is the availability of open-source platforms that allow a low-cost cluster of computers to be controlled in a transparent and comparatively simple manner using data replication and hardware failure resistance techniques [12]. This has greatly reduced the cost of storing and processing large volumes of data. Most of these platforms have been developed in the last decade and come from the United States. Those based on the MapReduce programming model [12] are those that are currently being used more and have a higher level of development. Notably Hadoop [13], an open source version of the original model submitted by Google in 2004 [12]; Spark [14] (http://spark.apache.org/), a more modern and recent platform with greater processing capacity, which is being widely adopted in the business and scientific world; and Flink [15] (https://flink.apache.org), probably the only relevant initiative in Europe right now, which is receiving a lot of attention from the free software community and which is the result of several research projects, including an FP7 framework project (http://stratosphere.eu). All these technological advances have greatly accelerated the development of ML applications over large volumes of data. | .. image:: _static/img/supply.jpg :scale: 90 % :align: center | Unfortunately, for a company or an institution, the development of ML models specific to their problems requires enormous effort, mainly for the following reasons [16,17]: - **It is necessary to have highly qualified experts**. Develop ML models requires knowledge in various fields such as statistics, probabilistic modeling, optimization algorithms, specialized software, parallel and distributed computing, etc. - **Programming a ML model is a complex task where many problems are intermingled**, such as the difficulty of testing the correct functioning of the implementation (a ML model is approximate in nature and the optimal response is unknown in most cases) and the shortage of reusable tools. For this reason, ML applications are often created from scratch. - **It is difficult to find the ML model most suitable for an application**. The chosen models tend to have limited expressivity (ie, linear or non-linear, probabilistic or non-probabilistic approach, ...). In addition, the code that implements the model usually inherits these limitations, making it difficult to inspect, maintain or improve. This also makes it complex to integrate the expert knowledge about the domain of the application. As a result, applications often use generic modeling hypotheses instead of being specific to the application domain. All these factors are weighing the development of ML applications and making it only available for large corporations that have the technical and financial capacity to carry them out. For all these reasons, the prestigious American Agency for Advanced Research Projects (DARPA) has today opened a research funding program in the area of **probabilistic programming languages** (PPLs)[16]. According to this agency, PPLs could offer a solution to all these problems. What's a PPL? ___________________ Broadly speaking, a PPL is like a standard programming language but has two added constructs [16]: (1) the ability to sample random values from a random variable, and (2) the ability to condition the value of a random variable of the program through observations of other random variables. In this way, a program written with a PPL is able to define a probabilistic model [2]. Models in areas as diverse as artificial vision, coding theory, cryptographic protocols and biology can be expressed using PPLs [1,2,17]. | .. image:: _static/img/PPL.jpg :scale: 80 % :align: center | The central element of a PPL is its inference engine [16], which allows to explicitly query the probability distribution specified in the probabilistic program. Depending on the application, the output of the inference engine may be the expected value of a function with respect to this probability distribution, the mode of the distribution, or simply a set of sampled values from the distribution. In this way, a programmer can directly construct a ML model of the phenomenon of interest using the appropriate PPL commands. When the model depends on an unknown quantity, developers introduce a random variable that can be associated with a wide range of probability distributions [2]. A PPL Example ___________________ Consider the following example. A programmer has access to a flow of measurements of various temperature and smoke sensors in a given area and wants to develop a program that alerts the presence of a fire based on these measurements. The programmer has the following expert knowledge: - He knows that the presence of a fire is a rare and very unusual event. - He knows that if there is a fire, the temperature will rise and smoke will occur, and that this situation is going to be captured by the sensors. - He also knows that the sensors are not accurate and may contain noise in their measurements. - It is not clear what the temperature range is when there is no fire, since a requirement is that the software is valid for different geographical areas with different climates. These four elements of expert knowledge could be modeled, respectively, by a probabilistic model as follows: - **Fire as a rare element**: Using a binary variable with a Beta distribution as a priori distribution that expresses that the probability that this binary variable is active is quite low. - **Increased temperature and smoke as signs of fire**: These relationships can be defined through the use of graphic models with qualitative links [18,19]. - **Noise in measurements**: This is a standard relationship that can be modeled by introducing a standard noise model for Gaussian and binary variables, respectively, where the noise level is modeled with another hidden transparent variable to the programmer [20]. - **Uncertainty in the range of temperatures**: This can be solved by having access to a historical data with which to train the model. A fictional example of how a PPL could encode this model in an integrated way with a MapReduce tool is given in Figure 1 (follows a Java 8-based syntax). .. figure:: https://raw.githubusercontent.com/amidst/Infer.scala/master/docs/figs/code_low.jpg :alt: PPL Example PPL Example If, for example, the area to be monitored wasisoutside and the normal temperature depends on the season, it would be very easy for a programmer to update the previous code to consider this new situation. Expected Impact of PPLs ______________________________________ This approach offers the same advantages to the ML community that high-level programming languages offered to the software developer community fifty years ago [21]. Prior to the arrival of high-level languages, developers had to understand both the desired application and the low-level details of the hardware on which the application was running. But once these languages were developed, programming experts were able to focus on application development, while hardware experts focused on the development of compilers. This advance also had an associated cost in terms of execution times and memory consumption of the applications developed, but it entailed a huge increase in the productivity of the development of computer applications. | .. image:: _static/img/impact.png :scale: 80 % :align: center | Probabilistic programming languages could have a similar effect on the development of ML applications. Programmers could specialize in model development while ML experts could focus their efforts on developing reusable inference engines. ML applications developed in this way are likely to suffer some run-time penalties for tailor-made applications, but again this cost will be largely covered by the huge increase in productivity in the development of ML applications. In addition, the number of non-experts who can create applications using a PPL could be vastly greater than now. For these reasons, many experts believe that PPLs could revolutionize the ML field and scientific modeling in general [16,17]. Bibliography ___________________ [1] Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006). [2] Murphy, K. P. Machine Learning: A Probabilistic Perspective (MIT Press, 2012). [3] Russell, S. and Norvig, P. Artificial Intelligence: a Modern Approach (Prentice–Hall, 1995). [4] Sebastiani, Fabrizio. "Machine learning in automated text categorization." ACM computing surveys (CSUR) 34.1 (2002): 1-47. [5] Dua, Sumeet, and Xian Du. Data mining and machine learning in cybersecurity. CRC press, 2011. [6] Larrañaga, Pedro, et al. "Machine learning in bioinformatics." Briefings in bioinformatics 7.1 (2006): 86-112. [7] Markoff, John. "Google Cars Drive Themselves, in Traffic." New York Times 9 (2010). [8] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. "A fast learning algorithm for deep belief nets." Neural computation 18.7 (2006): 1527-1554. [9] Guzella, Thiago S., and Walmir M. Caminhas. "A review of machine learning approaches to spam filtering." Expert Systems with Applications 36.7 (2009): 10206-10222. [10] Condie, Tyson, et al. "Machine learning for big data." Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013. [11] Zikopoulos, Paul, and Chris Eaton. Understanding big data: Analytics for enterprise class hadoop and streaming data. McGraw-Hill Osborne Media, 2011. [12] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [13] Shvachko, Konstantin, et al. "The hadoop distributed file system." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010. [14] Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010. [15] Alexandrov, Alexander, et al. "The Stratosphere platform for big data analytics." The VLDB Journal—The International Journal on Very Large Data Bases 23.6 (2014): 939-964. [16] Gordon, Andrew D., et al. "Probabilistic programming." Proceedings of the on Future of Software Engineering. ACM, 2014. [17] Ghahramani, Zoubin. "Probabilistic machine learning and artificial intelligence." Nature 521.7553 (2015): 452-459. [18] Druzdzel, Marek J., and Linda C. Van Der Gaag. "Elicitation of probabilities for belief networks: combining qualitative and quantitative information." Proceedings of the Eleventh conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1995. [19] Masegosa Andres R. et al. "Learning from incomplete data in Bayesian networks with qualitative influences". Submitted to International Journal of Approximate Reasoning. 2015. [20] Masegosa, Andr s R., and Serafın Moral. "An interactive approach for cleaning noisy observations in Bayesian networks with the help of an expert." 6th European Workshop on Probabilistic Graphical Models (PGM 2012). 2012. [21] Wexelblat, Richard L., ed. History of programming languages. Academic Press, 2014. [22] Goodman, Noah D. "The principles and practice of probabilistic programming." ACM SIGPLAN Notices. Vol. 48. No. 1. ACM, 2013. [23] Wood, Frank, Jan Willem van de Meent, and Vikash Mansinghka. "A new approach to probabilistic programming inference." Proceedings of the 17th International conference on Artificial Intelligence and Statistics. 2014. [24] Mansinghka, Vikash, Daniel Selsam, and Yura Perov. "Venture: a higher-order probabilistic programming platform with programmable inference." arXiv preprint arXiv:1404.0099 (2014). [25] Goodman, Noah, et al. "Church: a language for generative models." arXiv preprint arXiv:1206.3255 (2012). [26] Lunn, David J., et al. "WinBUGS-a Bayesian modelling framework: concepts, structure, and extensibility." Statistics and computing 10.4 (2000): 325-337. [27] Fischer, Bernd, and Johann Schumann. "AutoBayes: A system for generating data analysis programs from statistical models." Journal of Functional Programming 13.03 (2003): 483-508. [28] Cooper, Gregory F. "The computational complexity of probabilistic inference using Bayesian belief networks." Artificial intelligence 42.2 (1990): 393-405. [29] Hastings, W. Keith. "Monte Carlo sampling methods using Markov chains and their applications." Biometrika 57.1 (1970): 97-109. [30] Gilks, Walter R., and Pascal Wild. "Adaptive rejection sampling for Gibbs sampling." Applied Statistics (1992): 337-348. [31] Maclaurin, Dougal, and Ryan P. Adams. "Firefly Monte Carlo: Exact MCMC with subsets of data." arXiv preprint arXiv:1403.5693 (2014). [32] Paige, Brooks, and Frank Wood. "A compilation target for probabilistic programming languages." arXiv preprint arXiv:1403.0504 (2014). [33] Minka, Tom, et al. "Infer .NET 2.5." Microsoft Research Cambridge (2012). [34] Kschischang, Frank R., Brendan J. Frey, and Hans-Andrea Loeliger. "Factor graphs and the sum-product algorithm." Information Theory, IEEE Transactions on 47.2 (2001): 498-519. [35] Wainwright, Martin J., and Michael I. Jordan. "Graphical models, exponential families, and variational inference." Foundations and Trends® in Machine Learning 1.1-2 (2008): 1- 305. [36] Minka, Thomas P. "Expectation propagation for approximate Bayesian inference." Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2001. [37] Winn, John M., and Christopher M. Bishop. "Variational message passing." Journal of Machine Learning Research. 2005. [38] Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "Black box variational inference." arXiv preprint arXiv:1401.0118 (2013).