#AstroHackWeek 2015, day 3

Andy Mueller (NYU) started the day with an introduction to machine learning and the scikit-learn package (of which he is a developer). His tutorial was beautiful, as it followed a fully interactive and operational Jupyter notebook. There was an audible gasp in the room when the astronomers saw the magic of dimensionality reduction in the form of the t-SNE algorithm. It is truly magic! There was much debate in the room about whether it is useful for anything other than blowing your mind (i.e., visualization).
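For the record, running t-SNE in scikit-learn takes only a couple of lines. This is a generic sketch on the bundled digits data, not Mueller's actual notebook:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)        # (1797, 64) pixel features
X, y = X[:500], y[:500]                    # subsample to keep it quick
embedding = TSNE(n_components=2, random_state=0).fit_transform(X)
print(embedding.shape)                     # one 2-d point per digit image
```

Scatter-plotting `embedding` colored by `y` produces the mind-blowing cluster structure that got the gasp.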

In the afternoon, Baron and I refactored our galaxy projection/deprojection code in preparation for inferring the galaxy itself. This refactor was non-trivial (and we learned a lot about the Jupyter notebook!), but it was done by the end of the day. We discussed next steps for the galaxy inference, which we hope to start tomorrow!


#AstroHackWeek 2015, day 2

The second day of #AstroHackWeek started with Juliana Freire (NYU) talking to us about databases and data management. She also talked about map-reduce and its various cousins. Freire is the Executive Director of the Moore-Sloan Data Science Environment at NYU, and so also our boss in this endeavor (AstroHackWeek is supported by the M-S DSE, and also GitHub and the LSST project). Many good questions came up in Freire's discussion about the special needs of astronomers when it comes to database choice and customization. Freire opined that one of the reasons that the SDSS databases were so successful is that we had Jim Gray (Microsoft; deceased) and a team working full time on making them awesome. I agree!

In the afternoon, Dalya Baron (TAU) and I made fake data for our galaxy-deprojection project. These data were images with finite point-spread function and Gaussian noise. We then showed that by likelihood optimization we can (very easily, I am surprised to say) infer the Euler angles of the projection (and other nuisance parameters, like shifts, re-scalings, and background level). We also showed that if we have two kinds of three-dimensional galaxies making the two-dimensional images, we can fairly confidently decide which three-d galaxy made which two-d image. This is important for deprojecting heterogeneous collections. I ended the day very stoked about all this!
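The toy setup can be sketched like this (my own minimal re-creation, not our actual code: a single rotation angle stands in for the full Euler angles, the "galaxy" is one 3-d Gaussian, and the optimizer is a crude grid search):

```python
import numpy as np

rng = np.random.default_rng(42)
C = np.diag([9.0, 4.0, 1.0])                # 3-d covariance of the "galaxy"

def rot_x(a):
    """Rotation about the x axis by angle a."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def render(angle, xx, yy):
    """Rotate, project along the line of sight, evaluate on a pixel grid."""
    C2 = (rot_x(angle) @ C @ rot_x(angle).T)[:2, :2]   # marginalize out z
    P = np.linalg.inv(C2)
    arg = P[0, 0] * xx**2 + 2 * P[0, 1] * xx * yy + P[1, 1] * yy**2
    return np.exp(-0.5 * arg) / (2 * np.pi * np.sqrt(np.linalg.det(C2)))

x = np.linspace(-6, 6, 32)
xx, yy = np.meshgrid(x, x)
truth = 0.7                                  # true projection angle (rad)
data = render(truth, xx, yy) + 0.001 * rng.normal(size=xx.shape)

angles = np.linspace(0.0, np.pi / 2, 181)    # grid search over the angle
chi2 = [np.sum((data - render(a, xx, yy))**2) for a in angles]
best = angles[int(np.argmin(chi2))]
print(best)                                  # lands close to 0.7
```

With Gaussian noise, minimizing chi-squared is exactly maximizing the likelihood, which is why the recovery is so easy.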


#AstroHackWeek 2015, day 1

We kicked off AstroHackWeek 2015 today, with a huge crowd (some 60-ish people) from all over the world and in all different astrophysics fields (and a range of career stages!). Kyle Barbary (UCB) started the day with an introduction to Python for data analysts and Daniela Huppenkothen (NYU, and the principal organizer of the event) followed with some ideas for exploratory data analysis. They set up Jupyter notebooks for interactive tutorials and the crowd followed along. At some point in the morning, Huppenkothen shocked the crowd by letting them know that admissions to AstroHackWeek had been done (in part) with a random number generator!

In the afternoon, the hacking started: People split up into groups to work on problems brought to the meeting by the participants. Dalya Baron (TAU) and I teamed up to write code to build and project mixtures of Gaussians in preparation for an experiment in classifying and determining the projections (Euler angles) of galaxies. This is the project that Leslie Greengard and I have been discussing at the interface of molecular microscopy and astrophysics; that is, anything we figure out here could be used in many other contexts. By the end of the day we had working mixture-of-Gaussian code and could generate fake images.
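The key convenience is that the projection of a Gaussian is again a Gaussian: drop the line-of-sight component of the mean and keep the in-plane block of the covariance. A minimal sketch of the fake-image generation (my guesses at the ingredients, not our actual code):

```python
import numpy as np

# A 3-d mixture of Gaussians "galaxy": amplitudes, means, covariances.
amps = np.array([1.0, 0.5])
means = np.array([[0.0, 0.0, 0.0], [2.0, 1.0, -1.0]])
covs = np.array([np.eye(3), 0.5 * np.eye(3)])

x = np.linspace(-5, 5, 64)
xx, yy = np.meshgrid(x, x)
image = np.zeros_like(xx)
for amp, mu, C in zip(amps, means, covs):
    C2 = C[:2, :2]                           # project: marginalize out z
    d = np.stack([xx - mu[0], yy - mu[1]], axis=-1)
    P = np.linalg.inv(C2)
    arg = np.einsum("...i,ij,...j->...", d, P, d)
    image += amp * np.exp(-0.5 * arg) / (2 * np.pi * np.sqrt(np.linalg.det(C2)))
print(image.shape)                           # a fake 64x64 image
```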


Singer, Sigworth

Who walked into the SCDA today but Amit Singer (Princeton), a mathematician who works on Cryo-EM and related problems! That was good for Greengard and me; we bombarded him with questions about how the system works, physically. Of course he knew, and showed us a beautiful set of notes from Fred Sigworth (Yale) explaining the physics of the situation. It is beautifully situated at the interface of geometric and physical optics, which will make the whole thing fun.


tomography, detailed balance, exoplanet non-parametrics

After I started to become convinced that we might be able to make progress on the Cryo-EM problem, Greengard described to me an easier (in some ways) problem: cryo-electron tomography. This is similar to Cryo-EM, but the experimenter controls the angles! In principle this should make it easier (and it does), but by mathematical standards the problem is still ill-posed: The device can't be turned to all possible angles, nor even to enough of them to fill out the tomographic information needed for general reconstructions. Of course this doesn't faze me!

Useful conversations with Foreman-Mackey included two interesting subjects. One is that even if you have K different MCMC moves, each of which satisfies detailed balance, it is not guaranteed that a deterministic sequence of them will satisfy detailed balance! That blew me away for a few minutes but then started to make sense. Ish. According to Foreman-Mackey, there is a connection between this issue and the point that a product of symmetric matrices will not necessarily be symmetric!
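The matrix analogy is easy to check numerically (a toy illustration of my own): two matrices can each be symmetric while their product is not, just as individually detailed-balanced moves composed in a fixed order need not preserve detailed balance.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])     # symmetric
B = np.array([[0.0, 1.0], [1.0, 0.0]])     # symmetric
AB = A @ B
print(np.allclose(A, A.T), np.allclose(B, B.T))  # True True
print(np.allclose(AB, AB.T))                     # False: product is not
```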

The other interesting subject with Foreman-Mackey was on exoplanet system modeling. We want to explore some non-parametrics: That is, instead of considering the exoplanet population as a mixture of one-planet, two-planet, three-planet (and so-on) systems, model it just with K-planet systems, where K is very large (or infinite). This model would require having a significant amount of the planet-mass pdf at very low masses (or sizes). Not only might this have many technical advantages, it also accords with our intuitions: After all, I think we (almost) all think that if you have one or two major planets, you probably have many, many minor ones. Like way many.


streams, rhythm, geometry

At group meeting, Ana Bonaca (Yale) told us about inferring the potential and mass distribution in the Milky Way using pairs of cold stellar streams. She seems to find—even in the analysis of fully simulated data sets—somewhat inconsistent inferences from different streams. They aren't truly inconsistent, but they look inconsistent when you view only two parameters at a time (because there are many other nuisance parameters marginalized out). She shows (unsurprisingly) that radial velocity information is extremely valuable.

Brian McFee (NYU) talked about measuring rhythm in recorded music. Not tempo but rhythm. The idea is to look at ratios of time lags between spectrogram features (automatically, of course). He showed some nice demonstrations with things that are like "scale transforms": Like Fourier transforms but in the logarithm of frequency.
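As I understand the trick (a toy illustration of my own, nothing from McFee's actual pipeline): on a logarithmic frequency axis, a tempo change, which rescales all frequencies by a common factor, becomes a pure shift, and shifts are exactly what Fourier-type transforms handle gracefully.

```python
import numpy as np

du = np.log(2) / 70                       # grid spacing: ln 2 = 70 samples
u = np.arange(-300, 301) * du             # log-frequency axis
h = lambda f: np.exp(-0.5 * np.log(f)**2)  # some smooth spectral feature
a = h(np.exp(u))                          # feature at the original tempo
b = h(2 * np.exp(u))                      # same feature, frequencies doubled
print(np.allclose(b[:-70], a[70:]))       # the rescaling became a pure shift
```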

In the afternoon, Bonaca, Foreman-Mackey, and I discussed the relationships between dynamics and geometry and statistics. I gave a very powerful argument about why sampling is hard in high dimensions, and then immediately forgot what I said before writing it down. We discussed new MCMC methods, including Foreman-Mackey's proposals for Hamiltonian MCMC in an ensemble.


deproject all the galaxies!

In the morning at the Simons Center for Data Analysis, Brian Weitzner (JHU) gave a nice talk about determining structures of antibodies, and predicting structures of novel antibodies for drug design and so on. I learned many things in the talk, one of which is that there are only 100,000 known protein structures. That might sound like a big number, but compare it to the number of known protein sequences! Another thing I learned is that it is often possible to get a good guess at a structure for a (part of a) protein by looking at the (known) structures of similar sequences. The cool thing is that they (the large community or communities) have developed very good “energy functions” or scores for ranking (and thus optimizing) conformations; these are obviously approximations to the real physics, but seem to work extremely well.

In the rest of my research day, I batted around issues with diffraction microscopy, ptychography, and cryo-electron microscopy imaging with Leslie Greengard. We talked about a very non-trivial issue for the first two, which is that there can be very different functions with identical squared-moduli of their Fourier transforms, or at least we think there must be. There sure are in one dimension. What about in three dimensions? We talked about an exciting prospect for the latter, which is that the cryo-EM problem is very similar to an old idea I had of deprojecting all galaxy images ever taken. “Cryo-EM for galaxies!”
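In one dimension the ambiguity is easy to demonstrate numerically (a sketch of my own): a real signal and its mirror image are different functions with identical squared-modulus Fourier transforms (shifts of the signal behave the same way).

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=128)
g = f[::-1]                          # mirror image: a different signal
Pf = np.abs(np.fft.fft(f))**2        # squared modulus of the transform
Pg = np.abs(np.fft.fft(g))**2
print(np.allclose(Pf, Pg))           # True: identical power spectra
print(np.allclose(f, g))             # False: the signals differ
```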


sabbatical planning

I spent my research time today planning for my sabbatical. I have too many things to write, so I have to cut down the list. How to cut it down? I have never been good at anything other than “Look: squirrel!” (Meaning: ultra short-term decision making!)


computational astrophysics

I participated today in a panel at the Simons Foundation about the present and future of computational astrophysics. We talked about challenges in simulating the Universe, the Milky Way, stars, and exoplanets (formation and atmospheres). We also talked about challenges in comparing theory to data, where the theory is computational and the data sets are large. We also talked about the point that sometimes computation can be used to solve fundamental physics and applied math questions. We all agreed that an Advanced-LIGO detection of a gravitational radiation source could have a huge impact on what we all think about. It was a great day, with Bryan (Columbia), Hernquist (Harvard), Ho (CMU), MacFadyen (NYU), Spergel (Princeton), Stone (Princeton), and Quataert (Berkeley) all participating, plus people from the Simons Foundation.


transforms of mixtures of Gaussians

I spent the day writing code to create mixtures of Gaussians and (importantly) their Fourier transforms. I can't count the number of times I have written mixtures-of-Gaussians code! But each use case is at least slightly different. Today the application is diffraction microscopy. I want to explore bases other than the standard grid-of-pixels basis.
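The reason mixtures of Gaussians are pleasant here is that their Fourier transforms are analytic. A sketch of the identity I'm leaning on (my own illustration, not my actual code; I use the convention F(k) = ∫ f(x) exp(i k x) dx):

```python
import numpy as np

amps = np.array([1.0, 0.5])
mus = np.array([-1.0, 2.0])
sigs = np.array([0.7, 1.2])

def mog(x):
    """Evaluate the 1-d mixture of Gaussians."""
    return sum(a * np.exp(-0.5 * ((x - m) / s)**2) / (np.sqrt(2 * np.pi) * s)
               for a, m, s in zip(amps, mus, sigs))

def mog_ft(k):
    """Closed form: each Gaussian transforms to a (complex) Gaussian."""
    return sum(a * np.exp(1j * k * m - 0.5 * (k * s)**2)
               for a, m, s in zip(amps, mus, sigs))

# Check the closed form against brute-force numerical integration.
x = np.linspace(-30.0, 30.0, 20001)
k = 1.3
numeric = np.sum(mog(x) * np.exp(1j * k * x)) * (x[1] - x[0])
print(abs(numeric - mog_ft(k)))      # agreement to high precision
```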

The funny thing about the diffraction-microscopy problem is that it is simultaneously trivial and impossible: It is to infer all the phases of the Fourier transform given only a noisy, censored measurement of its square modulus. All the approaches that work apply very informative priors or regularization. My biggest concern with them is that they often put the most informative part of the prior on the space outside the object. Hence my goal of using a basis that is compact to begin with.

As a teaser and demo, here is an unlabeled set of figures that “test” my code:


space, space, sports, and black holes

Group meeting today was a pleasure. Laura Norén (NYU) talked about ethnography efforts across the Moore–Sloan Data Science Environments, including some analysis of space. This is relevant to my group and also the NYU Center for Data Science. She talked also about the graph of co-authorship that she and a team are compiling, to look at the state of data-science collaborations (especially interdisciplinary ones) before, during, and after the M-S DSE in the three member universities and also comparison universities. There was some excitement about looking at that graph.

Nitya Mandyam-Doddamane (NYU) showed us results on the star-formation rates in galaxies of different optical and ultraviolet colors. She is finding that infrared data from WISE is very informative about hidden star formation, and this changes some conclusions about star formation and environment (local density).

Dun Wang talked about how he is figuring out the pointing of the GALEX satellite by cross-correlating the positions of photons with the positions of stars. This fails at bright magnitudes, probably because of pile-up or saturation. He also showed preliminary results on the sensitivity of the detector, some of which appear to be different from the laboratory calibration values. The long-term goal is a full self-calibration of the satellite.

Dan Cervone (NYU) spoke about statistics problems he has worked on, in oceans and in sports. We talked about sports, of course! He has been working on spatial statistics that explain how basketball players play. We talked about the difference between normative and descriptive approaches. Apparently we are not about to start a betting company!

Daniela Huppenkothen spoke about the outburst this summer of V404 Cygni, a black hole that hasn't had an outburst since 1989. There are many observatories that observed the outburst, and the question (she is interested in) is whether it shows any oscillation frequencies or quasi-periodic oscillations. There are many spurious signals caused by the hardware and observing strategies, but apparently there are some potential signatures that she will show us in the coming weeks.


strong priors for diffraction microscopy, Jupiters

After I asked for more, Greengard sent me two classic papers (here and here) on diffraction imaging (microscopy). These are beautifully written and very clear. So I now understand the problem well, and the standard solution (which is “oversample by a factor of two!”). One interesting issue is that in real experiments a beam stop kills your low-k modes, so you don't get to see the zero (or near-zero) part of the Fourier transform. Most of the heavy lifting in standard approaches is done by setting a zero or null region around the object and requiring that the function go to zero there. That strikes me as odd, and only applicable in some circumstances. So I became inspired to try some complementary approaches.
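To fix ideas, here is a minimal sketch of the standard approach as I understand it: an error-reduction-style iteration of my own devising, not the papers' exact algorithms, with a made-up object, support, and all numbers. It alternates between imposing the measured Fourier moduli and imposing the null region (plus non-negativity) in real space.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32
truth = np.zeros(N)
truth[12:20] = rng.uniform(size=8)          # object inside a known support
modulus = np.abs(np.fft.fft(truth))         # the data: moduli, no phases!

support = np.zeros(N, dtype=bool)
support[12:20] = True

def ft_err(g):
    """Mismatch between the iterate's Fourier moduli and the data."""
    return np.linalg.norm(np.abs(np.fft.fft(g)) - modulus)

g = rng.uniform(size=N)                     # random starting guess
start_err = ft_err(g)
for _ in range(500):
    G = modulus * np.exp(1j * np.angle(np.fft.fft(g)))  # impose the data
    g = np.fft.ifft(G).real
    g[~support] = 0.0                        # impose the null region
    g = np.clip(g, 0.0, None)                # and non-negativity
print(start_err, ft_err(g))                  # the error drops a lot
```

It is exactly the null-region step that carries the prior information I find odd; delete that line and the iteration goes nowhere.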

The day also included a conversation with So Hattori, who is going to re-start our search for very long-period planets in the Kepler data. This is an important project: It is Jupiter, not Earth, that is the driver of the present-day Solar System configuration.


diffraction imaging

Leslie Greengard put me onto thinking about diffraction imaging of non-periodic (non-crystal-lattice) objects. I did some reading and slowly started to get some understanding of how people think about this problem, which is (mathematically) that you observe (something like) the squared-modulus of the Fourier transform of the thing you care about. You have to reconstruct all the (missing) phase information. This requires very strong prior information. Late in the day I had a long conversation with Foreman-Mackey about this and many other things.


long-period planets

I went off the grid for the weekend, but that didn't stop me from working out a probabilistic approach to understanding the population of long-period planets given low signal-to-noise radial-velocity data. This problem was given to me by Heather Knutson (Caltech), who has a great data set, in which many stars show no companion at high detection confidence. The problem is hairy if you permit any star to have any number of long-period planets. It becomes much more tractable if you assume (by fiat) that every star has exactly one or zero. However, this assumption is clearly unphysical!



I spent most of the day at Columbia's Data Science Institute, participating in a workshop on data science in the natural sciences. I learned a huge amount! There were way too many cool things to mention them all here, but here are some personal highlights:

Andrew Gelman (Columbia) talked about the trade-off between spatial resolution and what he called “statistical resolution”; he compared this trade-off to that in political science between conceptual resolution (the number of questions we are trying to ask) and statistical resolution (the confidence with which you can answer those questions). He also talked about distribution (or expectation) propagation algorithms that permit you to analyze your data in parts and combine the inferences, without too much over-working.

Joaquim Goes (Columbia) talked about ocean observing. He pointed out that although the total biomass in the oceans is far smaller than that on land, it cycles faster, so it is extremely important to the total carbon budget (and the natural sequestration of anthropogenic carbon). He talked about the Argo network of ocean float data (I think it is all public!) and using it to model the ocean.

John Wright (Columbia) pointed out that bilinear problems (like those that come up in blind deconvolution and matrix factorization and dictionary methods) are non-convex in principle, but we usually find good solutions in practice. What gives? He has results that in the limit of large data sets, all solutions become transformations of one another; that is, all solutions are good solutions. I am not sure what the conditions are etc., but it is clearly very relevant theory for some of our projects.

There was a student panel moderated by Carly Strasser (Moore Foundation). The students brought up many important issues in data science, one of which is that there are translation issues when you work across disciplinary boundaries. That's something we have been discussing at NYU recently.


Blanton-Hogg group meeting

Blanton–Hogg group meeting re-started today. We began the introduction process that will take at least two meetings! I talked about inferring stellar ages from spectra.

Boris Leistedt spoke about large-scale structure projects he did this summer, including one in which they are finding a good basis for doing power spectrum estimation with weak lensing. He also talked about large-scale structure work he is doing in DES.

Kopytova spoke about her project with me to infer cool-star parameters, marginalizing out issues with calibration or continuum normalization. Blanton asked her to interpret the calibration inferences; that's a good idea!

Chang Hoon Hahn spoke about comparisons of two-point and three-point functions measured in SDSS-IV data with different fiber-collision methods, and the same in simulations. This led to our monthly argument about fiber collisions, which is a problem that sounds so simple and is yet so hard! The problem is that when two galaxies are very nearby on the sky, there is a reduced probability for getting the redshift of one of them. If left uncorrected, these missing data screw up two-point statistics badly (and one-point statistics less badly). Dan Cervone suggested that we look at the missing data literature for guidance. The big issue is that we don't have a good model for the spatial distribution of galaxies, and the best corrections for missing redshifts depend in detail on that spatial distribution! I suggested some data-driven approaches, but there might not be enough data to test or implement them.

Vakili spoke about using likelihood-free inference to do cosmological large-scale structure projects, which currently use a wrong (and difficult to compute) likelihood function. He and Hahn are collaborating on a test-bed to get this kind of inference working; we discussed their first results.

Late in the day, I spoke with Matt Kleban (NYU) about a statistical theory he has for the distribution (number per class) of "classified objects". He seems to have a model with almost no parameters that does a good job of explaining a wide range of data. I discussed with him some methods for testing goodness of fit and also model comparisons with other kinds of models.


first day at the SCDA

Today was my first day at the Simons Center for Data Analysis. I spent a good chunk of the morning discussing projects with Leslie Greengard (SCDA), the director of the place. We discussed cryogenic EM imaging of molecules, in which you get many shadows of the same molecule at different positions and orientations. The inference of the three-dimensional structure involves inferring also all the projection angles and offsets—or does it? We discussed the problem of neuron spike sorting, where potential events observed from an array of electrodes (inserted in a brain) are assigned to neurons. How do you know if you got it right? Many algorithms are stable, but which are correct?

Later in the day, Mike O'Neil joined us and we discussed (among other things) non-negativity as a constraint on problem solving. This is an amazingly informative constraint. For example, if you are doing image reconstruction and you have an array of a thousand by a thousand pixels, the non-negativity constraint removes all but one part in two to the millionth power of your parameter space! That is more informative than any data set you could possibly get. Greengard didn't like this argument: It is a counting argument! And I was reminded of Phillip Stark's (Berkeley) objection to it: He said something to the effect of “No, the constraint is much stronger than that, because it applies everywhere in space, even where you haven't sampled, so it is really an infinite number of constraints”. Greengard showed us some magic results in diffractive imaging where it appears that the non-negativity constraint is doing a huge part of the heavy lifting. This all relates to things I have discussed previously with Bernhard Schölkopf and his team.
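Spelled out, my counting argument is just orthant-counting:

```latex
% A priori each pixel value x_i could be positive or negative, so
% demanding x_i >= 0 for all N pixels keeps one orthant out of 2^N:
\{\, x \in \mathbb{R}^N : x_i \ge 0 \ \forall i \,\}
  \;=\; \text{one of } 2^N \text{ orthants},
\qquad 2^N = 2^{10^6} \ \text{for a } 1000 \times 1000 \ \text{image}.
```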


the entire Milky Way disk!

In yet another low-research day, I talked to Taisiya Kopytova about her projects and her upcoming job search. I also talked to Ness about our APOGEE-3 white paper with Jonathan Bird. I just want to have spectra of all the red giants in the entire disk. Is that too much to ask?


exoplanet false positives; emcee3

In a low-research day, the highlight was a long discussion of current projects with Foreman-Mackey, who is now settled in Seattle. We discussed exoplanet search and populations inference with K2 data. One test of the false-positive rate (the part not generated by easy-to-model astrophysical sources) is to invert the lightcurve and search again. And when you do, sure enough, you find small, long-period exoplanets! That is the beginning of an empirical approach to understanding stellar-variability-induced and instrumental false positives. We also talked about reordering the time-series data, but for long-period planets, this doesn't cover all bases. There are many improvements to Foreman-Mackey's search since his first K2 catalog paper: New basis for the systematic effects, built with the PCP (the robust PCA); new prior on principal component amplitudes for more informative marginalization; automated vetting that makes use of probabilistic models for model comparison. It all looks like it is working.

We also talked about the next generation of the emcee code, dubbed emcee3. The idea is that it is an ensemble sampler, but the stretch move (the basis of emcee now) becomes just one of a huge menu of update steps. We discussed schedules of update methods in mixed schemes. We discussed the problem, after a run with a non-trivial schedule, of figuring out which updates worked best. We discussed what statistics you want to keep for an ensemble sampler to assess performance in general. Homework was assigned.


bad start

Today was the first day of my sabbatical. I got no research done at all!