Astrostatistics and Data Mining, day 2

This morning we had the lab session; I had the students write from scratch a cross-validation code for model comparison. The model and optimizer we were using were not so good, so some of the code didn't work very well, but I learned a lot. I hope the students learned something. We tried to figure out if HD104067 has one companion or two. It was a pleasure to have Robert Lupton (Princeton) in the room, as he was tough on my code and also interjected some very useful stuff. Just before the lab session, Aigrain talked about Gaussian processes, showing some absolutely amazing examples from Rasmussen's book.
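For readers who want to try the exercise, here is a minimal sketch of leave-one-out cross-validation for this kind of model comparison. It is not the code from the lab session. It assumes synthetic radial velocities, Gaussian noise of known size, and companion periods held fixed at made-up values (so each model is linear in its amplitudes); the function names are hypothetical too.

```python
import numpy as np

def design_matrix(t, periods):
    """Columns: a constant offset, then a cos and a sin term for each period."""
    cols = [np.ones_like(t)]
    for period in periods:
        cols.append(np.cos(2 * np.pi * t / period))
        cols.append(np.sin(2 * np.pi * t / period))
    return np.stack(cols, axis=1)

def loo_cv_score(t, v, sigma, periods):
    """Sum of Gaussian log-likelihoods of each point, predicted from all the others."""
    A = design_matrix(t, periods)
    score = 0.0
    for i in range(len(t)):
        keep = np.arange(len(t)) != i
        # Weighted least squares on the training points only.
        coeffs, *_ = np.linalg.lstsq(A[keep] / sigma[keep, None],
                                     v[keep] / sigma[keep], rcond=None)
        prediction = A[i] @ coeffs
        # Held-out log-likelihood; ignores the uncertainty of the fitted coefficients.
        score += (-0.5 * ((v[i] - prediction) / sigma[i]) ** 2
                  - np.log(sigma[i]) - 0.5 * np.log(2 * np.pi))
    return score

# Synthetic data with two signals; the periods are invented, not measured values.
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0.0, 200.0, 60))
sigma = np.full(60, 1.0)
v = (3.0 * np.sin(2 * np.pi * t / 55.8)
     + 2.0 * np.cos(2 * np.pi * t / 13.0)
     + sigma * rng.normal(size=60))

score_one = loo_cv_score(t, v, sigma, [55.8])
score_two = loo_cv_score(t, v, sigma, [55.8, 13.0])
print(score_one, score_two)
```

The model with the higher held-out score predicts unseen data better. In a real problem the periods would have to be fit inside each training fold, which is where the optimizer starts to matter.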

In the afternoon, there were many excellent talks; too many to mention individually. One highlight was Kitaura (Garching), who is very close to achieving the goal I have of performing properly marginalized measures of cosmological density fields. He might be able to take us away from point estimates of the correlation function! Ascasibar (Madrid) showed a very nice visualization of the positions of galaxies (relative to one another) in spectrum space. He is using some simple tools and finding some nice structure; much of the discussion after his talk was about the possibility of finding sharper tools and maybe even more dramatic results. Matthew Graham (Caltech) gave me a shout-out about not trusting meta-data and only working at the raw pixels (when I can). Christlieb (Heidelberg) showed some astounding spectra from large surveys and discussed how hard it might be to find and understand oddities in the next generation of million-spectra surveys. He included APOGEE in his list of important new surveys, to my pleasure.

There has been a bit of Bayes vs non-Bayes arguing at the meeting, and I have tried not to enter into it. I am a Bayesian, but I am also a pragmatist (and I am also willing to concede that the subjectivity of Bayesian inference is an issue for many, and not unreasonably). In the long run, we should be optimizing our long-term future discounted free cash flow.


Astrostatistics and Data Mining, day 1

I gave morning lectures at the summer-school part of this meeting on La Palma, on model specification and model choice. I argued that a model is an approximate specification of the probability of the data. I argued for cross-validation, or, when your priors are informative (which, in astronomy, they very rarely are), the Bayesian evidence. Tomorrow we pair-code.

I was followed by Suzanne Aigrain (Oxford), who talked about time-domain methods, especially ones that permit modeling of stochastic processes. She is leading up to Gaussian processes tomorrow, which is highly relevant to my conversations with exoplanet hunters and with Schiminovich about modeling the intensity variations in eclipsing stars.

In the afternoon the workshop part of the meeting started. One highlight for me was that Anthony Brown (Leiden) spent a good part of his talk about Gaia data on the proposals Lang and I have made for a probabilistic catalog. He is committed to having the raw data and the processing pipelines preserved and curated. His message was simple: If you have the best survey and it is going to stay best for many years (decades in the case of Gaia), then re-analysis of your data will be an important capability for the community. Brown was followed by Berry Holl (Lund), who gave an update on his work to make hypothesis testing possible with the full Gaia-catalog covariance matrix, which is far too large to represent even on disk.

Later in the afternoon, the Heidelberg Gaia group showed extremely good results from support vector machines. As my loyal reader knows, I don't like black-box methods like SVM, but they sure do work well.


models and selection

Spent flights to the Canary Islands working on models and model selection for a summer school here, sponsored by the Gaia community. I fit a bunch of models to some radial-velocity data supplied by Suzanne Aigrain (Oxford), but the good models beat the bad models by so many orders of magnitude that model selection is trivial. Debating what examples to use in the coding sessions. For now all I need to do is prepare a few hours of lecture and discussion material.


image differencing for lensing

I took another break from vacation to meet up with Marshall and Lang in Princeton, where Marshall is visiting. We wrote (or re-wrote) an image differencing code that finds the best mutual convolution kernel by least-squares fitting in a very general (free) basis. It works well, but is not super-fast. Hey reader: What are the best test cases for this? And how do we know if we have beaten or matched the industry standard?
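To make the idea concrete, here is a minimal sketch of a free-basis kernel fit, in which every kernel pixel is its own linear parameter. It is an illustration under assumptions, not the code written in Princeton: it fits a kernel in one direction only (reference to target), uses a synthetic random image, discards the image edges, and all function names are hypothetical.

```python
import numpy as np

def shifted_columns(ref, half):
    """One column per kernel pixel: the reference shifted by (dy, dx), interior only."""
    ny, nx = ref.shape
    cols = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            cols.append(ref[half - dy: ny - half - dy,
                            half - dx: nx - half - dx].ravel())
    return np.stack(cols, axis=1)

def convolve_interior(ref, kernel):
    """Convolve ref with a square, odd-sized kernel; return only the interior pixels."""
    half = kernel.shape[0] // 2
    ny, nx = ref.shape
    out = shifted_columns(ref, half) @ kernel.ravel()
    return out.reshape(ny - 2 * half, nx - 2 * half)

def fit_kernel(ref, target, half):
    """Least-squares kernel (half >= 1) such that ref convolved with it matches target."""
    A = shifted_columns(ref, half)
    b = target[half:-half, half:-half].ravel()
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coeffs.reshape(2 * half + 1, 2 * half + 1)

# Synthetic test: blur a random image with a known kernel and add a little noise.
rng = np.random.default_rng(0)
ref = rng.normal(size=(40, 40))
true_kernel = np.array([[0.0, 0.1, 0.0],
                        [0.1, 0.5, 0.2],
                        [0.0, 0.1, 0.0]])
target = np.zeros_like(ref)
target[1:-1, 1:-1] = (convolve_interior(ref, true_kernel)
                      + 0.01 * rng.normal(size=(38, 38)))

fit = fit_kernel(ref, target, half=1)
difference = target[1:-1, 1:-1] - convolve_interior(ref, fit)
print(np.round(fit, 3))
```

The design matrix has one row per interior pixel and one column per kernel pixel, so it grows quickly with image and kernel size, which is one plausible reason a free basis is not fast.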


dark-matter annihilation

[No posts for a while because after that proposal submission I took a short research vacation. That didn't completely stop me from doing research, of course:]

I read carefully and sent comments to Dmitry Malyshev on his paper on dark-matter annihilation. He does get a signal, but he calls it an upper limit to be conservative. The signal has an interesting amplitude; is it the dark matter? Probably not!


transient literature

I reviewed the literature on transient surveys and projects today. I learned that PTF, CRTS, and Palomar-QUEST (all projects at or near or involving Caltech, I note) all have a very simple design: Find epochs that are significantly out-of-line, follow up, and classify with follow-up observations. It is shoot first, ask questions later, almost literally. This is good because (a) you get the low-hanging fruit very fast, and (b) you aren't model-dependent in your selection. It means, however, that there are necessarily many un-mined and un-discovered events in their data streams, because model-fitting and model-selection methods (when the models are reasonable) are always more sensitive. It also means that Kepler is in many ways more analogous to the projects we are proposing with GALEX.


proposal summary

Schiminovich had the good idea that we summarize our proposal in a table. I did that today, and it helped to be sure. More work is needed!


eclipse shapes

I played hooky from my proposal-writing to write a short piece of code that makes a first-order model of a stellar eclipse, to perform more precise fitting for the white dwarf transits we have found. I have yet to plug the model into our data analysis.
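One plausible reading of "first-order" here is a uniform-brightness stellar disk crossed by an opaque disk on a straight path, with no limb darkening. The sketch below implements that reading; it is an assumption, not the code described above, and the function and parameter names are hypothetical. Lengths are in units of the eclipsed star's radius.

```python
import numpy as np

def overlap_area(d, p):
    """Area shared by a unit-radius disk and a disk of radius p, centers d apart."""
    d = np.abs(np.atleast_1d(np.asarray(d, dtype=float)))
    area = np.zeros_like(d)
    # One disk lies entirely inside the other.
    inside = d <= abs(1.0 - p)
    area[inside] = np.pi * min(1.0, p) ** 2
    # Partial overlap: standard circle-circle intersection (lens) area.
    partial = (~inside) & (d < 1.0 + p)
    dp = d[partial]
    cos1 = np.clip((dp**2 + p**2 - 1.0) / (2.0 * dp * p), -1.0, 1.0)
    cos2 = np.clip((dp**2 + 1.0 - p**2) / (2.0 * dp), -1.0, 1.0)
    kite = (-dp + p + 1.0) * (dp + p - 1.0) * (dp - p + 1.0) * (dp + p + 1.0)
    area[partial] = (p**2 * np.arccos(cos1) + np.arccos(cos2)
                     - 0.5 * np.sqrt(np.maximum(kite, 0.0)))
    return area

def eclipse_flux(t, t_mid, speed, impact, p):
    """Relative flux of a uniform disk occulted by an opaque disk on a straight path.

    speed is in stellar radii per unit time; impact is the minimum projected
    separation in stellar radii; p is the occulter radius in stellar radii.
    """
    x = (np.atleast_1d(t) - t_mid) * speed
    d = np.hypot(x, impact)
    return 1.0 - overlap_area(d, p) / np.pi

times = np.linspace(-2.0, 2.0, 401)
flux = eclipse_flux(times, t_mid=0.0, speed=1.0, impact=0.2, p=0.1)
print(flux.min())
```

With a small occulter fully on the disk, the depth is just p squared; an occulter larger than the star (p > 1) gives a total eclipse.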


priors on fluxes and redshifts

Fadely came in for the day; we discussed how to put objective priors on fluxes and redshifts for galaxies in our hierarchical star–galaxy separation project. In the end we decided to make the prior separable for now. We know this to be wrong, but we want to see if it becomes a dominant problem. In my view, all models are wrong; the only question is whether they are wrong in ways that make your conclusions unreliable! Fortunately, this is an empirical question; we shall see.


what is an image?

Lang and I spent the morning on the early pages of our paper on how to detect a source in multi-epoch imaging. As I have mentioned before, this turns out to be a non-trivial question if you take it very seriously. In order to clarify our answer to the question, we decided to be very specific about our model of an image, starting at the intensity field impinging on the telescope. This is clarifying, to the point that we understood the problem better at the end of our discussions.

I am still a bit confused about the different questions you can be asking: Do you want to know if there is a source in the image, or do you want to know the flux of the source given a proposed position, or do you want to know the position of the source given a proposed flux (or flux limit)? These are all different questions with different answers. Really in most cases that people are detecting sources they want to know "Is there a source near this position that I ought to include in my catalog?" This question has never been properly answered, to my knowledge, in part because it is ill-posed: If you think that flux and position information is always probabilistic (as I do), then there is no fact of the matter about whether any particular source meets catalog requirements (and I am leaving aside whatever is meant by the word "ought").


nothing, really

I would like to say I worked on proposal text today, but I didn't. I did nothing of value (at least according to The Rules)!



I got a bit of proposal writing done today, but mainly during a single uptown subway journey.


photon statistics

I spent my research time today working out things about photon statistics relevant to the proposal Schiminovich and I are writing. Yes, as I expected, unbinned photons are more constraining than (or in special cases equally constraining as) binned photons for comparing hypotheses. That is, as Rix says, binning is sinning.
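A small numerical illustration of the binning claim, under assumptions of my choosing: photons arrive as a Poisson process, the two hypotheses are a constant rate and a sinusoidally modulated rate (all numbers invented), and "unbinned" is approximated by 1 ms cells that almost never hold more than one photon. The quantity computed is the expected log-likelihood ratio in favor of the true (modulated) hypothesis, which can only shrink when cells are merged.

```python
import numpy as np

def expected_log_likelihood_ratio(rate1, rate0, width):
    """Expected ln[L(H1)/L(H0)] for Poisson counts in cells of this width, if H1 is true."""
    mu1 = rate1 * width
    mu0 = rate0 * width
    return np.sum(mu1 * np.log(mu1 / mu0) - (mu1 - mu0))

def rebin(rate, factor):
    """Mean rate over groups of `factor` adjacent fine cells."""
    return rate.reshape(-1, factor).mean(axis=1)

n_fine = 10000
dt = 0.001                                   # 1 ms cells, 10 s in total
t = (np.arange(n_fine) + 0.5) * dt
rate0 = np.full(n_fine, 50.0)                # H0: constant 50 photons per second
rate1 = 50.0 * (1.0 + 0.3 * np.sin(2 * np.pi * t / 0.8))  # H1: 0.8 s period (made up)

info_unbinned = expected_log_likelihood_ratio(rate1, rate0, dt)
info_by_bin = {
    factor: expected_log_likelihood_ratio(rebin(rate1, factor),
                                          rebin(rate0, factor),
                                          dt * factor)
    for factor in (10, 100, 1000)            # 10 ms, 100 ms, and 1 s bins
}
print(info_unbinned, info_by_bin)
```

The loss is negligible while the bins are much shorter than the modulation period and severe once they are comparable to it. The equality case is when the ratio of the two rates is constant across each bin, so that the bin hides nothing that distinguishes the hypotheses.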


star formation, CMB at high ells

Guangtun Zhu (NYU) successfully defended his PhD today. He did much of the spectral extraction and analysis for the enormous PRIMUS survey, which has more than 10^5 good redshifts but was done in 39 nights of Magellan time with a small team. He also can show clearly that about 15 percent of the galaxies on the red sequence in fact have active star formation, and many other things. It was a pleasure, and congratulations Dr Zhu!

In the afternoon, Lyman Page (Princeton) talked about the present and future of CMB experiments. He talked quite a bit about the promise for secondary anisotropies and the gravitational wave background, but the thing that blew me away was the ACT and SPT data he showed: There are clearly 5 peaks in the spherical Fourier transform and two more if you squint. It is absolutely incredible. At very high ells the spectra are dominated by dusty galaxies (and their systematics are non-trivial), which also is extremely interesting for the future of galaxy astrophysics.


anti-protons, stellar halos, Dr Zolotov

In the morning Grant Christopher (NYU) successfully defended his PhD on TeV cosmic rays from Milagro. By cleverly using the Moon's shadow, he can take a spectrum of the cosmic rays, and measure the relative contributions of different species. By this method he has the world's best limit on the anti-proton flux at TeV energies. Lots more can be done; it is a beautiful technique, using the Moon as a kind of inverse pinhole camera!

In the afternoon Adi Zolotov defended her PhD, on the stellar halos of galaxies in realistic simulations. She shows that stellar halos generically have multiple chemically distinct populations, and some of the stars in the halo were born in the host galaxy and then flung out. She also generally finds that there is radial mixing (as there is in the disk) in the halo. At the same time, she makes robust testable predictions, even though she has to live with pragmatic and therefore uncertain simulations: She finds the aspects of the simulations that are most reliable. It is a great piece of work, and although I am her advisor on paper, Willman is her real advisor; it was also a pleasure to have Willman here for the defense.

With all these thesis defenses, including those of my two great students Bovy and Zolotov, it has been a great week. Have I mentioned recently that I love my job?



Wrote proposal text with Schiminovich at our undisclosed location, secured against bad coffee.


Dr Bovy

It is with great pleasure that I announce the membership of Jo Bovy in the community of scholars! He gave a nice summary of his dissertation—trimmed to the one-third of his remarkable publication list that was about dynamics in the Galaxy—and spoke about the future, which looks pretty interesting. He showed his remarkable results on the Milky Way disk velocity-space substructure, and he showed that we will learn a lot about the dynamics of the Milky Way disk from the APOGEE survey, even in the first few nights!


money, pulsars, and iron

In the morning, Schiminovich and I worked on a grant proposal in an undisclosed location. Hint: They have great coffee and great burgers. Over lunch, Andrei Gruzinov (NYU) regaled us with his strong-field electrodynamics, which is a finite-dissipation version, more or less, of force-free electrodynamics. In the afternoon, Jeff Allen (NYU) successfully defended his PhD dissertation about the composition (among other things) of ultra-high energy cosmic rays; by multiple methods with different systematics he finds 60/40 iron/protons, which is maximally confusing, at least to me!