In a low-research day, I got my first view of the new location of the NYU Center for Data Science, in the newly renovated building at 60 Fifth Ave. The space is a mix of permanent, hoteling, and studio space for faculty, researchers, staff, and students, designed to meet very diverse needs and wants. It is cool! I also discussed briefly with Daniela Huppenkothen (NYU) the scope of her first paper on the states of GRS 1915, the black-hole source with extremely complex x-ray timing characteristics.
I have spent part of the summer working with Megan Bedell (Chicago) to see if there is any evidence that radial velocity measurements with the HARPS instrument might be affected by calibration issues, or might be helped by taking some kind of hierarchical approach to calibration. We weren't building that hierarchical model; we were looking to see if there is evidence in the residuals for information that a hierarchical model could latch on to. We found nothing, to my surprise. I think this means that the HARPS pipelines are absolutely awesome. I think they are closed-source, so we can't do much but inspect the output.
Given this, we decided to start looking at stellar diagnostics: if it isn't the instrument calibration, then maybe it is the star itself. We need to ask whether we can see spectral signatures that predict radial velocity. This is a very general causal formulation of the problem: We do not expect a star's spectrum to vary with the phase of an exoplanet's orbit (unless it is a very hot planet!), so if anything about the spectrum predicts the radial velocity, we have something to latch on to. The idea is that we might see the spectral signature of hot up-welling or cold down-welling at the stellar surface. There is much work in this area, but I am not sure that anyone has done anything truly data-driven (in the style, for example, of The Cannon). We discussed first steps towards doing that, with Bedell assigned plotting tasks, and me writing down some methodological ideas.

Over lunch, Boris Leistedt and I caught up on all the various projects we like to discuss. He has had the breakthrough that, if you build a proper generative model for galaxy imaging data, you don't need spectroscopic training sets, nor good galaxy spectral models, to get good photometric redshifts. The idea is that once you have multi-band photometry, you can predict the appearance of any observed galaxy as it would appear at any other redshift, using a flexible, non-parametric SED model that isn't tied to any physical galaxy model. We use all of, but only, what we believe about how redshift works, physically. Most machine-learning methods aren't required to get the redshift physics right, and most template-based models assume lots of auxiliary things about stars and stellar populations and dust. We also realized that, if done correctly, this method could subsume the cross-correlation redshifts that the LSST project is excited about.
I had the pleasure today of reading two draft papers, one by Dun Wang on our alternative to difference imaging, based on our data-driven pixel-level model of the Kepler K2 data, and the other by Huanian Zhang (Arizona) on H-alpha emission from the outskirts of distant galaxies. Wang's paper shows (what I believe to be) the most precise image differences ever created. Of course we had amazing data to start with! But his method for image differencing is unusual: it doesn't require a model of either PSF, nor of the difference between them. It just empirically figures out which linear combinations of pixels predict each pixel in the target image, using the other images in the campaign to determine these predictor combinations. It works very well and has been used to find microlensing events in the K2C9 data, but it has the disadvantage that it needs to run on a variability campaign; it can't be run on just two images.
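For concreteness, here is a toy numerical sketch of the regression idea at the heart of this kind of method (my own illustration, not Wang's actual code, and ignoring his careful train/test splitting in time, which matters for not fitting out real signals): predict one pixel's time series as a ridge-regularized linear combination of other pixels' time series, and treat the residual as the "difference".

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: flux time series for 50 pixels over 200 epochs, sharing a
# common systematic trend (a hypothetical stand-in for K2 pointing drift).
n_epochs, n_pixels = 200, 50
common = np.sin(np.linspace(0.0, 10.0, n_epochs))
fluxes = common[:, None] + 0.05 * rng.standard_normal((n_epochs, n_pixels))

target = fluxes[:, 0]        # the pixel we want to predict
predictors = fluxes[:, 1:]   # the other pixels, used as predictors

# Ridge-regularized least squares: find weights w such that
# predictors @ w approximates the target pixel across the campaign.
lam = 1e-3
A = predictors.T @ predictors + lam * np.eye(n_pixels - 1)
b = predictors.T @ target
w = np.linalg.solve(A, b)

prediction = predictors @ w
residual = target - prediction   # shared systematics are removed here
```

The key property the text describes survives even in this toy: no PSF model appears anywhere; the predictor weights are determined purely empirically, which is also why the method needs a whole campaign rather than two images.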
The Zhang paper uses enormous numbers of galaxy-spectrum pairs in the SDSS spectroscopic samples to find H-alpha emission from the outskirts of (or—more precisely—angularly correlated with) nearby galaxies. He detects a signal! And it is 30 times fainter than any previous upper limit. So it is big news, I think, and has implications for the radiation environments of galaxies in the nearby Universe.
My research highlight today was a conversation with MJ Vakili about the paper he wrote this summer about halo occupation and what's known as “assembly bias”. Perhaps the most remarkable thing about contemporary cosmology is that the dark-matter-only simulations do a great job of explaining the large-scale structure in the galaxy distribution, despite the fact that we don't understand galaxy formation! The connection is a “halo occupation function” that puts galaxies into halos. It turns out that incredibly simple prescriptions work.
I have always been suspicious about halo occupation, because galaxy halos are not fundamental objects in gravity or cosmology; they are defined by a prescription, running on the output of a simulation. That is, they are just effective crutches, used for convenience. There is no reason to ascribe any reality to a halo (or a sub-halo, or anything of the sort). Really there is just a density field! However, empirically, the halo description of the Universe has been both easy and useful.
Now that cosmology is seeking ever higher precision, work has started along the lines of asking what halo properties (mass, velocity amplitude, concentration, and so on) are relevant to the galaxies that form within them. The answer from the data seems to be that mass is the main driving factor. The community has expected a bias or occupation that depends on the time of formation of the halo (which itself relates to the halo concentration parameter). Vakili has been testing this, and the main punchline is that if the effect is there, it is a small one! It is a great result, and he is nearly ready to submit.
My question is: Can we step out of the halo box and consider all the ways we might put galaxies into the dark-matter field? Could the data tell us what is most relevant?
In a day short of research (because: getting ready to teach again!), I spent some time working with the Simons Foundation to prepare for the #GaiaSprint, which is coming up in 8 weeks. After that I had lunch with Ekta Patel (Arizona), who has been working on the dynamics of the Local Group, and especially understanding the orbits of M31 and M33.
In the morning I had a long and overdue conversation with Alex Malz, who is attempting to determine galaxy one-point statistics given probabilistic photometric redshift information. That is, each galaxy (as in, say, the LSST plan and some SDSS outputs) is given a posterior probability over redshifts rather than a strict redshift determination. How are these responsibly used? It turns out that the answer is not trivial: They have to be incorporated into a hierarchical inference, in which the (often implicit) interim priors used to make the p(z) outputs are replaced by a model for the distribution of galaxies. That requires (a) the mathematics of probability, and (b) knowing the interim priors. One big piece of advice (or warning) we have for current and future surveys is this: Don't produce probabilistic redshifts unless you can produce the exact priors too! Some photometric-redshift schemes don't even really know what their priors are, and this is death.
In the afternoon, I discussed various projects with John Moustakas (Siena), around Gaia and large galaxies. He mentioned that he is creating a diameter-limited catalog and atlas of galaxies. I am very interested in this, but we had to part ways before discussing further.
Coming back in from a short vacation, it was a low research day. John Moustakas (Siena) is in town this week, and we discussed the state of some of his projects. In particular, we discussed Guangtun Zhu's paper on discrete optimization for making archetype sets, and the awesomeness of that tool, which Moustakas and I intend to use in various galaxy contexts.
On the airplane home from MPIA (boo hoo!) I wrote the shortest piece of code I could that can take interim posterior p(z) redshift probability distributions from a set of galaxies and produce N(z) (and maybe other one-point statistics). I can make pathological cases in which there are terrible photometric-redshift outliers that are structured to cause havoc for N(z). But as long as you have a good generative model (and that is a big ask, I hate to admit), and as long as the providers of the p(z) information also provide the effective prior on z that was used to generate the p(z)s (another big ask, apparently), you can infer the true N(z) surprisingly accurately. This is work with Alex Malz and Boris Leistedt.
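To make the workflow concrete, here is a toy sketch of this kind of inference (my own illustration, not the actual Malz and Leistedt code): each galaxy supplies posterior samples in redshift generated under a known interim prior, and an expectation-maximization loop recovers the histogram amplitudes of N(z), rather than naively stacking the p(z)s.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (all numbers hypothetical): the true N(z) is two bumps.
n_gal, n_samp = 500, 64
true_z = np.concatenate([rng.normal(0.5, 0.05, n_gal // 2),
                         rng.normal(1.0, 0.08, n_gal // 2)])

# Each galaxy's interim p(z): posterior samples generated under a flat
# interim prior, so the importance weights 1 / prior(z) are constant.
# With a non-flat interim prior, each sample's bin contribution must be
# divided by that prior evaluated at the sample; this is why surveys
# must publish their exact priors.
z_samples = true_z[:, None] + 0.1 * rng.standard_normal((n_gal, n_samp))

bins = np.linspace(0.0, 1.6, 17)
n_bins = len(bins) - 1

# W[i, k]: fraction of galaxy i's samples landing in redshift bin k.
W = np.stack([np.histogram(z, bins=bins)[0] / n_samp for z in z_samples])

# Expectation-maximization for the bin amplitudes f_k of N(z).
f = np.ones(n_bins) / n_bins
for _ in range(200):
    r = f * W                                    # per-galaxy responsibilities
    r /= r.sum(axis=1, keepdims=True) + 1e-300   # normalize each galaxy
    f = r.mean(axis=0)
f /= f.sum()
```

The resulting f is sharper than the stack of the individual p(z)s, because the hierarchical step deconvolves the per-galaxy redshift uncertainty; the pathological-outlier cases mentioned above are exactly the ones where this normalization step does real work.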
Christina Eilers (MPIA) and I spent a long time today pair-coding her extension to The Cannon, in which we marginalize over the true labels of the training data, under the assumption of small, known, Gaussian label noise. Our job was to vastly speed up optimization by getting correct derivatives (the gradient) of the objective function (a likelihood function) with respect to the parameters, and by inserting this into a proper optimizer. We built tests, did some experimental coding, and then fully succeeded! Eilers's Cannon is slower than other implementations, but more scientifically conservative. We showed by the end of the day that the model becomes a better fit to the data as the label variances are made realistic. Stars really do have simple spectra!
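The workflow here is generic, so a minimal sketch may be useful (a toy quadratic objective, not the actual Cannon likelihood): write the analytic gradient, test it against central finite differences, and only then hand both to a proper optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for a negative log-likelihood: a positive-definite
# quadratic form, whose minimum solves A x = b.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
A = A @ A.T + 5.0 * np.eye(5)
b = rng.standard_normal(5)

def objective(x):
    return 0.5 * x @ A @ x - b @ x

def gradient(x):
    # Analytic gradient of the objective above.
    return A @ x - b

# Test the analytic gradient against central finite differences
# before trusting it inside the optimizer.
x0 = rng.standard_normal(5)
eps = 1e-6
fd = np.array([(objective(x0 + eps * e) - objective(x0 - eps * e)) / (2 * eps)
               for e in np.eye(5)])
assert np.allclose(fd, gradient(x0), atol=1e-5)

# With an exact gradient supplied via jac=, convergence is fast.
result = minimize(objective, x0, jac=gradient, method="BFGS")
```

The finite-difference check is the "tests" step: cheap to write, and it catches sign and indexing errors in hand-derived gradients before they silently slow or break the optimization.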
While we were working, Anna Y. Q. Ho and Sven Buder (MPIA) were discovering non-trivial covariances between stellar radial velocity (or possibly radial-velocity mis-estimation) and alpha abundances, with Ho working in LAMOST data and Buder working in GALAH data. Both are using The Cannon. After some investigation, we think the issue is probably related to the collision of alpha-estimating spectral features with ISM and telluric features. We discussed methods for mitigation, which range from censoring the data at one end to fully modeling the velocity along with the model parameters at the other.
Late in the day, I finished my response to the referee and submitted it.
At Galaxy Coffee, Ben Weiner (Arizona) gave a talk about his great project (with many collaborators) to study very faint satellites around Milky-Way-like galaxies using overwhelming force: They are taking spectra of everything within the projected virial radius! That's thousands of targets, among which (for a typical parent galaxy), they find a handful of satellites. The punchline is that the Milky Way appears to be typical in its number of satellites, though there is certainly a range.
I spoke with Glenn Van de Ven (MPIA) about the possibility that he could upgrade his state-of-the-art Schwarzschild modeling of external-galaxy integral field data to something that would do chemo-dynamics. I suggested ways that he could keep the problem convex, but use regularization to reduce model complexity. We discussed baby steps towards the goal.
I also wrote a title and abstract (paper scope) for Adrian Price-Whelan and started on the same for Andy Casey.
As always, MPIA Milky Way group meeting was a pleasure today, featuring short discussions led by Nicholas Martin (Strasbourg), Adrian Price-Whelan, and Andy Casey. Casey showed his developments on The Cannon and applications to new surveys. Price-Whelan spoke about our ability to see possible warps (coherent responses) in the Milky Way disk from interactions with satellites. Martin showed amazing color-magnitude diagrams of stars in Andromeda satellite galaxies. So. Much. Detail.
Chaos reigned around me. Jonathan Bird and Melissa Ness worked on the Disco concept. Anna Y. Q. Ho, working on a suggestion from Casey, found Li lines in (a rare subsample of) LAMOST giants, leading to a whole new insta-project on Li. Price-Whelan figured out multiple methods for initializing and running MCMC on our single-line binary stars, initializing from either the prior or from literature orbits. It looks like many (or maybe all) of the APOGEE variable-velocity stars have multiple qualitatively different but nonetheless plausible orbital solutions. Casey and I conceived of a totally new way to build The Cannon as a local model for every test-step object; a non-parametric Cannon if you wish? I spoke with Jeroen Bouwman (MPIA) about his (very promising) work using Dun Wang's Causal Pixel Model to fit the Spitzer data on transit spectroscopy for a hot Jupiter.
Anna Y. Q. Ho is in town to finish two—yes, two—papers on what can be learned about stellar properties from (relatively) low-resolution LAMOST spectroscopy. She has amazing results on ages and chemical abundances, which challenge long-held beliefs about what can be done at medium to low resolution. One of her two papers is about using C and N abundances to infer red-giant ages, as we did with APOGEE and The Cannon earlier. Ho and I met with Rix today to discuss error propagation from abundances to ages, and all the possible sources of scatter, including the unknown unknowns.
Adrian Price-Whelan started running our probabilistic inference of single-line spectroscopic binaries on the Troup et al sample. We had to complexify our noise model, since clearly there are variations larger than the error bars. We also had to reparameterize our binary-star parameters to a better set. In this process, we wanted to go from a phase angle to a time and back. Going from time to phase angle is a numerically stable mod() operation. Going from phase angle back to time can naively involve adding and subtracting huge numbers. We re-cast the function so no large subtractions ever happen. That was not totally trivial!
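One way to arrange that computation (an illustrative recasting, not necessarily the exact one we used) is to carry the integer cycle count explicitly, so that the large reference epoch is only added back at the very end and no catastrophic cancellation occurs.

```python
import numpy as np

# Illustrative values: a period in days and a large reference epoch (BJD).
P = 127.3
t0 = 2455000.0

def time_to_phase(t, t0=t0, P=P):
    # Numerically stable: mod keeps the result in [0, 2 pi).
    return 2.0 * np.pi * np.mod((t - t0) / P, 1.0)

def phase_to_time(phi, t_near, t0=t0, P=P):
    # Recover the unique time near t_near with phase phi. Work with the
    # small time-within-cycle offset dt and an integer cycle count n,
    # instead of naively adding huge multiples of P to phi * P / (2 pi).
    dt = phi / (2.0 * np.pi) * P          # small: in [0, P)
    n = np.round((t_near - t0 - dt) / P)  # exact integer cycle count
    return t0 + n * P + dt

t = 2455432.1
phi = time_to_phase(t)
t_back = phase_to_time(phi, t_near=t)   # recovers t to floating-point precision
```

The point is that every intermediate quantity except the final sum is of order P or smaller, so double precision is never asked to resolve a fraction of a phase against a number like 2455000.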
Late in the day, Melissa Ness and Jonathan Bird interviewed Price-Whelan about ideas potentially going into the nascent Disco proposal.
All hell broke loose in Heidelberg today, as Andy Casey got done with his meeting downtown, Jonathan Bird (Vanderbilt) showed up to work on the Disco proposal for the next big thing with the SDSS hardware, Ben Weiner (Arizona) showed up to talk science, and Anna Ho came in to finish her new set of papers about the LAMOST data. And even with these distractions, Price-Whelan and I “decided” (I use scare quotes because our decision was heavily influenced by Rix!) to work on the single-line binaries in the APOGEE data.
Price-Whelan and I joined up my celestial mechanics code from June with the simulated APOGEE single-visit velocities through a likelihood function and got MCMC sampling working. We showed that you can say significant things about binary stars even with only a few observations; you don't need full coverage of the orbit to make substantial statements. Though it sure helps if you want very specific orbital parameters! Tomorrow we will hit real data; we will have to put in a noise model and some outlier modeling (probably).
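A stripped-down version of that pipeline can be sketched in a few lines (a toy circular-orbit model with a hand-rolled Metropolis-Hastings sampler; the real work uses full celestial mechanics and the actual APOGEE cadence, and all numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy circular-orbit radial-velocity model: semi-amplitude K, period P,
# phase phi (a stand-in for the full celestial-mechanics code).
def rv_model(t, K, P, phi):
    return K * np.sin(2.0 * np.pi * t / P + phi)

# A handful of sparse "single-visit" epochs with known Gaussian errors.
t_obs = np.array([3.1, 11.0, 17.4, 29.5, 42.0, 55.9, 68.3, 80.2])
sigma = 0.3
v_obs = rv_model(t_obs, K=5.0, P=30.0, phi=1.0)
v_obs = v_obs + sigma * rng.standard_normal(t_obs.size)

def log_like(theta):
    K, P, phi = theta
    if K <= 0.0 or P <= 0.0:
        return -np.inf
    resid = v_obs - rv_model(t_obs, K, P, phi)
    return -0.5 * np.sum((resid / sigma) ** 2)

# Minimal Metropolis-Hastings sampler, initialized near a plausible orbit.
theta = np.array([4.5, 29.5, 0.8])
lp = log_like(theta)
chain = []
for _ in range(20000):
    proposal = theta + rng.normal(0.0, [0.1, 0.1, 0.05])
    lp_prop = log_like(proposal)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = proposal, lp_prop
    chain.append(theta.copy())
chain = np.array(chain)[5000:]   # discard burn-in
```

Even this toy shows the qualitative point: eight visits constrain the semi-amplitude well, while the period posterior can be multimodal when the epochs are sparse, which is why initialization (from the prior or from literature orbits) matters so much.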
Bird and I discussed the high-level point of the Disco proposal: We need it to express, clearly, an idea (or set of ideas) that is worth many tens of millions of dollars. That's hard; the project is very valuable and will have huge impact per dollar, but crystallizing a complex project into one bullet point is never trivial.
I worked on the weekend to get my “Chemical tagging can work” paper ready for resubmission to the ApJ, incorporating referee and co-author comments, both of which made the paper much better. By Sunday it was good enough to send to the co-authors for final comments. In case it is some comfort to my loyal reader, it took me a full six months to get to this, which is embarrassing, but normal. And even then—when I sent it to the co-authors—it was missing a paragraph about the abundances in cluster M5. While Andy Casey and I were relaxing in a Heidelberg pub, he (Casey) wrote that final paragraph. I love my job!
Yesterday at Milky Way group meeting, Adrian Price-Whelan brought up the possibility that the halo might be made up of many disrupted globular clusters. Sarah Martell (UNSW) showed up today and said more along these lines, based on chemical arguments. That got me thinking about the birthday paradox: If you have 30 people in the room, you are more than likely to have two that share the same birthday. The implication of this paradox for the Galaxy is the following:
Imagine that the Milky Way halo (or, even better, the bulge) is made up of 1000 disrupted stellar clusters that fell in. If we look at even 100-ish stars, we would expect to find pairs of stars with identical abundances, with very good confidence. And this confidence can be kept high even if there is a smooth background of stars that doesn't participate in the cluster origin, and even if there are multiple populations in the original clusters. So if we can show that no pairs of stars are chemically identical, we can rule out all of these hypotheses with far less data than we already have in hand. Awesome! I wrote code to check this, but am far from having a real-data test.
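The arithmetic behind the birthday-paradox claim is short enough to show (my own sketch, assuming equally populated clusters):

```python
import numpy as np

def p_shared_pair(n_stars, n_clusters):
    # Birthday-paradox probability that at least two of n_stars were
    # drawn from the same one of n_clusters equally likely clusters:
    # one minus the probability that all n_stars land in distinct clusters.
    i = np.arange(n_stars)
    p_all_distinct = np.prod(1.0 - i / n_clusters)
    return 1.0 - p_all_distinct

# The classic case: 30 people, 365 birthdays, just over 70 percent.
print(p_shared_pair(30, 365))

# 100 stars drawn from 1000 disrupted clusters: a shared pair is
# near-certain (about 99 percent).
print(p_shared_pair(100, 1000))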