#hackAAS number 6

Today was the 6th annual AAS Hack Day, at #AAS231 in Washington DC. (I know it was 6th because of this post.) It was an absolutely great day, organized by Kelle Cruz (CUNY), Meg Schwamb (Gemini), and Jim Davenport (UW & WWU), and sponsored by Northrop Grumman and LSST. The Hack Day has become an integral part of the AAS winter meetings, and it is now a sustainable activity that is easy to organize and sponsor.

My hack for the day was to work on dimensionality and structure in the element-abundance data for Solar Twins (we need a name for this data set) created by Megan Bedell (Flatiron). I was reminded how good it is to bring data sets to the Hack Day! Several others took the data to play with, and Martin Henze (SDSU) went to town, visualizing (in R) the covariances (empirical correlations) among the elements. Some of these correlations are positive, some are negative, and some are tiny. Indeed, his analysis sort-of chunks the elements up into blocks that are related in their formation! This is very promising for my long-term goals of obtaining empirical nucleosynthetic yields.

What I did for Hack Day was to visualize the data in the natural (element-abundance) coordinates, and then again in PCA coordinates, where many of the variances really vanish, so the data really are low dimensional (dammit; my loyal reader knows that I don't want this to be true). And then I also visualized in random orthonormal coordinates (which was fun); this shows that the low-variance PCA space is indeed a very rare or hard-to-find subspace in the full 33-dimensional element space. I also visualized some rotations in the space, which forced me to do some 33-dimensional geometry, which is a bit challenging in a room of enthusiastic hackers!
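A tiny synthetic sketch of that comparison (the data here are fake stand-ins for the real abundance table, and all the names are mine): project the same point cloud onto its principal axes and onto a random orthonormal basis, then compare the per-direction variances.

```python
import numpy as np

rng = np.random.default_rng(17)

# Synthetic stand-in for the abundance table: n stars in a
# 33-dimensional element space, lying near a 4-dimensional subspace.
n, d, k = 80, 33, 4
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
X += 0.01 * rng.normal(size=(n, d))  # small off-subspace scatter
X -= X.mean(axis=0)

# PCA via SVD: variances along the principal axes.
_, s, _ = np.linalg.svd(X, full_matrices=False)
pca_var = s ** 2 / n

# A random orthonormal basis via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
rand_var = np.var(X @ Q, axis=0)
```

In the PCA basis the variances collapse after the true dimensionality, while the random rotation smears variance across all 33 directions, which is why the low-variance subspace is so hard to find by chance.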

But so much happened at the hack day. There was a project (led by aforementioned Bedell) to make interactive web-based plots of the exoplanet systems, to visualize multiplicity, insolation, and stellar properties. There was a project to find the “Kevin Bacon of astronomy” which was obviously flawed, since it didn't identify yours truly. But it did make a huge network graph of astronomers who use ORCID. Get using that, people! Erik Tollerud did more work on his hack-of-hacks to build great tools for collecting and recording hacks, but he also was working on a white paper for NASA about software licensing. I gave him some content and co-signed it. MIT for the win. Foreman-Mackey led a set of hacks in which astronomers learned how to use TensorFlow, which uses NVIDIA GPUs for insane speed-up in linear-algebra operations. Usually people use TensorFlow for machine learning, but it is a full linear-algebra library, with auto-differentiation baked in.

The AAS-WWT people were in the house, and Jonathan Fay, as per usual at Hack Days (what a hacker!), pushed a substantial change to the software, to make it understand and visualize velocity maps. Another group (including Schwamb, mentioned above) visualized sky polygons in WWT, and used a citizen-science K2 discovery as its test case for visualizing a telescope focal-plane footprint. There were nice hacks with APIs, with people learning to use the NASA Astrophysics Data System API and Virtual Observatory APIs, and getting different APIs to talk together. One hack was to visualize Julia Sets using Julia! It took the room a few minutes to get the joke, but the visualization was great, and very few lines of code in the end. And there were at least two sewing hacks.

None of this does justice: It was a packed room, about a third of the participants were completely new to Hack Days, and the atmosphere and energy were great. I love my job!


nonlinear dimensionality reduction

Today Brice Ménard (JHU) showed me a new dimensionality-reduction method by him and Dalya Baron. He claims it has no free parameters and good performance. But no paper yet!


streams and dust

At Stars Group meeting, Lauren Anderson (Flatiron) and Denis Erkal (Surrey) both spoke about stellar streams. Anderson spoke about finding them with a new search technique that looks at proper motions for stars found in great-circle segments; this is being prepared for Gaia DR2. Erkal spoke about constraining the Milky-Way potential using only configurational aspects of streams: If the plane defined by a small stream segment doesn't (locally) contain the Galactic Center, there must be asphericity in the gravitational potential.

Late in the day, Yi-Kuan Chiang (JHU) showed me absolutely beautiful results cross-correlating various Milky-Way dust maps with high-redshift objects. There ought to be no correlations, at least in the low-extinction regions. But there are correlations, and it is because the dust maps are all contaminated by high-redshift dust in the extragalactic objects themselves (or objects correlated therewith). He can conclude nice things about different dust-map techniques. We discussed (inconclusively) whether his work could be turned around and used to improve Milky-Way map-making.

loose talk of #aliens

[Warning: This post is not, strictly, a research post. It is a response to events in the astronomical community in the recent past.] Word on the street (I can't find out, because it is not open) is that some argument broke out on the Facebook(tm) astronomy group about loose discussion on the internets about #aliens, and things like the Boyajian star or the 'Oumuamua asteroid. Since I am partially responsible for this loose talk, here is my position:

First, I want to separate informal discussion (like on twitter or blogs) from formal discussion in scientific papers (like what might be submitted to arXiv) from press releases. These are three different things, and I think we need to treat them differently. Second, I am going to assert that it is reasonable and normal for astronomers to discuss in scientific papers (sometimes) the possibility that there is alien life or alien technology with visible impact on observations. Third, I am going to presume that the non-expert public deserves our complete respect and cooperation. If you disagree with any of these things, my argument might not appeal to you.

On the second assumption (aliens are worthy of discussion), you can ask yourself: Was it a reasonable use of telescope time to look at 'Oumuamua in the radio, to search for technological radio transmissions? If you think that this was a reasonable thing to do with our resources, then you agree with the second assumption. Similarly if you think SETI is worth doing. If you don't think these uses of telescopes are reasonable—and it is understandable and justifiable not to—then you might think all talk of aliens is illegitimate. Fine. But I think that most of us think that it is legitimate to study SETI and related matters. I certainly do.

Now if we accept the second assumption, and if we accept the third assumption (and I really don't have any time for you if you don't accept the third assumption), then I think it is legitimate (and perhaps even ethically required) that we have our discussions about aliens out in the open, visible to the public! The argument (that we shouldn't be talking about such things) appears to be: “Some people (and in particular some news outlets) go ape-shit when there is talk of aliens, so we all need to stop talking about aliens!” But now let's move this into another context: Imagine someone in a role in which they serve the public and are partially responsible for X saying: “Some people (and in particular some news outlets) go ape-shit when there is talk of X, so we all need to stop talking about X!” Obviously that would be a completely unethical position for any public servant to take. And it wouldn't just be fodder for conspiracy theorists, it would actually be evidence of a conspiracy.

Imagine we, as a community, decided to only discuss alien technology in private, and never in public. Would that help or hurt with the wild speculation or ape-shit reactions? In the long run, I think it would hurt us, and hurt the public, and be unethical to boot. Informal discussions of all matters of importance to astronomers are legitimately held in the open. We are public servants, ultimately.

Now, I have two caveats to this. The first is that it is possible for papers and press releases and news articles to be irresponsible about their discussion of aliens. For example, the reportage claiming (example here)—and it may originate in the paper itself—that the reddening observed in the Boyajian Star rules out alien megastructures was debatably irresponsible in two ways. For one, it implied that the megastructure hypothesis was a leading hypothesis, which it was not, and for two, it implied that the megastructure hypothesis was specific enough to be ruled out by reddening, which it wasn't. Indeed, the chatter on Twitter(tm) led to questions about whether aliens could ever be ruled out by observations, and that is an interesting question, which relates to the second assumption (aliens are worthy of discussion) given above. Either way, the paper and resulting press implied that the observational result constrained aliens, which it did not; the posterior probability of aliens (extremely low to begin with) is almost completely unchanged by the observations in that paper. To imply otherwise is to imply that alien technology is a mature scientific hypothesis, which it isn't.

Note, in the above paragraph, that I hold papers and press releases to a higher standard than loose, informal discussion! That is my first assumption, above. You might disagree with it, but note that it would be essentially completely chilling to all informal, open discussion of science if we required refereed-publication-quality backing for anything we say, anywhere. It would effectively re-create the conspiracy that I reject above.

I don't mean to be too critical here: The Boyajian-star paper was overall extremely responsible and careful and sensible. As are many other papers about planet results, even ones that end up getting illustrated with an artist's impression of a rocky planet with ocean shores and/or raging surf. If I have a complaint about exoplanet science as a community (and I count myself a member of this community; I am not casting blame elsewhere), it is about the paper-to-press interface, where artist's conceptions and small signals are amplified into luscious and misleading story-time by perfectly sensible reporters. We (as a community, and as a set of funded projects) are complicit in this.

The second caveat to what I have written above is that I (for one) and many others talk on Twitter(tm) with tongue in cheek and with sarcasm, irony, and exaggeration. It takes knowledge of the medium, of scientists, and of the individuals involved to decode it properly. When I tweeted that it was “likely” that 'Oumuamua was an alien spaceship, I was obviously (to me) exaggerating, for the purposes of having a fun and interesting discussion. And indeed, the asteroid looks different in color, shape, and spin rate (and maybe therefore composition and tensile strength) from other asteroids in our own Solar System. But it might have been irresponsible to use my exaggeration and humor when it comes to aliens, because aliens do set off some people, especially those who might not know the conventions of scientists and twitter. I take that criticism, and I'll try to be more careful.

One last point: The underlying idea of those who say we should keep alien discussion behind closed doors (or cut it off completely) is at least partly that the public can't handle it. I find that attitude disturbing and wrong: In my experience, ordinary people are very wise readers of the news, with good sense and responsibility, and they are just as good at reading arguments on Twitter(tm). The fact that there are some exceptions—or that the Daily Mail is an irresponsible news outlet—does not change the truth of my third assumption (people deserve our respect). We should just ignore and deprecate irresponsible news, and continue to have our discussions out in the open!

In the long run, astronomy will benefit from openness, honesty, and carefully circumscribed reporting of goals and results. We won't benefit from hiding our legitimate scientific discussions from the public for fear that they will be misinterpreted.


stream search

Lauren Anderson (Flatiron) and Vasily Belokurov (Cambridge) have been developing a (relatively) model-free search for compact structures in phase-space, to be used on Gaia DR2 in April. Anderson showed me the current results, which highlight many possible streams, in pre-Gaia test data from SDSS (created by Sergey Koposov at CMU). She requires that the phase-space over-density be consistent with a section of a great circle, at a (somewhat) well-defined Galactocentric distance (because: reflex proper-motion from the Sun's velocity), moving in a direction along the circle. She finds lots of putative streams, but many of them seem to highlight edges and issues with the spatial selection function, so there are still issues to work out. The nice thing is that if we can get something working here, it will crush on the forthcoming Gaia data, which will have a much simpler selection function.
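The great-circle consistency requirement can be sketched like so (a toy version, not Anderson's actual code): stars on a great circle have unit vectors that lie in a plane through the origin of the unit sphere, so the smallest eigenvalue of their direction second-moment matrix measures the out-of-plane scatter.

```python
import numpy as np

def out_of_plane_scatter(unit_vectors):
    """Smallest eigenvalue of the second-moment matrix of sky directions.

    unit_vectors: (n_stars, 3) array of unit 3-vectors. Stars lying on
    a great circle sit in a plane through the origin, so this number is
    near zero for a clean great-circle segment and of order 1/3 for an
    isotropic blob of directions.
    """
    M = unit_vectors.T @ unit_vectors / len(unit_vectors)
    return np.linalg.eigvalsh(M)[0]
```

A real search would additionally demand consistent proper motions along the circle; this only captures the purely positional part of the test.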


hand-written optimizers

Megan Bedell (Flatiron) and I are working on the optimizers underlying our stellar radial-velocity determination pipeline, code-named wobblé. There were serious bugs! But by the end of the day it looked like everything was optimizing properly.
Note to self: Don't hand-write an optimizer unless it is absolutely necessary!


M-dwarf spectral models; time-variable spectra

I spoke to Jessica Birky (UCSD) today about her #AAS231 poster on using The Cannon to label M-dwarf spectra in APOGEE. She has beautiful results, using spectral types from the Burgasser group, and physical labels (temperatures and compositions) from Andrew Mann (Columbia). We discussed things to emphasize on the poster and things to emphasize in discussions with people who come by. For my loyal reader: It will be up at #AAS231 poster session 349 on Thursday January 11.

I also spent some time working on a set of issues around measuring precise radial velocities for stars in the presence of time-variable spectral features in both the star and the tellurics. I worked out derivatives for the spectral model when the telluric absorption is permitted to come from a low-dimensional subspace of spectrum space. I then turned my attention to the ill-posed problem of determining precise radial velocities when the star changes its spectral shape (or line strengths or line positions, etc). In the case that the stellar spectrum changes completely randomly, and independently or separably from the stellar velocity, I believe (oddly) that the problem is going to be easy. The problem is that it won't change completely separably: There will be stellar surface variations that co-vary with spectral changes. This is the reality, and the hardest case, I think.
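A toy version of the telluric part of that model, in log-flux space and with an integer-pixel shift standing in for the Doppler shift (all names and simplifications here are mine, not the actual pipeline): the telluric term is linear in the basis coefficients, so the chi-squared gradient with respect to those coefficients is analytic and easy to check against finite differences.

```python
import numpy as np

def model_lnflux(ln_star, B, c, shift_pix):
    # ln(observed flux) = shifted stellar ln-spectrum plus a telluric
    # term drawn from a low-dimensional subspace spanned by columns of B
    star = np.roll(ln_star, shift_pix)  # toy integer-pixel "Doppler" shift
    return star + B @ c

def chi2_grad_c(ln_obs, ln_star, B, c, shift_pix, ivar):
    # chi^2 = sum(ivar * r^2) with r = data - model, and the model is
    # linear in c, so d(chi^2)/dc = -2 B^T (ivar * r)
    r = ln_obs - model_lnflux(ln_star, B, c, shift_pix)
    return -2.0 * B.T @ (ivar * r)
```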


all papers are about four things

It is pathetic how little work I got done today, given that the International Date Line made today 45 hours long for me. I did get some writing done in my machine-learning opus, and figured out a way to re-structure it to make it more readable. I also thought a bit about the following thing, which came up in discussion with Bedell:

Scientific papers in the astronomy literature are always, and unavoidably, about four things: They are about the Universe or astrophysical phenomena. They are about the data we have on those phenomena. They are about the astronomical literature itself. And they are about the authors themselves. You can't avoid talking about yourself when you write a paper because: subjectivity. All data analyses, frequentist or Bayesian, involve subjective decision-making. You can't avoid talking about the literature, because: relevance. Besides, astronomy is the astronomical literature, in my opinion. And you can't avoid talking about the data, because: accuracy. That is, you must be responding to data in some way or other, even if you are a theorist. And you can't avoid the universe, because otherwise it isn't astronomy!


linear regression and the kernel trick

I spent the day in an undisclosed location, working on linear regression. That is, I am working on the notation and description for linear regression for my opus on machine learning in astronomy. Mid-way through getting it all together, I started to lose faith that I even know what linear regression is—or that what machine learners call linear regression is what I call linear regression! But in my undisclosed location, I don't have a copy of Bishop!

I do have Wikipedia, however, and I spent some time there, reading different descriptions and learning new applications of the kernel trick. I think of it as some kind of “lifting” of the problem to a (far) larger space, but it can also be seen as a redefinition of “proximity” or “similarity” in the data space. That makes sense, because (at base) the kernel trick is a redefinition of the dot product. Stuff to think about, and relevant to many machine-learning methods. In particular, when you apply it to linear regression, you get (more or less) the Gaussian Process.


machine learning for astronomers

I spent a bit of vacation time writing in long-term writing projects. The one I found myself wanting to work on is a long-term project called (for now) “Machine learning for astronomers”, in which I go over the basics, and give contextualized (and unsolicited) advice for using machine-learning methods in astrophysics. One of my principal goals is to criticize many uses that fall into the estimator category, and promote methods that can be built into larger, probabilistic inferences. This deprecates most uses of deep learning, and encourages Gaussian Processes. Interestingly, generative adversarial networks (the new rage) are good in this dichotomy, because they are generators that transform probability densities. But I am starting small, working through the detailed mathematics of five methods which I think are so beautiful and simple, everyone should know them.


testing tacit knowledge about measurements

In astrometry, there is folk knowledge (tacit knowledge?) that the (best possible) uncertainty you can obtain on any measurement of the centroid of a star in an image is proportional to the size (radius or diameter or FWHM) of the point-spread function, and inversely proportional to the signal-to-noise ratio with which the star is detected in the imaging. This makes sense: The sharper a star is, the more precisely you can measure it (provided you are well sampled and so on), and the more data you have, the better you do. These are (as my loyal reader knows) Cramér–Rao bounds, related directly to the Fisher information.

Oddly, in spectroscopy, there is folk knowledge that the best possible uncertainty you can obtain on the radial-velocity of a star is proportional to the square-root of the width (FWHM) of the spectral lines in the spectrum. I was suspicious, but Bedell (Flatiron) demonstrated this today with simulated data. It's true! I was about to resign my job and give up, when we realized that the difference is that the spectroscopists don't keep signal-to-noise fixed when they vary the line widths! They keep the contrast fixed, and the contrast appears to be the depth of the line (or lines) at maximum depth, in a continuum-normalized spectrum.
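A quick numerical version of Bedell's demonstration (a toy model with a single Gaussian absorption line and fixed per-pixel noise, that is, fixed contrast; all the choices and names here are mine): compute the Cramér–Rao bound on the line shift from the Fisher information, and watch it scale as the square root of the FWHM at fixed depth.

```python
import numpy as np

def rv_crlb(depth, fwhm, noise=0.01, dx=0.002):
    """Cramér-Rao bound on a velocity-like shift of one Gaussian line.

    Toy continuum-normalized spectrum on a fixed pixel grid with fixed
    per-pixel noise; `depth` is the line contrast at maximum depth.
    """
    sigma = fwhm / 2.3548  # FWHM -> Gaussian sigma
    x = np.arange(-10.0, 10.0, dx)

    def model(v):
        return 1.0 - depth * np.exp(-0.5 * ((x - v) / sigma) ** 2)

    # numerical derivative of the model with respect to the shift v
    dv = 1e-5
    dmdv = (model(dv) - model(-dv)) / (2.0 * dv)

    # Fisher information for the shift; the CRLB is its inverse sqrt
    fisher = np.sum(dmdv ** 2) / noise ** 2
    return 1.0 / np.sqrt(fisher)
```

Doubling the FWHM at fixed contrast degrades the bound by sqrt(2), while halving the depth degrades it by 2, consistent with the spectroscopists' folklore.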

This all makes sense and is consistent, but my main research event today was to be hella confused.


optimization is hard

Megan Bedell (Flatiron) and I worked on optimization for our radial-velocity measurement pipeline. We did some experimental coding with scipy optimization routines (which are not documented quite as well as I would like), and we played with our own home-built gradient descent. It was a roller-coaster, and we still get some unexpected behaviors. Bugs clearly remain, which is good, actually, because it means that we can only do better than how we are doing now, which is pretty good.


if your sample isn't contaminated, you aren't trying hard enough

At the Gaia DR2 parallel-working meeting, Adrian Price-Whelan (Princeton) and I discussed co-moving stars with Jeff Andrews (Crete). Our discussion was inspired by the fact that Andrews has written some pretty strongly worded critical things about our own work with Semyeong Oh (Princeton). We clarified that there are three (or maybe four) different things you might want to be looking for: stars that have the same velocities (co-moving stars), stars that are dynamically bound (binaries), and stars that were born together (co-eval), or that have the same birthplace, abundances, or ages. In the end we agreed that different catalogs might be made with different goals in mind, and with different tolerances for completeness and purity. But one thing I insisted on (perhaps pretty strongly) is that you can't have high completeness without taking on low purity. That is, you have to take on contamination if you want to sample the full distribution.

This is related to a much larger point: If you want a pure and complete sample, you have to cut your data extremely hard. Anyone who has a sample of anything that is both pure and complete is either missing large fractions of the population they care about, or else is spending way too much telescope time per object. In any real, sensible sample of anything in astronomy that is complete, we are going to have contamination. And any models or interpretations we make of the sample must take that contamination into account. Any astronomers who are unwilling to live with contamination are deciding not to use our resources to their fullest, and that's irresponsible, given how precious and expensive those resources are.


latent variable models: What's the point?

The only research time today was a call with Rix (MPIA) and Eilers (MPIA) about data-driven models of stars. The Eilers project is to determine stellar luminosities from stellar spectra, and to do so accurately enough that we can do Milky-Way mapping. And no, Gaia won't be precise enough for what we need. Right now Eilers is comparing three data-driven methods. The first is a straw man, which is nearest-neighbor! Always a crowd-pleaser, and easy. The second is The Cannon, which is a regression, but fitting the data as a function of labels. That is, it involves optimizing a likelihood. The third is the GPLVM (or a modification thereof), where both the data and the labels are nonlinear functions of some uninterpretable latent variables.

We spent some of our time talking about exactly what are the benefits of going to a latent-variable model over the straight regression. We need benefits, because the latent-variable model is far more computationally challenging. Here are some benefits:

The regression requires that you have a complete set of labels. Complete in two senses. The first is that the label set is sufficient to explain the spectral variability. If it isn't, the regression won't be precise. It also needs to be complete in the sense that every star in the training set has every label known. That is, you can't live with missing labels. Both of these problems are solved simply in the latent-variable model. The regression also requires that you not have an over-complete set of labels. Imagine that you have label A and label B and a label that is effectively A+B. This will lead to singularities in the regression. But it is no problem for a latent-variable model: There, all data and all known labels are generated as functions (nonlinear functions drawn from a Gaussian process, in our case) of the latent variables, and those functions can generate any and all data and labels we throw at them. Another (and not unrelated) advantage of the latent-variable formulation is that we can have a function space for the spectra that is higher (or lower) dimensionality than the label space, which can cover variance that isn't label-related.
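The over-complete-label pathology, at least, is trivial to demonstrate with fake labels (this synthetic example is mine): if one label is the sum of two others, the regression design matrix loses rank, and the normal equations become singular.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
label_a = rng.normal(size=n)
label_b = rng.normal(size=n)
label_c = label_a + label_b  # effectively A + B: over-complete

# linear design matrix for a regression of spectra onto the labels;
# the constant column is the usual intercept term
G = np.stack([np.ones(n), label_a, label_b, label_c], axis=1)

# rank is 3, not 4, so G^T G cannot be inverted: the regression
# has no unique solution, while a latent-variable model is unbothered
rank = np.linalg.matrix_rank(G)
```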

Finally, the latent-variable model has the causal structure that most represents how stars really are: That is, we think star properties are set by some unobserved physical properties (relating to mass, age, composition, angular momentum, dynamo, convection, and so on) and the emerging spectrum and other properties are set by those intrinsic physical properties!

One interesting thing about all this (and brought up to me by Foreman-Mackey last week) is that the latent-variable aspect of the model and the Gaussian-process aspect of the model are completely independent. We can get all of the (above) advantages of being latent-variable without the heavy-weight Gaussian process under the hood. That's interesting.


playing with stellar spectra; dimensionality

In another low-research day, I did get in a tiny bit of work time with Bedell (Flatiron). We did two things: In the first, we fit each of her Solar twins as a linear combination of other Solar twins. Then we looked for spectral deviations. It looks like we find stellar activity in the residuals. What else will we find?
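That first fit can be sketched in a few lines (assuming the spectra come as a stars-by-pixels array; the function and variable names are mine, not Bedell's code):

```python
import numpy as np

def fit_as_combination(spectra, i):
    """Fit spectrum i as a linear combination of all the other spectra.

    spectra: (n_stars, n_pixels) array of continuum-normalized fluxes.
    Returns the least-squares coefficients and the residual spectrum,
    which is where activity (and who knows what else) should show up.
    """
    others = np.delete(spectra, i, axis=0)
    coeffs, *_ = np.linalg.lstsq(others.T, spectra[i], rcond=None)
    residual = spectra[i] - others.T @ coeffs
    return coeffs, residual
```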

In the second thing we did, we worked through all our open threads, figured out the next steps, and assigned tasks. Some of these are writing tasks, some are coding tasks, and some are thinking tasks. The biggest task I am assigned—and this is also something Rix (MPIA) is asking me to do—is to write down a well-posed procedure for determining the dimensionality of a low-dimensional data set embedded in a high-dimensional space. I don't like the existing solutions in the literature, but as Rix likes to remind me: I have to put up or shut up!