Observation Accuracy Experiment v0.4 is Underway

We delayed Observation Accuracy Experiment v0.4 a week to not distract from City Nature Challenge 2024. We're asking contacted validators to assess the sample by May 13.

We made 4 changes to this experiment from v0.3:

  1. Validators are now matched to sample observations based on past ID behavior in the same country rather than continent
    Previously, we matched a validator to a sample observation if they had at least 3 improving IDs within the same continent. We're now requiring at least 3 improving IDs within the same country to try to better match observations and validator experience. If we couldn't find any validators for a sample observation within the country, we expanded our search to the continent and then globally, but we did this rarely. The samples are getting a bit large for some validators, so we hope it's not becoming a burden. Please let us know.

  2. We added a disclaimer to the message to comment here rather than replying to the message
    We've been receiving a large number of responses to the messages we send to contact validators that we aren't able to properly read and respond to. We added a disclaimer to the message with a link to this blog post asking people to post their questions and feedback here rather than replying to the message.

  3. We added more details about how to view your sample when viewing the message within the Android app
    Some people trying to access the message from the Message section of the iNaturalist Android app have been having trouble navigating to the sample. We included more details explaining how to do this.

  4. We added a search parameter to view observations included in an experiment sample
    Now that we're sampling 10,000 observations, clicking bars on a completed Experiment page is limited to exploring the first 500 observations. Longer term, we plan to make improvements to that page. But for now, we added a search parameter, observation_accuracy_experiment_id, that works similarly to how project filters work: you can use it to construct Explore URLs that show the sample of observations included in an experiment. For example, here are URLs for all the experiments so far (remember that the version, e.g. v0.4, isn't the same as the id, e.g. 5):
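As a rough illustration of the search parameter described in change 4, here is a minimal sketch of how such an Explore URL could be assembled. The helper name `experiment_explore_url` is hypothetical, the id 5 is just the example id mentioned above, and the extra `quality_grade` filter is included only to show that other Explore filters can presumably be combined with it.

```python
from urllib.parse import urlencode

# Explore page on iNaturalist; the experiment filter is passed as a query parameter.
EXPLORE_BASE = "https://www.inaturalist.org/observations"


def experiment_explore_url(experiment_id, **filters):
    """Build an Explore URL limited to one experiment's sample.

    `experiment_id` is the experiment's id (not its version label such as v0.4).
    Any extra keyword arguments are added as further query parameters.
    """
    params = {"observation_accuracy_experiment_id": experiment_id, **filters}
    return f"{EXPLORE_BASE}?{urlencode(params)}"


# Example id taken from the post; the second call adds an assumed extra filter.
print(experiment_explore_url(5))
print(experiment_explore_url(5, quality_grade="research"))
```

Keeping the experiment id as the first parameter and passing everything else through as keyword arguments makes it easy to narrow the sample by whatever other filters you normally use.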

Other than these four changes, the design and logistics of this experiment are the same as v0.3. As before, the Experiment page is live, but the results will be updating daily and won't be finalized until May 13. At that time, we'll update this post with more discussion of the results.
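The matching rule from change 1 (at least 3 improving IDs in the same country, falling back to the continent and then globally) can be sketched roughly as follows. This is not the actual iNaturalist implementation: the record type, the function name, the sample data, and the assumption that improving IDs are counted per exact taxon are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

MIN_IMPROVING_IDS = 3  # threshold stated in the post


@dataclass(frozen=True)
class ImprovingID:
    """One past improving ID (hypothetical record layout)."""
    validator: str
    taxon: str
    country: str
    continent: str


def candidate_validators(improving_ids, taxon, country, continent):
    """Return (scope, validators), trying country, then continent, then global."""
    scopes = [
        ("country", lambda i: i.country == country),
        ("continent", lambda i: i.continent == continent),
        ("global", lambda i: True),
    ]
    for scope, in_scope in scopes:
        # Count each validator's improving IDs for this taxon within the scope.
        counts = Counter(
            i.validator for i in improving_ids if i.taxon == taxon and in_scope(i)
        )
        qualified = sorted(v for v, n in counts.items() if n >= MIN_IMPROVING_IDS)
        if qualified:
            return scope, qualified
    return "none", []


# Made-up sample data for illustration only.
sample_ids = (
    [ImprovingID("alice", "Sympetrum", "US", "North America")] * 3
    + [ImprovingID("bob", "Sympetrum", "CA", "North America")] * 2
    + [ImprovingID("bob", "Sympetrum", "US", "North America")]
)

print(candidate_validators(sample_ids, "Sympetrum", "US", "North America"))
print(candidate_validators(sample_ids, "Sympetrum", "CA", "North America"))
```

In this toy data, an observation in the US is matched at the country level, while one in Canada only finds validators after widening the search to the continent.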

Thanks again to all the validators we contacted who are participating in this experiment. We wouldn't be able to conduct these monthly audits of iNaturalist observation accuracy without your help!

Results (added 05/13/2024)

The results of this experiment were very similar to the other experiments. The average Research Grade accuracy (fraction correct) was 95%. You can explore the results including clicking through bar charts to observations here.

From all 4 experiments we've conducted, we've now assessed 22,000 observations including 12,464 Research Grade observations. The graph below uses Research Grade observations from this combined sample to estimate accuracy subset by continent, taxon group, and rarity (<100 observations is rare) and sorted by the uncertainty (95% confidence intervals). We now have much better estimates than we had before this experiment, and for many of the common subsets (black) we now have large enough samples to get confident estimates. But for all of the rare subsets (orange) our sample sizes are still too small to be confident in our estimates. As discussed in the thread below, we will probably have to design an experiment with a non-random sample targeting rare taxa to include enough of them to reduce the uncertainty in the accuracy estimates for these rare subsets.

Thanks again to everyone who participated in this experiment! We know this was a busy time on the heels of City Nature Challenge and very much appreciate your helping improve these accuracy estimates.

Another one? And just after a solid week of identifying for the City Nature Challenge - I don't think I ever want to have to do an ID again!!!!!

Thanks for adding the - do not reply - disclaimer.

Thanks @tonyrebelo - we very much appreciate all the IDing help. We know the timing is not great given City Nature Challenge. Thanks for all you do

@andresvila same here! i just got one little observation of a captive donkey from 6 years ago... i wish i could help more

I identified 13 species of the order Odonata, but I don't know how good a test this is, as the majority of them were among the most easily identified species of dragonflies, mostly from eastern United States. Much better tests would be to dive into the tropics if you want to know what percentage of species are being correctly (or at all) identified. Perhaps you are doing that also.

I wonder why I got 53? Will try tomorrow when our internet is not as burnt out as I am.

It appears that the plea page must be responded to, in order to turn off the red mail flag.

My set this time matched what I identify much more! I'm a little sad I only got to identify two, but still happy I could help!

@arnim - a message just needs to be viewed for the red notice to go away, it doesn't need to be replied to.

I got my own observation for one of the ID's. Not sure what to do about that, but I did not ID my own observation.

Three of the nine I received were observations I had already made identifications on.
Two of those three were ones on which I made the "Improving" ID to the currently research-grade taxon.
How should I deal with this?

I also got a bunch that I had already added IDs to. I assume we don't need to add an ID again?

That's right. If you've already ID'd an obs and you still stand by your ID, please skip. Otherwise, update your ID. Thank you!

Like last time my set matched my IDing behaviour well.. maybe even a bit better now. I did 23 observations and the number is fine for me.. but I also usually ID much more in a day...

To the people wanting to help out more.. click on the "v0.4" under number 4 above and set your filter... I will probably check out the other spiders as well

A couple native species in the set this time yay!

Great. One suggestion I have is to maybe include the suggestion to NOT say an ID is part of the experiment unless the ID is broader than the community taxon. I keep getting notifications of people adding agreeing IDs because they comment that it's part of the experiment. The completely unnecessary notifications are pretty annoying. It's not a huge deal, but a suggestion in the message might ameliorate the issue.

I had one obs that was an "images showing different species" case. I did the usual: IDed (disagreeing) as common ancestor, left a comment and added my (the first) 'no' vote to 'Evidence related to a single subject'.

Done. Phew! Is there anyone else who finds the identify format incredibly difficult to use?

The new strategy for matching observers to observations seemed to be an improvement this time, at least for me.

However, it is less and less clear to me what purpose this experiment serves and what can be learned from it.

What does the "accuracy" being measured really tell us? The likelihood that any randomly selected observation will already have an ID that experienced IDers on iNat agree is correct? Whether a randomly selected observation can still have its ID refined? These seem to me to be fairly trivial.

The inclusion of non-research-grade observations in the last couple of rounds means that we are also looking at observations that do not yet have a community consensus -- i.e., I have several where the only ID is that of the observer, or it is a general ID like "flies" that could probably be refined by a specialist (which I am not), or it is at a high level because there is a wrong ID by an unresponsive observer and not enough people with relevant expertise have looked at it to override that ID.

What are we measuring in such cases? If I ID it as "flies" does that indicate it cannot be refined further? If I add a refining ID or add the additional ID that overrides a disagreement, what does this tell us except that the observation had not yet been reviewed?

I'll look at my set and add IDs to those where I can provide a meaningful contribution, but I honestly don't see any point in adding additional confirming IDs to common local plants or honeybees. There are lots of other ways I can spend my time that would do more to improve iNat's data.

Thanks @ajott - but just to be clear, an obs won't be considered validated unless the IDer meets the threshold of at least 3 improving IDs for that taxon. But for sure, chiming in and sharing expertise on obs will likely help get consensus around their ID even if people don't meet the official validator criteria.

I had 69 observations to ID and went through them, corrected some, and mostly supported others' ID's. Can't wait for the results!
And good thing to put not to comment (I can see hundreds of people, including myself, commenting if it wasn't mentioned).

How many of these are there going to be?

surprised at the people getting dozens of observations to ID -- I feel left out, having just the 5!

We can leave it without an ID, since it goes to Casual with the new DQA - was in the info from iNat.

(I did the usual, IDed (disagreeing) as common ancestor) @richyfourtytwo

I had 125 observations to review. One was of a Surf Scoter that had sat around as 'Needs ID' for 2 years! It's now RG. The other observations were a good mix of waterfowl, lady beetles, and other groups I occasionally ID. I was surprised by how old the observations were. Some were from 10 years ago!

I got 6 observations which seems to be the average.

One of my observations was Rubus armeniacus, which is a hotly debated taxon! Maybe someone else will say it's R. bifrons and the iNaturalist curators will try to sort it out a bit more. It's really a mess.

This time I got quite a few casual observations (no location, often no date as well, from accounts with very few observations that weren't active recently). Often the first ID was from a few hours ago, so I suppose they started out as unknown before this experiment? But honestly, I'm not sure how much can be gained (in terms of this experiment) from sticking yet another broad-level ID on those.

I only got 2 observations to ID and I’d be happy to do more. Wish I got a couple of South American Mangora, as they are my favorite, but oh well. Maybe I didn't meet the requirements for those :(

@iluvspiders, maybe next time :)
(I had only very few my first time on one of these experiments.)

@davidenrique agreeing IDs with included comments show up in your notices? My apologies, I didn't realize that would happen! I thought documenting all IDs related to this experiment would be valuable (even though it was more time consuming than just clicking the agree button); without that, it could be confusing to see multiple new IDs being added to observations that had been research grade for years, and those added IDs could be interpreted as trying to run up one's ID totals.

My problem is all the notices about this thread -- eight within an hour. I'm unsubscribing; I will check back later to see if my question was answered.

Several of the observations I was asked to ID in Observation Accuracy Experiment v0.4 were out of my state. I usually do not attempt things out of state because I do not know as much about what is possible and there are plenty of observations to work on from my state.

@larry216 Look up some famous observations, where there are dozens or more identifiers, all saying the same thing. Adding to their collection of species maybe? eg.

I have done my duty!

After all that IDing for CNC, I told myself I would take a break. But then I got my first request to participate in this project. Plan for break summarily ditched.

@huttonia Yeah, maybe next time. Turns out there weren’t any South American Mangora in the set of observations at all, so I got unlucky lol.

@larry216 Yep. I have notifications enabled except for agreeing IDs (with over half a million ID's, I'd drown in notifications if I got them for agreeing IDs too!). I don't want to miss dissenting IDs or comments, which are often important, so I do get notifications for IDs with comments. I totally get the desire to be transparent about the reason for the IDs, but all the notifications do get a little overwhelming lol.

Is it acceptable to ask the observer for more info? I have one to ID that is a plant that seems to be something that does not grow naturally where the observation is sited (even though obscured). I have asked them if it is a cultivated plant (not marked as such), but I don't know if that is appropriate for this experiment.

@davidenrique I wish there was a way to turn off notifications for Observation Fields. None of the available mechanisms seem to affect them, including muting the user adding them to hundreds, thousands of my observations.

@xris I know this isn't what you want per se, but there IS a way to restrict who adds observation fields to your observations. You can go to your account settings > Content & Display and change the "Who can add observation fields to my observations?" option to "curators" or "only you".
I don't recommend doing this, as observation fields can be useful, but that's an option if it's really affecting you too much.

@vireyajacquard Yes that sounds like a normal part of the ID process for a high confidence ID.

@davidenrique I don't mind them doing it. We've corresponded about their Project, and I want to support it with my Observations. However, I have a lot of Observations that are similar with respect to their selected Observation Fields. I'm aware that they're doing this work. I don't need to be notified about it every single time!

Muting someone should prevent you from being notified when they add an observation field to an observation you're following.

Posted by tiwane 5 months ago

@tiwane It doesn't when it's my own Observation.

Posted by xris 5 months ago

@xris hmm ok, that may be a bug. Can you email help@inaturalist.org with info, if you have a moment?

Done and done! Was able to ID most of them but there was a really tricky Sterna tern that was eluding me, so had to leave it at genus. Thanks for reaching out, will be happy to help (if I can) with any further rounds of testing!

@tiwane Sent!

It took me a while to figure out that I had to edit the url from "inaturalist.org" to "inaturalist.ca" before I could even log in. Then the same thing happened again when I clicked on the link in the original message, to get to this blog to see if anyone else is having the same problems. Pretty annoying to have "incorrect password" returned over and over when I KNOW I'm entering it correctly (for .ca, not .org, it turns out).

had around the same number of observations to review, but they were more relevant this time. It went faster for me and was also more satisfying.
It was notable that each observation I reviewed already had 2-4 others having added IDs in the last number of hours. I'm thinking that it may help a little to limit the number of people asked to review a given observation... there's pros and cons, but above say 4 qualified people it really does seem.... inefficient.

I have been asked to ID an observation which has the location set to Private. However, without that info I can't ID to species level. Should I still attempt to ID? Any other instructions on how to deal with this case?

What to do with a known undescribed species? It is identified to genus, but will show up only as an agreement at generic level, when most of the identifiers know it to species, albeit a still undescribed one.
I've only seen one case so far (my 48 plus about a dozen that have higher rank IDs posted for the experiment on my and other observations) so about 1 in 80 on biased sample.
We cope with these cases in our community by adding a field (https://www.inaturalist.org/observations?verifiable=any&place_id=any&field:New%20species%20reference%20and%20name=Strychnos%20sp.%20nov.), but the experiment won't know this.
I doubt it will be the only case in Exp 4, but probably irrelevant overall (<1%) - or perhaps not for some areas?
Anyway, as usual a fun exercise: only a few observations that I had to pass from other parts of the world or groups I don't know - most were within my domain.
Thanks for the opportunity to participate.

Would be great to have a 'multiple observations of the same organism' box to select in the Research Grade Qualification list. I often ID unknowns and there are often numerous observations from one account of the same organism. I leave a comment when I come across it to inform them of what to do but a lot of the time it's ignored. It must surely skew the data when 10 observations of the same organism all reach research grade?

Please rather word it as "duplicate observations of the same organism"

@kmackau make a feature request.
I won the multi-species battle, you tackle this one?
(Meanwhile I have a copypasta for that)

'duplicate observations of the same organism' is much better wording @tonyrebelo :) How do I make a feature request @dianastuder ?

Only one for me, as I am still not able to ID much. And I believe that it was two photos of different species. Does this mean that I should go through my observations to check for the same? Ha

Great! That's the first experiment when I got observations fitting my ID group perfectly.

Think about it - long, and hard @kmackau . Write the text, and edit.
Then go to the Forum and - add feature request. For another new DQA? That would be the elegant guidelined solution (especially if it also tips to Casual, until resolved by the observer)

PS - new route (mine was in the email to tiwane days) - https://forum.inaturalist.org/t/about-the-feature-requests-category-please-read-before-posting/69

My copypasta for these problem obs

See also (insert other obs numbers here, which is ... tiresome to do, but useful for the next identifier, and an eventual techie / DQA solution)
Please combine multiple pictures of the same individual of a species

@loarie about 20 or 30 obs would be fine - but 2 pages is one too many.
Perhaps you can tweak to offer a few more across identifiers.

@dianastuder - looks like we shared quite a few observations to check. Not surprising perhaps, but perhaps one way to rationalize number?
I don't mind the numbers. Would be happy to do, say, 100, if it was my interest group. One observation in my set (https://www.inaturalist.org/observations/21582820) took as long as all the others combined: a Giraffe in a zoo in Bakersfield - raised hackles: sun high for 19h22, no locality, two observations on at 19h30 is gloaming, no website zoo, farm or petting place shows Giraffe in the area. Looks like Maasai, but I wanted to check against zoo records, but cannot place it. [I have only identified the two subspecies of Giraffa giraffa, but this was only ID'd to a deprecated generic level because of a swap following the creation of 4 species from previously 1 sp and 7 subspecies, so I guess I was included for a California ID; my other way-out one was a sterile Pterocarpus to genus in Mexico, but their species are different from our African ones, so I just Dicotted it.]

@gpohl your same password should work on iNaturalist.ca and iNaturalist.org. If not, please contact help@inaturalist.org.

I'm confused. I clicked on v0.4 and 9k observations came up. Am I doing something wrong, and does it matter if I ID those? I haven't had anything sent to me via email.

I would love to be able to use the algorithm that matches me to observations for this experiment more generally. It would be neat to have a custom page of observations in need of research grade that have been matched to me as a possible IDer, based on my past IDing behavior. I think this would encourage me to ID more. Maybe there are good reasons not to do that, but just thought I'd put the idea out there!

Please remove me from inclusion in any future versions of the accuracy experiment, I think that the experimental design where I am asked to ID plants in areas where I would NEVER id (Mexico, California, Arkansas for example from this round of 108 observations (5 hours)) means that the experiment is deeply flawed, even with me opting for the high level taxonomic groups that eliminate my input- while wasting my time.

Since we have a link to this whole batch, may we filter to our taxon and / or location and see what is there for us to ID? Or rather NOT this batch until after the deadline?

@loarie A few points of clarification about ajott's question:

1.) If I have three improving IDs to a genus in a particular country, and an observation that I had previously ID'd to genus is part of the experiment, would I still count as a validator if I change/refine my ID (even though I was not sent it originally)?
2.) If the answer to 1.) is yes, when is validator eligibility determined/is it fixed? For example, I notice that before this experiment it appears I had exactly 3 leading and 2 improving IDs of genus X in country Y. One of my leading IDs was on an observation selected for the experiment, so now that leading ID has been converted to improving. Would that make me an eligible validator for that genus in that country now? If so, if I refine that one ID to agree with the community taxon, would I lose eligibility because my active ID would be supporting?
3.) On another observation I was sent, it seems like there is a chance the original community taxon was either wrong or too confident and might either be backed up to a higher level or actually change to a different taxa. Would a change to the community taxa to one where many of those of us who were sent it would not be eligible as validators cause us who were sent it to lose eligibility as validators, and would others gain it? (this was my main question, the first two questions were just setting it up; it seems important because we almost certainly are more qualified to say what it is not than what it is)

Depending on the answer to question 3, I think it could be an interesting follow-up experiment to re-send just the subset of observations that were scored as 'incorrect' or maybe also 'uncertain' to a larger set of potential validators (perhaps a higher target redundancy and inclusion of a more expansive set of validators from the common ancestor/new taxa), to see if the community taxon changes further. This could be an interesting cross-check of the validity of the validators in disagreements, and might help refine the ambiguity of the 'uncertain' category.. Obviously you would archive a frozen version of the results before re-sending anything, and such a set would not be random.

I know this is not a permanent feature, but I like this experiment. It would be a cool feature to opt into monthly messages with a link to a "random selection" of observations that fit your identification habits. Could be a nice way of getting older observations that for some reason have been overlooked in front of identifiers again.

@eyekosaeder Why not auto-email yourself once a month with a link? such as:
Posted by tonyrebelo 5 months ago

@tonyrebelo well, now that's amazing. I didn't know it was possible! Thank you! :D

My subset size jumped to 82 with this one and most seem to be good matches for what and where I usually ID. There's one outlier in a different country but I have in the past indeed added IDs there so that's just icing on the cake. And even more fun: One of my own observations is included in this sample. Awesome to see the IDs pour in! It's RG at genus level (cannot be improved) and so far nobody has dared to take it to species yet (which is good, because I honestly don't think it can be narrowed based on the evidence provided).

I got only 6 Chironomids. None of them were Id-able to genus let alone species.

My subset has 37 observations to ID. Piecemeal compared to what I usually identify. So just keep 'em coming. I have opted out of the CNC treadmill. In central Alberta it is barely spring when the CNC happens. I funnel my resources to more meaningful projects (such as this one) instead of IDing little snippets of greenery or leafless shrubs and trees. The observation subset is now fairly well matched to my identifying skills, which is great. From what I see now, most ID revisions happen to substandard observations that have blurry images, include more than one species, or lack locality data. It is too bad that such observations influence the end results negatively. I am looking forward to the outcome of this experiment.

Thanks @dianastuder - sounds terrifying! I'll have a look...

I felt confident enough to provide a species-level ID on 11 of the requested observations, genus/section-level ID on 4 observations, and I did not have the confidence to provide any ID for one. Thanks for letting me help out!

I do think some countries, such as the United States, are so large that it makes this a bit hard. I was getting observations from Washington State based on things I IDed in California, and while the species in question is in both places, the look-alikes are different from one place to another. For larger countries it might be helpful to break it up more than by country.

I am curious about something that came up. I am wondering how this experiment handles things like this.


This observation is a cottontail rabbit track, confirmed 7 years ago by several certified wildlife trackers. However, the experiment 0.4 asks folks who have previously identified cottontail rabbits to identify this track. I think the AI doesn't have fine enough sorting capabilities to sort out track images from images of the actual animals. So, people who may not necessarily be trained in tracking are being asked to identify something that's outside their wheelhouse. (A track rather than an animal.) That in turn lowers the quality of the original observation from "cottontail rabbits" to "mammals." So, in effect, this has made the classification of this observation drop to a broader category than it originally was. Is this something that the folks doing the experiment are testing for? I'd be interested in seeing the results of this experiment. Or maybe an experiment designed specifically for animal tracks and sign?

I identified all 14 observations I was asked to look at. Most were the animals themselves. Curious if others got a mix of types of observations too? I did get a few tracks in the first rounds.

This is fun. Keep it up. Interesting thoughts are coming to mind out of this experiment.

@beartracker I don't understand how adding an ID of mammals to a RG rabbit obs makes it lower quality. The community ID stays the same unless you explicitly disagree with the lower level IDs.

That observation is still "Cottontail Rabbits". No-one has disagreed, so that hasn't changed.

I guess I should have worded that differently. Perhaps diluted would be a better word. I am trying to figure out a better word.

@beartracker you can experiment with how the CID algorithm reacts by adding and removing your ID - a broader ID without disagreement has no effect on CID.

Not sure this really helps quality control, when I am offered previous IDs of three experts who all agree with each other. I confess to agreeing reflexively and only actively revisiting a couple a minute later because I felt guilty. In some parts of the scientific world, only blind IDs would count for quality control. Why not pick examples which are only just RG, give neither observation references nor prior IDs nor comments, and ask reviewers to do blind IDs?

@ditchingit the 'non-blind'/transparent ID process here is the same as for every museum and herbarium specimen that gets IDed by a second, third, etc person: each new identifier has full access to/can see previous det slips, including who made the IDs and what those IDs were. I understand your point, but iNat's approach is the norm rather than the exception

@charlie That certainly is tricky.. it is not even so much about the size of a country, but also about how diverse its habitats are. You might be able to ID at the seashore but not so much on the mountain top. However, in the end it is totally fine to not (or just roughly) ID observations one does not feel comfortable with

I was asked to participate despite being a generally broad-only identifier, not sure I added anything of value haha. I got a couple blob style photos I couldn't make out and would ordinarily just hit reviewed and skip, had to put a non disagreeing "life" ID on one that already had cID at order which I'm sure will look very silly to anyone looking back at the obs without context.

I was sent a sample of just one and it was in another country. Not familiar with plants there.

Done. This time was much easier. I had 17 observations, mostly from regions and species I am familiar with. IDing was much easier than the last time. I had one observation without a location, which did not make much sense to me.
However, it was great fun for me to participate!
Is there anything wrong if I use the filter to go through observations in my area and ID them?

Got 70+ observations this time, actually liked it, but it felt a little like an actual id session. ':D
Caught the wrongly ided bird, quite a slim chance for that!

Thanks for the great discussion and for participating. We've already had over 3k validators respond and over 94% of the sample validated. We can't tell you how grateful we are to be able to engage the expertise of so many of you to do this work. This incredible community of expertise really is what makes iNaturalist so special and we are so appreciative for your participation.

So far, the results are looking almost identical to past experiments with the research grade subset having 95% correct, 1% incorrect and 4% uncertain.

A lot of the conversation here is about whether the average accuracy of the iNat dataset is the right thing to measure. Given how biased it is towards common species, would it be better to orchestrate experiments to estimate the accuracy of rare species? I wanted to talk a bit about this topic.

The graph below shows dragonflies and damselflies (order Odonata) on iNaturalist. There are about 6000 described species of Odonata. On iNaturalist we've observed almost 4,000 of them. The black curve shows the number of observations by species sorted from most observations (Blue dasher) on down. This is on a log scale which shows you just how biased the iNat dataset is towards common species. Blue dasher has almost 100,000 observations, but only ~1000 species have at least 100 observations. Pale bluet is an example of a species with 100 observations. Many species like Dull Jewel have only 1 observation. Others like Swamp Groundling still have 0 observations.

The vertical line separates the subset of ~1k species on the left side of the graph with >100 observations (let's call them common species) from the subset of around 5k on the right side of the graph with <100 observations (let's call them rare, data-deficient species). This also roughly corresponds to the set of species in the Computer Vision and Geomodels vs those that don't have enough data. The pink line shows the number of IDers with at least 3 improving IDs (candidate validators) for each of these species. The ratio of candidate validators to observations is about 5%, i.e. a species with 100 observations has 5 candidate validators.

One of the interesting things about iNaturalist is that part of our mission is about getting lots of people connected to nature and part of our mission is about scaling global biodiversity monitoring. The former is focused more on the left side of the graph. Most of the dragonflies most of the people using iNaturalist are going to see are going to be species like Blue Dasher, and it's really important that iNaturalist works well and provides good information about these species. The latter is focused on the right side of the graph: most of the discoveries and important contributions to science and conservation are probably coming from this frontier of rare species.

Much of our philosophy for bringing these rare species into focus is just to grow iNaturalist as a whole. That lifts the whole curve. As iNat grows, yes we'll get many many more encounters with Blue Dasher (which is really important for our mission of building broad advocacy for nature), but we'll also get more rare species too.

But because these experiments are using a random sample of iNat observations, they are biased towards the left side of the graph. For example, in Experiment v0.3, there were 134 common Odonata in the sample, of which 128 were correct, giving us an estimate of 95% with pretty high confidence. There were only 3 rare, data-deficient observations in the sample. All 3 were correct, but this estimate of 100% is very uncertain because of the small sample size.
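
To put rough numbers on that difference in confidence, here is a minimal sketch that computes a 95% Wilson score interval for the two Odonata estimates above. The Wilson interval is just one common choice for a binomial proportion and is an assumption here, not necessarily the method used on the Experiment pages; `wilson_interval` is a name made up for this illustration.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Return (low, high) of the Wilson score interval for correct/total.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

# Counts quoted above for Experiment v0.3 (Odonata only).
common_low, common_high = wilson_interval(128, 134)
rare_low, rare_high = wilson_interval(3, 3)

print(f"common: {128 / 134:.1%} ({common_low:.1%} to {common_high:.1%})")
print(f"rare:   {3 / 3:.1%} ({rare_low:.1%} to {rare_high:.1%})")
```

Under this assumption the common-species interval is only a few percentage points wide, while the interval for the 3 rare observations reaches from below 50% up to 100%, which is why the 100% estimate says very little on its own.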

It sounds like there's quite a bit of interest about trying to estimate the accuracy of observations of these rare species. We can design a non-random sample that will do that. With fewer candidate validators, it will be harder to orchestrate and we may need different validator criteria. But we agree it would be very interesting and important towards better understanding the right side of the graph.

@misumeta I can't say for sure without checking, but it feels like each time I get more and more, not complaining, initiative leads to that. :)

I completed my 105 ID's, but possibly not as well as I often do with only the new observation that I just looked at to think about. While I knew that I had until May 13 to complete them, I wanted to get them all done. I usually ID a relatively large range of organisms, but mostly lowland, terrestrial organisms, and mostly in my "Puget Trough" bio-region, as I have delineated it. Appropriately I didn't get any salt water organisms from the Puget Sound itself to ID, and didn't notice any alpine organisms. It was a bit frustrating on a few ID's in the experiment group to be asked to ID things across the continent from me, where I don't know the look-a-likes that occur there, but not too bad. The majority of observations I was given to ID were in, or close to, my Puget Trough region. As I knew I didn't have to give more specific ID's than I could do in the time I felt like spending on it, the cross-continent observations I was given weren't too bad. Normally I vary a lot in how much time and effort I take to make an ID. A high percentage I may do in a second, but when I feel like it, I might take up to hours to work to get a good ID on something I don't already know well, and maybe that I want to know better. Any stats on how well I ID any group would be determined from the ID's I made after an unknown amount of time. I'm sure the average amount of time I spent per ID on the experiment group was less than the average time I spend otherwise, when I'm not aiming to complete the next 105 ID's. I almost felt like iNaturalist only wanted my quicker ID, so when I spent a bit more time on an ID I wasn't sure I was following instructions or expectations. For example, I spent some extra time going through the Similar Species section to make a specific cross-continent ID on a species I would have done quickly if it were in my Puget Trough area.

I also find I need to do a lot of review all of the time to keep up with my knowledge in any given group, some more than others, and I don't expect the sample I was given to ID reflected how up to date I was reviewing any given taxon group. For example, I used to do a lot more work every Spring and Fall, reviewing all of my fungi to keep up with my fungus ID knowledge, but haven't been inspired to do that much work reviewing my fungi for maybe 2 - 3 years now, so my current ability to ID fungi may be less than my fungus ID stats would show.

I now hope that what I did helps determine how accurate iNaturalist ID's are. While I think iNaturalist ID's are generally pretty good, I don't count on a "Research Grade" ID to be reliable.

this frontier of rare species

@loarie then please reconsider destroying placeholder text. Your rare species are often carefully named in 'shall we use this as a placeholder for you?' Yes, thanks!
That is what I use to - flag for curation - please add missing species.

We have discovered that iNatters can get around that working as intended, if they opt out of CID. Then the observer's intention is respected. But only ONE iNatter has shown me that.

@loarie Scott, I read with great interest your analysis above of the Odonate sample. Randomized sampling is great but I have many of the same questions about commonness-rareness and biases in the validator population sample (selection bias, not ID or personal biases). Here are a couple of questions that popped into my head:

If I'm recalling correctly, on the Experiment 0.4 page, there's a graph showing the number of validators vs the number of taxa assigned. It seems that the number of validators is ripe for biases. Let me preface this by saying that, while taxon commonness and ease of identification are not necessarily correlated, I expect they are strongly associated. This will give rise to this conundrum: If there are on average more validators assigned for common (read: easily identified) species, then the accuracy statistic will be biased upward. Conversely, if a few specialists are all that are "available" to validate a handful of rare (read: harder to ID) species but apply their expertise to those groups with enthusiasm, that could potentially bias the rare species towards higher accuracy. Perhaps the small subset of validators selected for looking at and identifying rarer species may possess a non-random level of knowledge and/or dedication to IDing such observations. I know that iNaturalist is not inclined in general to "select" experts (e.g. curators) or gauge expertise, but for the Accuracy Experiments, there would seem to be a need to look into stratified sampling of validators under some such criteria, both for common and rare species.

Speaking of stratified random sampling, I'd like to hear more about the randomized sampling of observations to obtain a set for each experiment. Is a random sample of the complete pool of observations truly the best way to eventually gauge "accuracy" of IDs? To answer my Q1 above or to delve into other nuances of biases in commonness or rarity, it might be useful to stratify the sampling of observations on the commonness scale to look at stats independently in three or four gross levels (e.g. abundant, cosmopolitan species; regionally common species; regionally rare; local endemics).
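
To make the idea concrete, here is a rough sketch of what stratifying the observation sample by commonness could look like. Everything in it is hypothetical: the tier names, the 10,000 cut-off, and the function names are invented for illustration (only the 100-observation line echoes the common/rare split in the Odonata graph), and it says nothing about how the experiment samples are actually drawn.

```python
import random
from collections import defaultdict

def commonness_tier(species_obs_count):
    """Assign a species to a tier by its total observation count.

    The cut-offs are illustrative assumptions, not iNaturalist definitions.
    """
    if species_obs_count >= 10_000:
        return "abundant"
    if species_obs_count >= 100:
        return "common"
    return "rare"

def stratified_sample(observations, per_stratum, seed=0):
    """Draw up to `per_stratum` observation ids at random from each tier.

    observations: iterable of (observation_id, species_obs_count) pairs.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for obs_id, species_obs_count in observations:
        strata[commonness_tier(species_obs_count)].append(obs_id)
    return {
        tier: rng.sample(ids, min(per_stratum, len(ids)))
        for tier, ids in strata.items()
    }

# Toy pool: many observations of abundant species, very few of rare ones.
pool = (
    [(i, 50_000) for i in range(1000)]
    + [(1000 + i, 500) for i in range(200)]
    + [(1200 + i, 3) for i in range(10)]
)
sample = stratified_sample(pool, per_stratum=25, seed=42)
for tier, ids in sorted(sample.items()):
    print(tier, len(ids))
```

A simple random draw of 60 observations from this toy pool would be expected to contain only about one rare observation; the stratified draw takes up to 25 from every tier, so accuracy can be reported per tier rather than being dominated by the abundant species.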

It's clear that the task of validation is not shared equally among the pool of validators. Comments here and elsewhere show the range of attitudes--for lack of a better word--towards being the recipient of high or low numbers of observations to validate. Are validators with just a couple of observations more likely to delve into the nuances of IDs and provide stronger support for outcomes? Are overtaxed validators who have too much on their plate or for whatever reasons must minimize their validation efforts going to offer weaker efforts at validation choices? This would be a hard spectrum to delve into but I sense from the various comments on each experiment that it may be a concern.

Thanks again for the superb analyses.

An experiment that tells us that RG observations of common and easy to ID species are generally ID'd accurately is not very informative. It does not require monthly experiments to confirm the results.

There's a difference between "rare species" and "difficult to ID species". I actually suspect that the really rare (and unfamiliar) species are probably ID'd fairly accurately, because the average user does not know enough to suggest them, particularly if they are species not included in the CV.

While many commonly observed species are also easy to ID, there are plenty of species that are common but difficult to ID for one reason or another. For example, I've been reviewing European Xylocopa observations. These are big, conspicuous bees and consequently observed fairly frequently (23,000+ observations in Europe). In the areas where the ranges are known to overlap, the misidentification rate of RG observations (wrong species, or a species level ID where it should be left at genus or subgenus) is quite high. I haven't been keeping statistics, but I'm fairly sure it is well over the 5% error rate of the iNat-wide experiment.

While such taxa may not make up a large enough portion of the dataset to substantially affect the overall accuracy rate, this does not mean that misidentification of these groups (and lack of expert IDers) is not a problem. Most people are not using iNat's data set as a whole; rather, they are interested in specific taxa. It means very little to be told that most plants are correct if one is studying mosses or ferns or even one of the trickier groups of vascular plants (daisies anyone?). The fact that honeybees are generally correct is of little value for someone studying solitary bees. Etc.

For difficult taxa, a high accuracy rate is likely to be a reflection not of how well the average user manages to ID the observations, but of the existence of a handful of indefatigable IDers who have gone through and corrected all the observations. Talking about "accuracy" in an abstract sense fails to acknowledge how much this depends on the knowledge and effort of specialist IDers, or the fact that if these people were to stop IDing the accuracy of that taxon would likely decrease substantially. For these taxa, the community is not largely self-correcting, because IDing requires knowledge and skills that the average user does not have.

I concur.
Rarer species, whether easy to ID or not, tend to be identified by specialists, or else ignored at some higher level (tribe, genus), unless similar enough to a common species to be misidentified.
These misidentifications may be an insignificant proportion of the common species, but may be a substantial proportion of observations of the rarer species.
Our specialist identifiers are the heart of documenting our biodiversity. Groups without specialists languish at higher taxon levels, as do groups urgently in need of taxonomic review.

iNat is currently incredibly slow; I have been trying to upload 50 observations for the last 3 days now. Thanks for the invitation to take part in the Accuracy Experiment v0.4. I will certainly consider such an invitation as soon as iNat restarts working at normal speed.

From Antilles? No problems in South Africa - just uploaded 217 in the last hour.

There is a general problem with images getting to "research grade" too easily. The following scenario occurs extremely frequently:
An observer posts an image they cannot identify. A second person offers an identification for the image. The observer then agrees with the suggestion, but actually has no personal knowledge to back that up. Bingo, the image is "Research Grade" but in reality only one person has offered an identification.

I suggest that where the poster has not initially identified the image, their agreeing with an offered ID should not count. Two independent IDs should be required.

Yes, that happens too often

I worked through these very fast because I'm teaching a class. I hope I did OK. Most were very appropriate observations to ID. I did chuckle at the palm tree from United Arab Emirates, not a place this botanist from Oregon usually ID's for, but I have to admit that I did ID a number of photos there once, when there was a problem school class. Unfortunately, all I could do with the palm was Palm Family.

Yes, I did chuckle a bit at a Rubus armeniacus/bifrons observation. We argue over the best name, so hard to know what the name posted on it may mean.

Help, what am I supposed to do with this one? Just leave it? :D

Somehow I got some plants to ID as well this time, even tho I never ID plants, only terrestrial isopods.

Do these need to be done by the end of the 13th or before the 13th? "By the 13th" can be interpreted both ways.

I got 20, which was quite manageable, as I normally do a fairly high volume of IDs on ants and wasps as it is, the paper wasps in the dataset did slow me down a bit though, I do like IDing those but I don't know them off the top of my head so I have to scan through my field guide for each one, but overall it was not a burden at all

I am confused as to why I got obs that were not RG? I thought this was supposed to be measuring accuracy of RG obs?

I hope by the end of the 13th - still have a few I wanted to check keys for but have to make it to the finish line for the semester first. Final exam week, graduation events over the weekend, and grades due by midday today has taken priority.

the deadline is the end of the day (2024-05-13 23:59:59 UTC) - thanks everyone for helping validate this sample!

I just want to point out that 23:59:59 UTC is not the end of the day in most time zones. Here in the eastern US, on daylight saving time, it is 1 second before 8:00 PM.
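
For anyone who wants to check what the deadline means locally, here is a small sketch that converts 2024-05-13 23:59:59 UTC to a few local times. The fixed UTC offsets are assumptions chosen for illustration (they match mid-May 2024 for those regions); a real tool would look up proper time zone rules instead.

```python
from datetime import datetime, timedelta, timezone

# Deadline as stated by staff in this thread.
deadline_utc = datetime(2024, 5, 13, 23, 59, 59, tzinfo=timezone.utc)

# Fixed offsets for illustration only; they ignore daylight saving changes.
offsets = {
    "US Eastern (EDT, UTC-4)": timezone(timedelta(hours=-4)),
    "US Pacific (PDT, UTC-7)": timezone(timedelta(hours=-7)),
    "South Africa (SAST, UTC+2)": timezone(timedelta(hours=2)),
}

local_deadlines = {
    name: deadline_utc.astimezone(tz) for name, tz in offsets.items()
}
for name, local in local_deadlines.items():
    print(f"{name}: {local:%Y-%m-%d %H:%M:%S}")
```

West of UTC the deadline falls in the afternoon or evening of May 13, while east of UTC it is already the early hours of May 14.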

I received two observations to identify. Both had previously been identified by me, and I did not change my previous identifications.

We've updated the post above with the results - thanks again everyone!

A focus on both: rarely observed and hard to ID (rare or not) species is certainly the way to go.
From my experience with insects in Africa, I see several problematic issues here:
(1) Observations are left at genus or family level or follow CV suggestions blindly because there are no experts active on iNat --> validation not possible due to lack of competent validators
(2) Observations are identified to species level because it has been identified on iNat to that species, but the taxon is in need of revision and IDs to species are not possible, one such example would be Eristalinus megacephalus in Africa which still has some RG observations https://www.inaturalist.org/observations?nelat=-8.2032838&nelng=38.2216904&place_id=any&swlat=-47.1313489&swlng=11.4696999&taxon_id=359895 (the issue is that E. megacephalus could also be E. tabanoides and currently there is no reliable info available on how to tell them apart or if they are a species complex). --> validators should be able to know what is identifiable and what should be left at subgenus/genus/subfamily. --> do we have these expert validators on iNat who know African Syrphidae?
(3) Observations are identified to RG because the community knows the identification was made by someone considered an expert and iNatters blindly agree on such IDs - good if the expert does not make mistakes but bad if the expert makes mistakes IDing something outside their narrow field of proper expertise (expert does not know the regional species or is too lazy to properly check before making an ID) --> validators must strictly not be influenced by "expert" IDs and do their own research before agreeing.
(4) Observations are left at genus or family level, but a finer ID is possible, but the community does not know how and where to look up these species and genera --> validators for this category do not exist
(5) commonly misidentified species, some due to faulty CV suggestions such as Neomyia which CV suggests to be Calliphoridae or Brachycerus made Bronchus by CV and other commonly misidentified species. --> Is there a way to filter out such problematic taxa for a validation sample?

However - please find an experiment design to look into the tricky species - that would add value to the data set.

@traianbertau - but in none of the Eristalinus megacephalus currently in southern Africa https://www.inaturalist.org/observations/identify?quality_grade=needs_id%2Cresearch%2Ccasual&taxon_id=359895&place_id=113055 have you stated this, or have you enforced it (by disagreeing when you added a subgenus/genus level ID - I see you have in some other observations).

I received an observation in Norway, where I've never been, so I was a bit confused about the geographical constraints in this version of the experiment. Reading the description, though, I can imagine one way this may have happened:

"We're now requiring at least 3 improving IDs within the same country to try to better match observations and validator experience."

I sometimes open global unknowns or kingdom-level IDs and try to ID them a bit more precisely, in the hopes that more specialized ID'ers will find them. Mostly these are plants, and unless they're local to me or they're a particularly recognizable taxon, I generally don't ID more precisely than class. But given the phrasing, it sounds like if I moved a few observations from "Unknown" to "Pinopsida" or "Magnoliopsida", that counts as an improving ID and I'm thereby qualified to ID in Norway?

Yes, I should have disagreed, but this is nothing one can support with a scientific paper, just a well known problem.

Excellent work.

Thank you for running the 4th version.

I look forward to participating in any future accuracy experiments.

As @insectobserver123 pointed out above, 23:59:59 UTC is not the "end of the day" for half of the world, and it wasn't an intuitive interpretation for me. Since this is a global effort, I think it would be a good idea to be more specific about the deadline in the invitation email.

I deferred identifying about half of my batch - mainly ones in distant areas where I wanted to check for local lookalikes - until the final day. I thought I had finished with hours to spare, but apparently I identified dozens of them after the deadline.

Yes, I hear "by 6/13", and I think the latest I can put in an ID is either 04:59:59 UTC on 6/12 or 04:59:59 UTC on 6/13, just like my college homework. I would never have imagined the deadline was at dusk

Do you have stats about accuracy versus the amount of people who have identified it?

@kroeckx I've never seen accuracy statistics based on the number of identifications but I can guess (based on experience). Observations with strictly less than three identifications are most likely to be in error, I think. Research-grade observations with exactly two identifications are the most insidious case since such observations do not routinely show up in searches.

@kroeckx, do you mean are we storing the ID ledger on sample observations before the experiment kicks off? We are not - but it's something we could do. It's partially possible to rewind/reconstruct an ID ledger to a previous point in time but not perfectly because (1) people can delete IDs and even observations, (2) while only one ID per person can be current at any time there are weird edge cases where someone could, for example, manually mark a more recent ID as not-current and an older ID as current that we can't reconstruct.

In future experiment results, could the Needs ID - Correct graph bars be separated into Needs ID where the initial ID came from the CV vs Needs ID where the initial ID came from a human? I'm assuming that right now the bars represent these two cases in combination.

Except, when the site is fast and I am feeling lazy, I let the CV get my IDs rather than typing them in from scratch. Esp when I am multitasking.
Also, for those senior moments when the name won't come ...

@loarie Sorry, you have a lot on your plate, but I also have a suggestion.

I almost only focus on lepidoptera at larva stage (like, more than 99% of the time?) and mainly got lepidoptera at adult stage in my sample. By the way, mainly very common and easy to identify species so I was able to confirm them.

For pterygota and maybe other branches, the identifiers community for larva/nymph stages is poor compared to the "adult" experts, each stage is so different that people tend to specialize in one of them.
Problems for this experiment:

The larva/nymph experts get mostly "adult" observations, which they are less comfortable with or not able to ID
You lose a pool of (already quite rare) identifiers for larva/nymph stages because those able to identify them received "adult" observations instead.

I know it's just a branch of all Life species that exist, but pterygota are a huge part of biodiversity, so maybe it would be worth it to include Life stage annotation into your identifiers pool's selection technique? And maybe keeping the % of larva/adult identifications they do (so those that can do both, will have both) ?

Other subject, I also agree with the problem of very common and easy to identify species in the observation selection. I understand that statistically if we need to get an overall accuracy of ID it's important to do so, but maybe sub-experiments focusing on, for example, the 80% of less common species, would be more useful to work on accuracy improvement (but that will not erase the problem with common species that are difficult to identify)? Maybe there is also something to do with the % of a species' observations that have/had a disagreeing ID, to point out where we can work to improve accuracy?

About the annoying comments, which some people have pointed out, from identifiers specifying that their ID is for the 0.4 experiment: I unfortunately do it myself, because I have had cases of people who walked back their ID when they saw mine at a coarser level (but not disagreeing), possibly because the broader ID made them doubt themselves (when it was just because I lacked knowledge of the species). On the other hand, if they doubt so easily, maybe their ID was not serious enough...

Thanks for those experiments, having huge number of observations to look at is not a problem on my side. *

Side question, many observers blindly choose the AI suggestion as first ID, and that can create a vicious circle of wrong observers' IDs --> wrong AI suggestions (I saw that for example for the Noctua pronuba and Operophtera brumata species, it works much better after a cleaning of the wrong IDs). I wonder if observations at Needs ID stage are excluded from the learning set if there is only 1 ID made by the observation that select an AI suggestion?

For the purposes of this experiment I suspect that distinguishing between life stages would distort the results, because only a small percentage of observations are annotated. Using only annotated observations would greatly limit the sample pool in ways that are likely not random (insofar as there is a correlation between people annotating observations and checking IDs while doing so).

In terms of matching IDers to observations, life stages (and sometimes other things like sex, type of evidence and media provided, etc.) are of course very relevant, but testing IDers' knowledge was not really the goal in this case. There might be some interesting data related to annotations (i.e., are certain life stages more or less likely to be assessed as being accurately ID'd), but this seems very taxon-dependent and hard to generalize across the entire data set.

Identifier used CV - tells you nothing about the reliability of that ID.
CV is the quickest and easiest tool to get to commonly observed taxa.

You need to judge for yourself whether this identifier - is reliable or simply good intentions, new to iNat and floundering along.

Agreed, I really don't think whether the identifier clicked the CV suggestion is a good variable to control for. I never type out the full name of the thing I'm identifying; if I already know the CV suggestion is right, I select that, and otherwise I still only type just enough characters to bring it up in the suggestion drop-down so that I can select it there.

@spiphany Fair point about life stage annotations often not being filled in, I'd forgotten that. The community does it a lot for lepidoptera in Europe but many are still missing. It's still possible to do something with the annotated observations that have been selected for the Experiment (like "In our Experiment sample, in this Order we have X% of un-annotated, Y% annotated at stage A, W% annotated at stage B, so all identifiers will get X% of observations from this order without annotation, and annotated observations will be split differently based on behavior of identifiers"). Yes, it sounds like a lot of work... it can be generalized to all taxa though, even if not useful for all of them.
Using AI to detect life stage could work but that's another subject and will take more time for the Experiment's organizers.
Testing IDer's knowledge might not be the goal, but giving identifiers observations they have no knowledge for seems to be a problem to judge accuracy of observation's ID.

@dianastuder @guerrichache I was focusing on the ID made by the observer of the observation (sorry, I wrote "observation" instead of "by the observer" in my last sentence) and where it's the only ID made (nobody reacted to it). Indeed identifiers can also use it for their own observations but we could have some side rules like "if the AI suggestion has been selected but the observer has already made X identifications of this species then we keep this observation". Still not perfect but that would remove a huge volume of unsure IDs from the learning dataset. Many experienced iNat users also do that. When the CV has learned a species badly, keeping all those observations just reinforces the wrong learning. This can be fixed if identifiers correct them, but sometimes there are just zero or not enough identifiers on iNaturalist able to identify this specific species/genus. ==> This subject is distinct from the Experiment, I should talk about it in a CV thread, I just jumped on the occasion to ask a question about it in my previous comment.

I think to me, the issue is more about the user interface; I suspect most users select CV suggestions because it's faster and more reliable than typing the full suggestion; the way the ID interface works, you have to really resist the obvious ease of picking whatever is in the drop-down. If we discard suggestion selections from observers, I'd guess we're discarding the majority of all observer IDs, whether overconfident or not. I'd suspect that avoidance of the suggestion drop-down doesn't correlate with quality of IDs, but rather with comfort with the interface or with technology generally.

Which isn't to say over-confident observer IDs aren't an issue since there's added excitement and sense of ownership from observers (I've been that overconfident observer!), but maybe a more consistent way to address the issue you raise is to just ignore observer IDs for this purpose?

@prunhel you might get closer to your 'bad IDs using CV'
if you paired the Pre-Maverick project with

Observer used CV
Observer's ID was rejected by 2 other identifiers.

This observer has a bad track record of adding joke IDs - but how would / could you know that.
Very difficult to prove, even if you could retrieve a list of obs as 'proof'. I see wrong IDs which I attribute to slow internet, interface glitches, newbies learning to iNat.

Duress users who don't care might fit your criteria.

And frankly if you are new to iNat - we are pretty sure this is Species A - sounds convincing, I'll take it.

Many, many interesting and often important points have been raised by so many people, and thanks for that. One of my thoughts is that just about the most important thing we could be doing is to make darned sure that all Research Grade observations are correct. They go into GBIF, where they are used for who knows how many "big data" studies. Every incorrect data point on GBIF is harmful to science, IMO, although I know of course that there are plenty that do not originate from iNat, even a fair number of misidentified specimens.

I feel strongly that experts should be combing GBIF (iNat, in this case) for such errors. A colleague working on a new checklist of Mexican Odonata found a rather shocking number of species in a variety of databases that were obviously misidentified, as they couldn't occur in Mexico. Someone without that level of knowledge could have published such a checklist with numerous species that shouldn't have been on it.

In working with Middle American Odonata I find a too-high frequency of RG observations that are incorrect. Even though probably below 1%, that's still hundreds of observations. I wouldn't be surprised if the frequency is even higher in South America. And dragonflies are relatively easy to ID compared with many other insect orders.

One of the biggest causes of bogus RGs has been mentioned already. The original observer doesn't have a clue about the ID, and if someone comes along and gives it a name, the observer is quick to agree, and we have an RG identified really by only one person. I think it would be wonderful if iNat initiated an algorithm that would prevent the observer from contributing further to the ID.

AI complicates the picture considerably, as so many people seem to accept it as infallible, with the suggested ID not worth another thought. Working with New World tropical Odonata, I get a fair number of IDs of species from the other hemisphere, and when I reidentify them I add something like "genus not known from the New World." But I suspect that in most such cases the observer couldn't care less.

Well, enough complaining. iNaturalist is a wonderful resource, and we should be so grateful for all the work that has gone into it and its usefulness. But I know it could be better, and to me pleasing the community is great, as long as we keep the emphasis on science.

Maybe we need an IdentiFriday or a project - to monitor distribution maps for Out of Range.
Should be marked Not Wild.
Or - wrong taxon, this is Not Found Here.
I correct them when I trip over them.

I would like CV's Seen Nearby to have a higher barrier than - one ID, by one identifier, on one obs. That very quickly becomes a vicious spiral.

@dianastuder I agree that it is helpful to check observations out of the better documented range for a species, and while most species identifications that are out of the known range of the taxon they said it was are misidentifications, we can't assume an out of range observation is misidentified and the "wrong taxon ... not found here", and can't assume it is not wild because it is out of (the known) range. (If it is both out of known range, and in a residential area, I may call it "not wild".) Range maps are created by observations of species, they don't restrict the species to those maps.

When I find an observation identified as a species far from other iNaturalist observations of that species / taxon I sometimes offer the observation map to the observer and say the identification they made is one for a species / taxon that hasn't been recorded in that continent, or north of / south of ... a far away place, and ask if they can tell us the distinguishing features for the species / taxon they indicated it was. Ideally I can explain the features that their suggested species has that don't fit the species they say it is, and can also offer a good / better identification of their observation, and I might also include an observation map for the species I am suggesting it is.

Also invasive aliens - we have them - and we are also the source for other countries in turn.

Yes, many of those observations out of the usual range may be wild, but invasive / naturalized , or maybe just becoming naturalized aliens.

The CV being used to put a name on the observation does not mean the CV was used to identify it. Like many others, I use it to get the name quickly spelled accurately.

Checking out of range observations does have to be done carefully. I learned that several of our lovely western North America wildflowers are now wild in Europe! However, I mark the unidentifiable immature plants as some more general name (hard disagreement) and mark probable garden plants cultivated (with request for clarification) because I think unusual records require at least good evidence.

In my experience with grasses, obscure species that are rarely photo'd are usually correct because only people who know them photo them. Of course, there are errors even there.

