Categorization of users/data on the database
July 16, 2015 at 9:44 am #1423 — sakrelaasta (Participant)
I am not sure I have fully understood the way the database will work, so please excuse me if everything I mention is wrong (and also correct me).
So I was thinking…
It has been mentioned many times that the power of the SCiO is the people who do scans and add them (with the necessary extra information) to the database.
The good thing is that there will be a lot of people adding data.
The bad thing is that there will be a lot of people adding data!
What I mean: can I “trust” every scan and every piece of information that everyone will add? Probably not. And not because the user does a lazy job! Let’s say that there is a database for edible oils, so I add some data too…
– One day I go to the supermarket, buy a bottle of olive oil from a big company, scan it, and add all the data from the label. But in reality the data on the label are not for this specific oil; they are an average of all the measurements the company did across all of its oil. So maybe this scan + data is not the best for the database.
– The next day I take a sample of my small (home) olive oil production, scan it, and take it to a chemistry lab for analysis. This data will be much, much better.
So should these two scans have the same “weight” in the database?
What I think is that maybe there should be a categorization of users who add data, according to their background. For example:
~ Normal users
~ Users with a background in chemistry (or similar)
~ Users who had specific measurements done in a lab
~ Users that “have labs” (academic or private lab companies)
~ User labs with ISO certification
This way, after a user scans a new object to see the concentration of “chemical A” (let’s say the actual concentration is 31%), they can choose:
“OK… let’s see if I accept only the ‘lab data’ or ‘academic data’. Hmm, they are trustworthy, but there are only 5 different samples in the database and none of them appears to be close to my scan (their measurements are 5%, 7%, 15%, 50%, 52% and 60%)… not very accurate. I know that in my sample the concentration of ‘chemical A’ is between 15 and 50%, but that is not good enough. Let’s add the ‘measurements in a lab’. Ah, nice! Now there are 50 samples, and many of them are around 22, 24, 30, 35% – there are many of them close to my spectrum! Great, that is perfect! My concentration is 31%.”
Sorry for the long post… but I couldn’t wait until it arrives to check it myself!
Nikos
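(A minimal sketch of the tier-filtering idea described above, assuming each stored scan carries a contributor tier. The tier names, the Scan record and the estimate_concentration helper are hypothetical illustrations, not part of any Consumer Physics API.)

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical contributor tiers, from least to most trusted.
TIERS = ["normal", "chemistry_background", "lab_measurement", "own_lab", "iso_certified"]

@dataclass
class Scan:
    spectrum: list[float]        # normalized reflectance values
    concentration: float         # reported concentration of "chemical A" (%)
    contributor_tier: str        # one of TIERS

def distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two spectra of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def estimate_concentration(query: list[float], db: list[Scan],
                           min_tier: str, k: int = 5) -> float | None:
    """Estimate concentration from the k closest scans whose contributor
    is at or above min_tier. Returns None if too few scans qualify."""
    allowed = TIERS[TIERS.index(min_tier):]
    candidates = [s for s in db if s.contributor_tier in allowed]
    if len(candidates) < k:
        return None                      # not enough trusted data: relax the tier
    nearest = sorted(candidates, key=lambda s: distance(query, s.spectrum))[:k]
    return mean(s.concentration for s in nearest)
```

(A caller could start with min_tier="own_lab" and, on a None result, retry with min_tier="lab_measurement" — exactly the widening described in the quote above.)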
July 19, 2015 at 8:34 pm #1501 — redwingii@comcast.net (Keymaster)
Congratulations, you broke my brain…
After the 5th or 6th time reading it, I think I’m on your wavelength. Your question seems to be a variation of my question posted in Models. You were generic and used chemical A; I just said water. Let me try to restate it and find out if we are tuned in.
There is a substance (chemical A, H2O, sugar) whose concentration you want to find in whatever you scan. Can I take that substance’s spectrum and use it as a reference to find that substance in any random sample? (That is how it relates to my question.) I think this kind of analysis would require comparing more than one sample spectrum to the reference (sugar, water) spectrum of whatever you’re looking for. Right now, that’s where your algorithms come in.
Currently we set some attributes (sugar, H2O, etc.) and tell the CP cloud machine what the quantity of those attributes will be in the sample you’re about to scan. These attribute levels, along with the scan, are recorded. When an unknown sample is scanned, CP’s proprietary software compares all the data, finds the stored scan most similar to the unknown scan, and then estimates the attributes of the unknown from the known scans of that kind. To find an attribute in an unknown sample would require some math to be done on the reference sample and the unknown sample to eliminate the “noise”.
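(Restated as code, a minimal sketch of that matching step might look like the following. This is only a guess at the behaviour described — compare the unknown scan with every stored scan, pick the most similar, and report that scan’s recorded attributes — not Consumer Physics’ actual algorithm, and the function names are invented.)

```python
import math

def cosine_similarity(a, b):
    """Similarity between two spectra (1.0 means identical shape)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def estimate_attributes(unknown_spectrum, reference_scans):
    """Return the attributes of the stored scan most similar to the unknown.

    reference_scans: list of (spectrum, attributes) pairs, e.g.
        ([0.12, 0.34, ...], {"sugar": 10.0, "h2o": 85.0})
    """
    best = max(reference_scans,
               key=lambda ref: cosine_similarity(unknown_spectrum, ref[0]))
    return best[1]
```

(A real system would average over several neighbours and remove background “noise” first, as noted above, but the lookup itself can be this simple.)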
Thought experiment:
Ingredients
Salt
Sugar
Black microfiber glass cleaning cloth
The 2 attributes are salt content and sugar content.
Procedure:
Under New Sample
Pour salt onto the cloth and scan. This creates the profile of pure salt: 100% / 0%.
Pour sugar onto the cloth and scan. Same for sugar: 0% / 100%.
Under Test Sample
Pour a 50/50 mixture of salt and sugar onto the cloth, then scan.
I do not currently believe this will give a 50%/50% readout.
For that to happen, you would need to add the 50/50 mixture itself as a sample.
CP – Am I on the right path or lost in the woods??
BTW: Sak… You made me break my brain again….
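(A toy illustration of why the two pure profiles alone are unlikely to yield a 50/50 readout. The spectra below are made-up numbers, and the closest-match lookup is only an assumption about how the matching works: against pure references only, it can report 100/0 or 0/100 but never an intermediate value.)

```python
# Made-up spectra, not real NIR data.
pure_salt  = [0.80, 0.20, 0.10, 0.60]
pure_sugar = [0.30, 0.70, 0.50, 0.20]

references = [
    (pure_salt,  {"salt": 100.0, "sugar": 0.0}),
    (pure_sugar, {"salt": 0.0,   "sugar": 100.0}),
]

# Even if the 50/50 mixture happened to scan as the average of the two pure
# spectra (in practice it will not), a closest-match lookup can only ever
# return one of the stored answers.
mixture_scan = [(s + g) / 2 for s, g in zip(pure_salt, pure_sugar)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

best = min(references, key=lambda ref: squared_distance(mixture_scan, ref[0]))
print(best[1])   # prints one of the pure profiles, never {"salt": 50, "sugar": 50}
```

(Only once scans of the 50/50 mix — and other ratios — are added to the references can the lookup return an intermediate answer, which is the point of the thought experiment above.)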
July 20, 2015 at 2:17 pm #1506 — rejsharp (Participant)
Hi Redwingii – you cannot combine dry salt scans in the same model as liquid salt solutions. The molecular state is different, so the NIR resonance is hugely different.
I have added some scans of dry salt and sugar, and solutions of them into our previous thread.
Roger
July 20, 2015 at 8:50 pm #1507 — redwingii@comcast.net (Keymaster)
But we are not mixing dry/wet solutions. Salt and sugar are visually nearly identical. By having the spectrum of both pure salt and pure sugar, can the SCiO deduce a percentage concentration of both salt and sugar in an unknown sample?
I am sure that if we told CP “this is what a scan looks like for 50/50, 30/70, 25/75 and 10/90 samples”, then the concentration of “a mixture of salt and sugar” would be found via the closest reference spectrum. The spectrum created by a 50/50 mixture can’t be determined just by knowing the spectra of the base contents.
???????? no idea if I’m right ?????????
July 21, 2015 at 5:31 am #1508 — rejsharp (Participant)
Yes, you are right! With a series of test mixtures as references, the SCiO should be able to estimate the percentages in a mix.
(Sorry for misreading your text and muddying the water with solutions.)
Ahhh, there is of course room for error with powders of different grain sizes and densities – they will segregate out for fun! The fine stuff ends up on the bottom.
I realise this has drifted a long way from Nikos’ original post, which was asking about validating sample contributions. This is a critically important subject (rubbish in guarantees rubbish out). Consumer Physics did mention this a while ago, and I will raise the subject at the Paris Workshop.
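(A rough sketch of the mixture-series idea: with scans at several known salt/sugar ratios, even a simple least-squares fit from spectrum to salt fraction can interpolate an unknown mix. The spectra below are invented numbers and this is plain NumPy, not the chemometric model SCiO actually uses.)

```python
import numpy as np

# Invented training spectra (rows) for known salt fractions: 0%, 25%, 50%, 75%, 100%.
spectra = np.array([
    [0.30, 0.70, 0.50, 0.20],   #   0% salt (pure sugar)
    [0.42, 0.58, 0.40, 0.30],   #  25% salt
    [0.55, 0.45, 0.30, 0.40],   #  50% salt
    [0.68, 0.33, 0.20, 0.50],   #  75% salt
    [0.80, 0.20, 0.10, 0.60],   # 100% salt (pure salt)
])
salt_fraction = np.array([0.0, 0.25, 0.50, 0.75, 1.0])

# Fit salt_fraction ≈ spectra @ w + b with ordinary least squares.
X = np.hstack([spectra, np.ones((len(spectra), 1))])   # add intercept column
coeffs, *_ = np.linalg.lstsq(X, salt_fraction, rcond=None)

def predict_salt_fraction(spectrum):
    """Estimate the salt fraction of an unknown scan from the fitted model."""
    return float(np.append(spectrum, 1.0) @ coeffs)

unknown = np.array([0.61, 0.40, 0.26, 0.44])            # scan of roughly a 60/40 mix
print(round(predict_salt_fraction(unknown), 2))          # about 0.6 for these made-up numbers
```

(Real NIR calibration uses many more wavelengths and methods such as PLS regression, and the grain-size segregation mentioned above will add scatter, but the principle of learning from a ratio series is the same.)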
July 22, 2015 at 6:45 am #1571 — Hagai (Keymaster)
(Quoting Nikos’ original post above.)
Thanks Nikos,
You raise an important topic and, in general, your observation is correct. Collecting data from a community is complex and not trivial.
Our plan is to gradually roll out requests for data collection from the community, in increasing levels of complexity, and to establish tools such as those you mention.
As a starting point (Q4’15–Q1’16), we will collect data that is relatively easy to validate, for example identification of pills. We intend to ask users to scan medications and classify them. A ‘voting’ system will be implemented, so a classification will only be accepted once enough users submit similar scans of the same medication. Outliers will not be included.
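(A hedged sketch of one way such a ‘voting’ scheme could work: accept a community-submitted label for a medication only once enough independent scans agree, and discard scans far from the group consensus. The threshold values and function names below are invented for illustration, not Consumer Physics’ implementation.)

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def accept_label(scans, min_votes=10, max_deviation=0.15):
    """Decide whether a community-submitted label is trustworthy.

    scans: list of spectra submitted by different users for the same claimed pill.
    Returns (accepted, kept_scans): accepted is True only if, after discarding
    outliers far from the per-wavelength median, enough scans remain.
    """
    if len(scans) < min_votes:
        return False, []
    # Median spectrum as the consensus reference.
    consensus = [sorted(values)[len(values) // 2] for values in zip(*scans)]
    kept = [s for s in scans if euclidean(s, consensus) <= max_deviation]
    return len(kept) >= min_votes, kept
```

(Tracking which contributors’ scans keep surviving such a filter is one way to build the per-contributor trust mentioned next.)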
As we learn more, we will build trust in the quality of data coming from specific contributors, and over time increase the complexity and variety of data we collect from the community.
Hagai
July 28, 2015 at 6:35 pm #1655 — sakrelaasta (Participant)
Whoa!
I was expecting an email if someone answered, and I hadn’t checked. Sorry!
Hagai,
Thank you for the answer. That is exactly what I meant. Qualitative analysis is probably easier to organize: if someone scans a drug and their scan is nothing like the other scans of the same drug, you can automatically delete it.
Quantitative analysis will be tricky! (Good luck!)
If I understand it correctly, each app (yours and other creators’) will have a different database… one specifically (for example) about drugs, one for plant leaves, etc. Correct?
Will there also be a “general” database where anyone can scan anything, add a description and upload it?
Also, can creators make an app and then choose whether its database will be open for everyone to add to, or locked so that only the creator can add scans?
redwingii,
I may have burned your brain (and sorry for that), but it wasn’t enough to understand the way my crazy brain works!!
I have also thought about what you mention, and I agree with you.
You can understand what I meant if you read Hagai’s answer.
But let’s play that game and explain it again! I have the time.
I was just asking whether we (the users) could choose, during the analysis of a sample, if we want the algorithm to take into consideration only certain certified contributors, or to use all the contributors (who may not be so reliable).
July 29, 2015 at 7:43 am #1658 — sakrelaasta (Participant)
To clarify:
The “general” database that I mentioned above, as I imagine it, would be open just to developers and researchers. I just think it would be very useful for us to be able to quickly check an idea if we have access to a great number of random scans.
It could work in conjunction with the forum. For example, if someone has an idea like “geographic identification of agricultural products”, they could mention it on the forum; any contributors who wanted to could then easily scan 3–4 plants and “throw” the scans into that “general” database, and the person with the idea could do a quick check to see if there is any potential there.