Hello world,
I am working on a little mod-bot, which will assist in the archival of sunken threads.
It takes into account two major elements, the posters and the words in a thread, as well as the post count.
Please don't spend the time telling me why I should not include post count or your views on how to make an AI,
for AI is a tedious and new field of research, and I don't want to waste this thread on a tantrum about why I made these choices:
I have excellent reasons for them and will not answer anything questioning the engineering approach.
The thing is being discussed in the Mad Sci forums for more technicalities.
I am preparing the Knowledge base of this bot in xml, as a tree telling who are considered good posters for a given forum,
and what are considered good/bad words.
So what I need here is for the Asylum to tell me Guru names by forum: it's a difficult task, but I am asking you to vote
for the people you think were most useful, Mad-Scis or not, in a given forum.
Somebody, please, do something that makes sense about this...
Whether you understand it or not, what is being developed here will make for better thread preservation, better archival, less redundant information, less space and bandwidth load on the Asylum server, and therefore faster access speeds and less hassle for moderators, who will be able to focus on other tasks.
This is reallyreallyreally important.
And I am not asking anybody to get judgemental, I am asking for information that will be priceless for me.
Please, everybody, let down the ego, walk in my nerdy shoes for a second, and try to see things this way: I am able to deliver
a golden enhancement to the Asylum, I need you to dare pick a few mods you like. Without you, I am not getting anywhere.
And if you really want to give me tips about how to develop, or have questions, email away at brundle21 at hotmail * com, but keep this thread safe from things that are not what I am asking for.
Please. Some Mad Sci, back me up on this one, I am striving to get this thing done and to keep posts like Hughe's (Mad Sci forum)
from causing interference.
_Mauro, calm down mate. Give people a chance to read, digest and reply first, hmm? Before you go off assuming no one cares or that the world is out to get you, wait a day or so, or maybe two, for replies. I, personally, am not going to venture an opinion one way or the other on this, for now. I've got too much going on to be drawn into any kind of debate/whatever-the-hell-else. Give me a few days, then I'll say my piece.
Precision. Clockwork. Don't mean to be harsh, but am 150% focused on getting it right. Can't explain being a nerd. Can get the job done though. Kinda like C3PO...
Do you want the guru and the forum name, or just a bunch of gurus?
DHTML/Javascript => poi/Ini
CSS - DOM - XHTML - XML - XSL - XSLT => reiso/Blaise!
Stupid Basic HTML => reiso
Photoshop => Tao
Philosophy and other Silliness => jade
Blaise, I'd marry you. That's exactly it, e-x-a-c-t-l-y. Skaarjj, I understand, thank you, there's no hurry, I just want to avoid at all costs the tantrum, which gets confusing to me. Some day, I'll post something about an extreme coder's don'ts, to let people understand why I sound like that at times.
Entire history, please categorize them though. Entire, entire, entire history, and even alternate or past nicknames.
Because we want to be able to treat any thread, and we want the system to recognize those good posters if/when they come back.
DL-44 is in the category named "global", which means the guy adds value to the thread regardless of the forum.
Btw, this is another category that matters, GLOBAL. Doc does that too, he makes sense wherever he posts.
From: Rochester, New York, USA Insane since: May 2000
posted 05-11-2006 19:36
I would also add that threads with a post by Weadah, TwiTch^, Mikey Milker or eyezaer would tend to be good threads.
Emperor would be another one you might want to put some relevance on, but be careful with that one, as he touched almost every thread there for a while.
If anyone has a copy of the old site's post counts those at the top with a few exceptions tended to be the ones giving a lot of positive information and threads they touched tended to be worthwhile.
"Please don't spend the time telling me why I should not ..."
"don't want to waste this thread on a tantrum about why I ..."
"let down the ego"
"Wether you understand it or not ..."
My God Ini, you sure know how to ask for help. My list would be similar to DL's except I'd include Ramasax, Pugzly and the Doc.
Basically, the bot is wondering about stuff like "hmmm... this poster adds value to this thread and has posted x times among y posts. Besides, among A words, there are B percent positive keywords, relevant to this or that topic. The poster tends to add more value to this kind of forum, and these keywords correspond to this and that and that forum. In addition, there is a high post count for this thread."
With a well balanced averaging on these factors, this simple, default model can model all sorts of relationships between poster, words, corresponding forums, and the notion of "relevance".
It was a pain to explain, and a pain to get right at first. Hence the early "panic attack", but I think I got it right, so now you can crit it while adding to it, and suggest ideas.
But I think I can prove most discussion situations are covered by a model like this with an appropriate set of rules to balance the factors.
If I can't, more power to the people, and we can correct a wrong foundation. Sounds right?
Interesting initiative and a cool application. I hardly understand the beginnings of AI, but it's interesting & cool just the same.
I'd like to add DL-44 to Photoshop, Big sigs & CSS - DOM...
Looked through your XML file there _Mauro, I don't know how you plan to match posters against posts, but if you are using nicknames you need some spell checks on them nocknames here and there
Then another Q, (yes I know it's tech but it might help in the selection of Gurus)
How much weight do you put on frequency of posts from a poster if the other criteria are met for the post? I'm wondering basically since there are ppl that post seldom but put a lot of energy into the posts they actually write (I know that I tend to be one of them) and I don't want those posts to fall between the cracks.
Good luck with this _Mauro! It will be very interesting to see how it turns out
/D
{cell 260}{Blog}
-{" Computer:
"As a Quantum Supercomputer I take advantage of Zeno's paradox"
C: "Imagine a photon that must travel from A to B. The photon travels half the distance to B. From its new location it travels half the new distance. And again, it travels half that new distance."
C: "It continually works to get to B but it never arrives."
Human: "So you keep getting closer to finishing your task but never actually do?"
C: "Hey, coders make a living doing that?"
"}-
It's really the beginning of such a thing that is hard to get right, and it's hard to avoid such a thing drifting immediately into too much complexity, but from now on, all corrections
from spell checks on nocknames to <anything> can happen, I don't have a problem with any kind of input past this point.
I'll try to expand a little bit...
- a thread has a "volume", probably word count. Meaning that 100% of a thread = all words, regardless of thread post count.
- the thread post count gets used like a boolean fact of some sort: "thread contains more than x posts" yes/no, and has a low impact on the final balance. I'd say a 10th, maybe a 5th.
- keywords, bad and good, can be sentences as you can see. "thanks for sharing" typically means additional value for the thread. How much depends on the word count of the thread.
This is still to be determined when implementing the actual rules (these are only the facts of the facts/knowledge base), but the keyword added value is either general, in which case
you get a global impact on the thread's worth, or associated with a forum, in which case you get a "likeliness to belong to that forum" AND added value on a global scale.
- good posters' appearances, again, should be used as a fraction of the thread post count, and should add to the global value, again, as well as to the "relevance to this or that forum" factor.
This is a rough draft of the next, most tedious part of the project: setting and tuning the rules.
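The facts listed above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the bot's actual code: the function names, the word-splitting regex, and the choice to count each keyphrase hit once against the word count are all hypothetical.

```python
import re

# Assumption: a "word" is any run of letters, digits or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")

def thread_volume(text):
    """Thread volume = word count, regardless of how many posts there are."""
    return len(WORD.findall(text))

def keyphrase_rate(text, keyphrases):
    """Fraction of the thread volume made up of keyphrase hits.

    A keyphrase may be one word or a whole expression such as
    "thanks for sharing"; each hit counts once (an assumption).
    """
    volume = thread_volume(text)
    if volume == 0:
        return 0.0
    lowered = text.lower()
    hits = 0
    for phrase in keyphrases:
        # Word boundaries keep "new" from matching inside "news".
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        hits += len(re.findall(pattern, lowered))
    return hits / volume

def has_enough_posts(post_count, threshold):
    """The boolean fact "thread contains more than x posts"."""
    return post_count > threshold

sample = "Thanks for sharing this, great tutorial on layers"
rate = keyphrase_rate(sample, ["thanks for sharing", "great tutorial"])
```

Here `sample` has 8 words and 2 keyphrase hits, so `rate` is 0.25. Counting multi-word expressions by the number of words they cover would be an equally defensible choice.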
So, to answer your question, a small thread's relevance would be treated like a big thread's: one single post by a good poster would highly impact its value (and relevance to topics),
and since a single intense post contains more words, it is more likely to contain keywords.
If your thread is that intense and the rules are finely balanced, it would just be kept.
....
Hell, I think that when the autobot goes live, it should first just *hint* at a classification, then, later on, classify, but leave out unclassified entries, leave them in the sink, those that do not seem relevant.
It shouldn't be doing a straightforward job of massacring randomly, but should lighten the real mods' task by 90% by reducing the archival load appropriately
and ONLY WHEN IT IS REALLY SURE IT'S DOING A GOOD JOB.
Thanks for the guru praises. I wish I could post more often, but as of now most of the stuff I do is covered by this NDA thing ... or part of my next JavaScript demo.
DmS: I guess the overall post frequency (i.e. post count / Insane since) of a 'guru' will influence the relative weight of each of his/her posts. If a 'guru' has a low post frequency, then it must be a bad ass guru, and each of his/her posts is worth its weight in black pills.
_Mauro: How do you know what are the positive keywords, relevant to this or that topic ?
Ok _Mauro, that clears things up a bit for me.
As you say, as long as all the important parameters are found initially, the weight and priority of each can be adjusted at a later point. Adding parameters afterwards is way more complicated, since every new one basically resets all the properties of the existing ones.
It can quite quickly become very complex to tune all the rules though if there are too many parameters.
I've been slightly involved with fraud detection in online gaming, and it's hell to solidly detect patterns in large volumes with many parameters involved. It can be done, and it is being done, but it's hard to be sure. Cannot share any of it though, proprietary code.
And for starting out with hinting a thread during training, all 10 thumbs up!
That will give a very good indication on how it works.
I'll back out of this now and let the thread go back to what it's for.
Selecting posters
And here's another Guru:
Steve for Multimedia/flash & photoshop (perhaps not frequent but darn good)
And thanx DL it's really appreciated. I like to share
/D
quote:
Mauro: How do you know what are the positive keywords, relevant to this or that topic ?
They're not words only, but mottos, expressions, commonplaces relevant to the topic. I have an update btw, which will help me explain, and hopefully live up to the challenge: http://www.beyondwonderland.com/asylum/knowledgebase.xml
This is really just a matter of tuning, but everybody should help me tune. For many reasons.
For one, it's hard to give "ratings per forum" to gurus, I mean: who am I to judge all alone?
But we have to be honest to the machine, and after all, this pseudo-"mark" just means "relevance of poster to a topic". This is the crux, sadly, oddly, funnily.
While two forums share similar keywords, they don't share the same good posters, and they don't share ALL goodkeywords, so the relevance will fall on one side or another.
Some ambiguities will appear, and this leads us to...
The crunchy bit.
When those ambiguous moments come, the ES must ask a human being for an additional "fact" that will help make things fall in one category or the other.
For example, an additional keyword.
OR.
An additional key poster.
Which reduces the failure rate as the facts base grows until it really tends towards 0.
Hence the three major stages in developing such an app.
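One way to picture the "ambiguous moment" is as a close race between the two best forum scores. The sketch below assumes exactly that; the 5-point margin, the function names and the wording of the question are all made up for illustration and are not part of the actual ES.

```python
def is_ambiguous(scores, margin=5.0):
    """True when the two best forum scores are closer than `margin` points.

    `scores` maps a forum name to a relevance percentage. The margin is a
    hypothetical tuning knob, not a value taken from the real system.
    """
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2:
        return False
    return ranked[0] - ranked[1] < margin

def question_for_human(thread_title, scores):
    """Build the request for an extra fact (a keyword or a key poster)."""
    top_two = sorted(scores, key=scores.get, reverse=True)[:2]
    return (
        f"Thread '{thread_title}' could go to {top_two[0]} or {top_two[1]}. "
        "Can you give me one more keyword or key poster to settle it?"
    )

scores = {"DHTML/Javascript": 62.0, "Server-Side": 60.0, "Photoshop": 8.0}
if is_ambiguous(scores):
    print(question_for_human("Sample thread", scores))
```

Each answer becomes a new fact in the knowledge base, which is how the failure rate is expected to shrink as the facts base grows.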
So we should really fill this up according to this structure by expressing our "own private chart".
Posters, in my xml, who are not associated to a specific forum are global posters.
Doc O is the only one who has a "permanent" rating of 5.0. Not sure this is honest, but in the spirit of teaching the AI this is Doc O's place, and due to the fact most of the work
here has been enhanced by the ideas spread by the good Doc, I think it is.
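To make the global vs. per-forum distinction concrete, here is a small sketch of how such a file could be read. Big caveat: the element and attribute names are invented for illustration and are NOT taken from the real knowledgebase.xml; only Doc O's 5.0 rating and the "no forum means global" convention come from this thread, and Tao's rating is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: tag and attribute names are assumptions,
# and the rating given to Tao is a made-up placeholder.
SAMPLE_KB = """
<knowledgebase>
  <poster name="Doc O" rating="5.0"/>
  <forum name="Photoshop">
    <poster name="Tao" rating="4.0"/>
    <goodkeyword>layer mask</goodkeyword>
  </forum>
</knowledgebase>
"""

def load_posters(xml_text):
    """Return (global_posters, posters_by_forum) as dictionaries of ratings."""
    root = ET.fromstring(xml_text.strip())
    # Posters sitting directly under the root belong to no forum: global.
    global_posters = {
        p.get("name"): float(p.get("rating")) for p in root.findall("poster")
    }
    posters_by_forum = {}
    for forum in root.findall("forum"):
        posters_by_forum[forum.get("name")] = {
            p.get("name"): float(p.get("rating")) for p in forum.findall("poster")
        }
    return global_posters, posters_by_forum

global_posters, posters_by_forum = load_posters(SAMPLE_KB)
```

With this sample, `global_posters` holds only Doc O and `posters_by_forum["Photoshop"]` holds only Tao.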
Another nice game: finding the right "key expressions", "mottos", that reflect the spirit of this or that forum, and the relevance or irrelevance of threads.
Oh, and I'll take your suggestions into account, but should sum them up for voting at some point, as I really am not able to judge everything here at a glance,
and it is not my duty on this project.
One piece of good news though: the more good posters we throw in, with an accurate rating for relevance, the more accurate the ES will grow, because it will have more facts to study.
Ok, I am starting to really like how things turn out.
It still is xml only, but now it is stuffed with "keywords"... which only matter in terms of relevance.
So, to spot them, I just browse through old threads from the sink and highlight whatever sounds... relevant to a general topic (DHTML, Photoshop, etc. forums).
And I have an example of a small thread that can be used to test this system.. www.ozoneasylum.com/27573
Out of 32 meaningful words, there are 4 "keyphrases" (1/8 of the whole wordcount), with a relevance to the Ozone forum and an overall relevance.
The ES would, in its infancy, catch this as good and mark it as archivable under Ozone.
The plan is to let people directly edit this. Only the listed posters... I'll soon post an url where you can register, either for a mailbox @beyondwonderland.com or with your own mailbox, to be a knowledge base reviewer.
Then people who care to edit this will produce "temp" iterations of the xml document, and they will receive a copy, as well as me.
All temporary iterations will be made public for everyone to be able to re-read them.
From then on, once the KB is full enough, the "real deal" can start and the bot can get live.
You are prompted for the following password / username:
guest
expertsystems
And you can then subscribe to be a reviewer of the automod system, giving you access to all the updates,
and options to access and edit the data contained in the knowledge base.
Reviewers who are part of the "good posters" list will be the only ones accepted during the first weeks of usage of this system.
Later on, when versioning is stable and a workflow / framework has been tested thoroughly, all subscriptions
will be processed and everybody will have access as a potential reviewer.
When the AI goes live, it will also ask good posters questions relevant to their forums of expertise, to improve its knowledge.
The accounts won't be enabled immediately though, as lots of things are still in the works behind the scenes: a day or two are required for me
to set up the whole set of options (kb editing with versioning, kb drafts publication, etc.).
Versioning will allow keeping, comparing, merging and publishing a lot of versions of the kb, while using only one.
And it needs other reviewers than me, otherwise the AI will learn its basics from me - only - which is bad in any event.
It is already able to detect threads for these topics based on keywords. Adding good posters and bad keywords should definitely lead to the expected result, i.e. a good filter.
80-90% of posts archived at each shot.
Still, it is quite slow for the moment. It does a whole lot of computations, 10-20 seconds for 76 pages against one forum's keywords, when run as a batch script (what I do here is run it as a batch, and cat the output to an html file). Could improve this by using other sorting / filtering methods maybe,
or finer memory management. Dunno. Can surely save another 20% of execution time easily.
All in all, it should be made for 100 pages shots. It would take 5 minutes to safely archive 80-90 threads out of 100. Run it twice a day, and the sink is empty within a couple of weeks.
Of course, the knowledge base has to be enhanced AND automated: when the bot doubts, as I said, it should ask.
Well, assuming it is a serious question, a recommended thread is recommended for archival in the corresponding forum. >= 50% relevance to that forum.
For the moment, this is only the keyword search. Averaging this with good posters count * rating per forum will make it even more accurate.
The post count will count as a small bonus, 10% max.
It seems this doesn't take into account the original forum it was posted in. I think that will be a key piece of information for archiving.
I only took a quick browse through, but there seems to be a lot of overlap (same thread recommended for archival in multiple archives), and a lot that are recommended for archival in inappropriate areas (a POSER thread presumably from the 3D forum in the photoshop archive, several server-side questions in the Ozone, a server-side in the Photoshop, etc).
Forgive me if this has been addressed already, I've only skimmed the latest threads on this subject - just an observation.
Thank you for bringing this up.
The original forum was not shown in the sink thread, so I could not analyse it. Anyway, I think some threads from this or that forum are relevant to another forum at times, but...
You're right in that human mods do move threads to the right place, so having and using this info would make things a lot easier.
This information belongs to TP for the time being, and I conceived my thing without taking it into account because as the sink stands, it's not available.
The rating overlap is normal, since some threads are relevant to different topics at the same time... server-side threads are relevant to coding, but not to dhtml.
So basically, for the moment, the engine guesses where it should put relevant threads based on the highest rating.
Therefore, threads that overlap belong more to one forum than another.
And only scores higher than 50% are meaningful to the engine.
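As a rough sketch of that selection step (the function name and the sample ratings are hypothetical, and "higher than 50%" is read here as strictly above 50):

```python
def pick_archive(ratings, threshold=50.0):
    """Return the forum with the highest rating, or None.

    `ratings` maps a forum name to a relevance percentage. Ratings at or
    below `threshold` are ignored, so a thread that overlaps several
    forums still ends up in exactly one archive, or in none.
    """
    meaningful = {forum: r for forum, r in ratings.items() if r > threshold}
    if not meaningful:
        return None
    return max(meaningful, key=meaningful.get)

# A server-side thread that also scores on DHTML keywords.
overlapping = {"Server-Side": 72.0, "DHTML/Javascript": 55.0, "Photoshop": 10.0}
choice = pick_archive(overlapping)
```

In this made-up example the thread overlaps two forums, yet `choice` is "Server-Side" because only the highest rating decides.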
When you see things in that light, and you know the engine would only process according to the highest rating, and only ratings above 50%, it makes much more sense and seems to get it right every time.
Give it a second check with this info and let me know... (not that your advice is not welcome: it's actually the kind of fine-tuning pointers I need from now on - so really, let me know)
Attempt at laying out the rules:
A = global goodkeywords rate per page
B = global badkeywords rate per page
C = global goodposters rate per page
D = forum specific goodkeywords rate per page
E = forum specific badkeywords rate per page
F = forum specific goodposters rate per page
G = Archivability rate = ((A-B) + C) / 2 + 10 IF I is true
H = Relevance to a specific forum rate = (((D-E) + F) / 2)
I = Postcount >= 30
J = If page archivability >= 50 archive thread in archive defined by max(H)
Yeaaah... that's about it. These aren't the comprehensive details of the implementation, these are the rules that match the knowledge base definition to fulfil the job.
Right now, only rule D is executed.
In the end, rule J will be called on all pages, and will recursively evaluate all the rest.
In the real world, it just works. I mean, if only getting D already tells me where most posts belong, having the whole picture and a good facts base will do a hell of a job.
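For readers who prefer code to letters, rules G to J can be transcribed more or less literally. This is a sketch under assumptions, not the engine itself: all rates are taken to be percentages from 0 to 100, the "+ 10 IF I is true" in rule G is read as a bonus added only when rule I holds, and the class and function names are invented.

```python
from dataclasses import dataclass

@dataclass
class ForumRates:
    good_keywords: float  # D
    bad_keywords: float   # E
    good_posters: float   # F

def archivability(a, b, c, post_count):
    """Rule G, with rule I (post count >= 30) granting the 10-point bonus."""
    bonus = 10 if post_count >= 30 else 0
    return ((a - b) + c) / 2 + bonus

def forum_relevance(rates):
    """Rule H for one forum."""
    return ((rates.good_keywords - rates.bad_keywords) + rates.good_posters) / 2

def decide(a, b, c, post_count, per_forum):
    """Rule J: archive only if G >= 50, in the forum with the highest H."""
    if not per_forum or archivability(a, b, c, post_count) < 50:
        return None
    return max(per_forum, key=lambda name: forum_relevance(per_forum[name]))

# Made-up rates for one page, purely to exercise the rules.
forums = {
    "DHTML/Javascript": ForumRates(70, 5, 30),
    "Photoshop": ForumRates(10, 0, 5),
}
target = decide(60, 10, 40, post_count=35, per_forum=forums)
```

With these numbers G = ((60 - 10) + 40) / 2 + 10 = 55, so the page is archived, and H is 47.5 for DHTML/Javascript against 7.5 for Photoshop. The same page with only 5 posts scores 45 and stays in the sink.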
quote:
When you see things in that light, and you know the engine would only process according to the highest rating, and only ratings above 50%, it makes much more sense and seems to get it right every time.
Give it a second check with this info and let me know... (not that your advice is not welcome: it's actually the kind of fine-tuning pointers I need from now on - so really, let me know)
Certainly, as a concept, this makes sense. With the particular threads that I did look at, it just didn't fit though. I didn't take specific info from them, but basically there were a handful of very decidedly server-side coding oriented posts I looked at (about 5 that I clicked on) that were set to be archived in the Ozone forum. While surely almost any post *could* fit in the Ozone forum, these should definitely have been in the Server Side. There was one that I clicked on that was a server-side coding question, set to archive in the Photoshop section.
Obviously this would be more helpful had I logged the threads for you to look at - will keep that in mind in the future
Now, as for the forum of origin - I just went back to check, and sure enough I cannot find that info. I seem to recall, however, that on hover, the forum of origin used to pop-up (via the title attribute), but now all it says is "from the sink".
Is this something that has changed, or is my memory faulty?
Having the original forum is invaluable, whether the system is human run or system run...
I think this is a very interesting concept overall, and very worthy of following. The human method, while more trustworthy in some aspects, is clearly not working all that great at the moment. In addition to being very subjective, it relies on people being available to do the work when required.
With 267 pages worth of threads in the sink, clearly the people have not been available
quote:With 267 pages worth of threads in the sink, clearly the people have not been available
well, i did a lot of "sinking" but it was nearly impossible to do it so often so that the sink would have become smaller. skip 2 days and nearly a whole new page will be waiting for you. we would have needed more "humans"....MANY more...and these have not been available, as you mentioned it correctly
From: The Land of one Headlight on. Insane since: May 2001
posted 05-15-2006 17:11
Probably way off base with this thought... but here goes anyway.
Under 'goodkeywords' how about something like "Please archive this" or some similar phrase.
For example let's say I'm following a thread in 'Server-Side Scripting - Oh my!' and I'm finding it pretty informative and want to make sure it gets archived.
Could I simply post "Please archive this" and then when the "mod-bot"
goes about its business it knows for sure to archive that particular thread?
Workable? Probably not but thought I'd throw it out anyway.
Well, such a system already exists, I mean, preservation words "can" be added, and this has been here for a long time.
The fact is: people don't use it as much as they should.
I am using good keywords to really evaluate the relevance to a forum. Actually, the whole concept took into account the fact that a thread can be posted in a given forum
but be relevant to another (if I post a thread in dhtml regarding Java, it would be better in the s-side archive).
So yes, your idea could work, but in reality, it didn't. Probably didn't catch on.
My proposal has this advantage: once it is finely tuned, it does 90% of the archival alone and when it finds threads with an ambiguous classification, asks for more keywords or more info
about a poster to be able to better classify in the future.
This would leave the following tasks to real world mods: moderating, i.e. closing threads and moving relevant threads to the archive where they belong.
Furthermore, the bot would "actively" ask for the tiny-winy bit of advice it needs when it needs it, and only then.
Of course, if we consider threads already "are" in the forum where they should be archived, my duty becomes a lot smaller, but I think it would be wrong... the more I think about it,
the less it makes sense: why not automatically archive EVERYTHING to the relevant archive then? Because some threads ARE ambiguous and not everything is interesting.
Better make the engine really "guess" the relevance instead of forcing it a bit.
------------------------------------
The goal here is making it clever enough to take over 90% of the human mods' duty, and it can be done if we are sure it only archives what clearly belongs to one archive or another.
Spotting threads that are poorly classified and reporting them to me is a great thing to do: they "show" the keywords I need to balance in plain English.
For instance, I have found that the keyword "new" is very good for identifying threads worthy of archival, but terrible when it comes to identifying the forum
of a thread, because we have "new" anything everyday.
Useless keywords, poorly chosen, also tend to reduce the whole relevance score of a given forum: spotting them immediately improves relevance selection for a forum.
So, three things:
1) I really think the mod-bot should be as intelligent as possible, as opposed to systematic, it should "guess", and the information of originating forum could be useless, or too much of a constraint.
2) A sticky label saying "archive this" would not do the job, as it would still require human mods to select threads to archive.
3) Please report threads that don't seem well classified in the above links.
And feel free to question what already exists: since I have something in my hands, it's easy to make and quickly test assumptions and hypotheses, now.
Quick stats regarding the current version. I think I can double the current speed of the app, but...
- atm, 1/3rd of a second to process one thread against a forum's keywords
- an overall high score when processing the whole set of 70 sample threads against all forums, almost everything gets associated to a forum, and many things are relevant.
- same time required to process all threads against posters
12 archives vs 17 forums.... so originating forum is not relevant, DL, mods already "should" make choices regardless of the origin of a thread.
And I can now make a rough estimate of the global processing time for the whole sink as it stands. Funny, if nothing else:
(267 * 50 * 1/3 * 12) / 3600 = 14.83~ hours to run through the whole sink.
(would be better running it more regularly on chunks of 5 sink pages for instance, would only take 30 minutes - 250 threads or so, 220 would be processed - twice a day means 440 threads archived a day).
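The estimate above is easy to replay for other sink sizes. The little helper below just restates the same arithmetic; the function and parameter names are hypothetical, and the defaults are the figures quoted in this post (50 threads per sink page, 1/3 of a second per thread per archive, 12 archives).

```python
def sink_hours(pages, threads_per_page=50, seconds_per_pass=1 / 3, archives=12):
    """Hours needed to score every thread in `pages` sink pages
    against every archive, one pass per (thread, archive) pair."""
    total_seconds = pages * threads_per_page * seconds_per_pass * archives
    return total_seconds / 3600

whole_sink = sink_hours(267)
print(f"{whole_sink:.2f} hours for the whole sink")
```

This prints 14.83 hours for the 267 pages, matching the figure above. Doubling the speed of the app, as hoped, would simply halve `seconds_per_pass`.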
As a single run takes some time, I will only seldom give html updates from now on, only when they are really interesting and bring something brand new.
4 minutes and something to process 100 threads, identified something like 34 threads.
I am currently calibrating the "posters" part which, when added, should add 20 threads to the score, more or less.
Ok, the mod-bot will go live really really soon: details of its integration into the Asylum are being discussed in the mad-sci forum.
This said, at the moment, it is very good at spotting threads from the s-side and dhtml forums, just like... me.
Because it has been built and educated by... me. As I said, as an AI, it has a point of view, and this point of view should never stop evolving.
So, the system is built to "know" when it has difficulties classifying threads, but it should be able to ask someone what to do
with those threads, and why.
And to be really accurate, it should receive input from various Asylum users, not just one.
Ideally, mods, and ideally, those who are listed as posters in the Knowledge base.
So, this needs human beings and needs to be able to contact those human beings.
SMS, email, IM, your pick, but please, if you would like to help it learn and progress in the future, let me know.
Either drop me an email @mauro@beyondwonderlandDOTc-o_m or a post right here.
I would like to reach 85% threads fetched at each shot.
2.6 seconds per thread is alright but can be improved.
A few threads are not archived correctly (spotted maybe 4 of them), but in all these cases, the second choice is the right one (meaning the kb just needs a finer balance for some cases).
Will soon release the associated web service, but first I have to improve / optimize this foundation.
(it needs to get past a certain performance threshold to really go live, this will ensure it improves itself over time..)
A couple more open questions, it's fine tuning time <insert sly smile here>
Can you think of "Asylum events" and help me list them? Photoshop pong, twenty liners, but also big sig, repeat performance and?...
Also, one forum I had never really paid attention to is the Photography forum: it's full of beautiful work!!!
Where should I teach automod to classify these, since there is no Photography archive? Miscellany or Photoshop or?...
And for that forum, I'll add the concerned mods to the "good posters" list: Shiiizam, krets... who am I leaving out?
Thanks in advance.
Tyberius Prime
Maniac (V) Mad Scientist with Finglongers
From: Germany Insane since: Sep 2001
posted 06-08-2006 20:49
photography: go by 'still valid image links' - and there'll be an archive for each forum, I guess.
Events:
PSPong - again, check for matches where the pictures are still available.
signature contest
repeat performance
the photography challenges
don't forget those 'we are telling a big story, post by post' threads that pop up around christmas
birthday congratulations
20 liners