Track statistics on the openSUSE staging process to gain feedback on changes


Jimmy Berry
I am looking to provide a variety of statistics and metrics relating to the
staging workflow on OBS in order to see the impact of automation and to tune
the tools. I have spent some time setting up a local OBS with the obs_factory
engine, looking through the database structure, and reviewing tools designed
to present metrics. My understanding from @hennevogel is that InfluxData is
planned to be used to provide this type of information, so it makes sense to
get a feel for the existing plans and how this may fit into them.

The biggest hurdle seems to be creating timeseries information from event
data by walking the events backwards from a known state (i.e. the current one)
to find states of interest. At some point whatever solution is built will need
access to the relevant data to do some aggregation and store intermediate
results that can then be drawn on for presentation. Alternatively, I had some
success using sub-selects, but that is likely not the most performant way
forward.
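
As a concrete example of the sub-select approach, something like the following
would compute the time from a request's creation to its first staging review.
The table and column names are my rough guesses at the OBS schema rather than
anything verified, so treat it as a sketch:

TIME_TO_FIRST_STAGING_SQL = """
SELECT r.number,
       TIMESTAMPDIFF(HOUR, r.created_at,
         (SELECT MIN(rev.created_at)
          FROM reviews rev
          WHERE rev.bs_request_id = r.id
            AND rev.by_project LIKE 'openSUSE:Factory:Staging:%'))
         AS hours_to_first_staging
FROM bs_requests r
WHERE r.created_at > '2017-01-01'
"""

def time_to_first_staging(cursor):
    # cursor: any DB-API cursor with read access to the OBS database
    cursor.execute(TIME_TO_FIRST_STAGING_SQL)
    return cursor.fetchall()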

Given the design of the submit request/staging workflow, where no events are
recorded after a request is declined, it is not possible to determine when an
obsolete request was unstaged. The overall state of stagings cannot be
accurately determined when such a workflow has occurred. It will likely also
be difficult or impossible to determine when a build completed, re-entered
building due to manual rebuilds or a re-freeze, or failed/passed testing. The
dashboard [1] already provides this information, but re-creating a history in
aggregate form is very difficult if not nearly impossible. As such it likely
makes sense to create a new polling job that stores facets of the information
collected by the dashboard in a timeseries database.
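
A rough sketch of what I have in mind for such a polling job, assuming the
dashboard data can be fetched as JSON and that InfluxDB ends up being the
timeseries store (the endpoint path and the JSON field names here are
assumptions):

import time

import requests
from influxdb import InfluxDBClient

DASHBOARD = 'https://build.opensuse.org/project/staging_projects/openSUSE:Factory.json'

def poll_stagings(influx):
    now = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
    points = []
    for staging in requests.get(DASHBOARD).json():
        points.append({
            'measurement': 'staging_state',
            'tags': {'staging': staging['name']},
            'time': now,
            'fields': {
                'overall_state': staging['overall_state'],
                'request_count': len(staging.get('selected_requests', [])),
            },
        })
    influx.write_points(points)

if __name__ == '__main__':
    # run from cron (or a loop) so history accumulates over time
    poll_stagings(InfluxDBClient(host='localhost', database='staging_metrics'))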

For the rest of the information, tied specifically to individual requests, it
seems possible to figure everything out from a handful of tables in the OBS
database. Alternatively, I can write something that polls and scrapes data
using the APIs into local storage, but that seems like unnecessary extra work.
Is it feasible for me to be granted read access to the few tables of interest,
or to have Influx or a similar tool set up with access, so that I can begin
setting up some metrics of interest?

A bit more on the topic of generating timeseries data. As an example, consider
presenting a graph of the request backlog against Factory over time, the time
until first staging, or the number of empty stagings over time. The event
information is collected in the form of reviews. Any time a request is staged
or re-staged, reviews for the particular staging project are added or accepted.
Accepting or declining the request unfortunately stops future changes from
being recorded, which means the staging tools cannot indicate when the request
is removed from a staging, but for simplicity it can be assumed complete after
one of those states.

Assuming one has a known state from which to start, it should be possible to
walk the event history and annotate states of interest. The current state can
be queried to determine which requests are currently in a staging, which
provides a starting point. Assuming the script stops whenever it encounters an
existing annotation (or via some other mechanism), the job can be run on a
regular basis to annotate the desired information.
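
Roughly, I picture the annotation job looking something like this; the event
shape is invented purely to illustrate the idea:

def annotate_staging_times(events, current_requests, already_annotated):
    # events: review/request history, newest first
    # current_requests: requests currently in a staging (the known state)
    # already_annotated: request ids handled by a previous run, so the job
    #   can stop early and be re-run on a schedule
    staged_at = {}
    for event in events:
        if event['request'] in already_annotated:
            break  # everything older was covered by a previous run
        if (event['type'] == 'review_added' and
                event['by_project'].startswith('openSUSE:Factory:Staging:') and
                event['request'] in current_requests):
            # walking newest to oldest, the last match kept is the earliest staging
            staged_at[event['request']] = event['when']
    return staged_at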

The polling technique could be used to avoid all this walking of the event
history, which is simpler, but it cannot backfill data. As such it is
preferable to walk the event history where possible.

I look forward to your thoughts.


[1] https://build.opensuse.org/project/staging_projects/openSUSE:Factory

--
Jimmy


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Adrian Schröter
On Wednesday, 1 March 2017, 01:48:19 CET, Jimmy Berry wrote:

> I am looking to provide a variety of statistics and metrics relating to the
> staging workflow on OBS in order to see the impact of automation and tune the
> tools. I have spent some time setting up a local OBS, obs_factory engine,
> looking through the database structure, and reviewing tools designed to
> present metrics. My understanding from @hennevogel is that Influx Data is
> planned to be used to provide this type of information and it makes sense to
> get a feel for the existing plans and how this may fit into that.
>
> The biggest hurdle seems to be trying to create timeseries information from
> event data by walking the events backwards from the known state (ie current)
> to find states of interest. At some point whatever solution is built will need
> access to the relevant data to do some aggregation and store intermediate
> results that can then be drawn on for presentation. Alternatively, I had some
> success using sub-selects, but that is likely not the most performant way
> forward.
>
> Given the design of the submit request/staging workflow, that no events are
> recorded after a request is declined, it is not possible to determine when an
> obsolete request was unstaged. The overall state of stagings cannot be
> accurately determined when such a workflow has occurred. It will likely also
> be difficult/impossible to determine when a build state complete, re-entered
> building due to manual rebuilds or re-freeze, or failed/passed testing. The
> dashboard [1] already provides this information, but re-creating a history in
> an aggregate form is likely near impossible if not very difficult. As such it
> likely makes sense to create a new polling job that stores facets of the
> information collected by the dashboard in a timeseries database.
>
> For the rest of the information tied specifically to individual requests it seems
> possible to figure everything out from a handful of tables in the OBS
> database.

We did this for the maintenance statistics:

 osc api /statistics/maintenance_statistics/openSUSE:Maintenance:6433

It also handles assignments from a group to a user to calculate how
long the group took to review.
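
If you want to consume it from a script, something along these lines should
work; the element handling below is just illustrative, I am not spelling out
the exact XML structure here:

import subprocess
import xml.etree.ElementTree as ET

def maintenance_statistics(incident):
    # 'osc api' does the authenticated GET against the OBS API for us
    xml = subprocess.check_output(
        ['osc', 'api', '/statistics/maintenance_statistics/' + incident])
    root = ET.fromstring(xml)
    # collect whatever timestamped entries are present, e.g. to compute
    # how long each review took
    return [(child.tag, dict(child.attrib)) for child in root]

print(maintenance_statistics('openSUSE:Maintenance:6433'))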

> Alternatively, I can write something that polls and scrapes data
> using the APIs into local storage, but that seems like unnecessary extra work.
> Is it feasible for me to be granted read access to the few tables of interest
> or the Influx/similar tool setup with access so that I can begin setting up
> some metrics of interest?

Your code is very much isolated, so I can't really judge it.
Just one hint: your code runs in an environment and on a host which are
critical for our security, and also for all people using repositories from it.

So another service means another potential weakness; it would be good if
it did not need to run on our main server, at least.

> A bit more on the topic of generating timeseries data. As an example, consider
> presenting a graph of the request backlog against Factory over time, the time
> until first staging, or the number of empty stagings over time. The event
> information is collected in the form of reviews. Anytime the request is staged
> or re-staged reviews for the particular staging project are added or accepted.
> Accepting or declining the request unfortunately stops future changes from
> being recorded which means the staging tools cannot indicate when the request
> is removed from a staging, but for simplicity it can be assumed complete after
> one of those states.

I think most of this is generic and not specific to staging projects.
So it would be good to extend the generic request system with statistics IMHO.

> Assuming one has a known state from which to start it should be possible to
> walk the event history and annotate states of interest. Given that the current
> state can be queried which indicates what requests are currently in a staging
> that provides a starting point. Assuming the script can be stopped anytime it
> encounters an annotation again (or other mechanism) the job can be run on some
> timely basis to annotate the desired information.
>
> The polling technique could be used to avoid all this walking the event
> history which is simpler, but cannot backfill the data. As such it is
> preferable to walk the event tree where possible.
>
> I look forward to your thoughts.
>
>
> [1] https://build.opensuse.org/project/staging_projects/openSUSE:Factory
>
>


--

Adrian Schroeter
email: [hidden email]

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Maxfeldstraße 5
90409 Nürnberg
Germany




Re: Track statistics on the openSUSE staging process to gain feedback on changes

Jimmy Berry
On Wednesday, March 1, 2017 9:54:14 AM CST Adrian Schröter wrote:

> On Wednesday, 1 March 2017, 01:48:19 CET, Jimmy Berry wrote:
> > I am looking to provide a variety of statistics and metrics relating to
> > the
> > staging workflow on OBS in order to see the impact of automation and tune
> > the tools. I have spent some time setting up a local OBS, obs_factory
> > engine, looking through the database structure, and reviewing tools
> > designed to present metrics. My understanding from @hennevogel is that
> > Influx Data is planned to be used to provide this type of information and
> > it makes sense to get a feel for the existing plans and how this may fit
> > into that.
> >
> > The biggest hurdle seems to be trying to create timeseries information
> > from
> > event data by walking the events backwards from the known state (ie
> > current) to find states of interest. At some point whatever solution is
> > built will need access to the relevant data to do some aggregation and
> > store intermediate results that can then be drawn on for presentation.
> > Alternatively, I had some success using sub-selects, but that is likely
> > not the most performant way forward.
> >
> > Given the design of the submit request/staging workflow, that no events
> > are
> > recorded after a request is declined, it is not possible to determine when
> > an obsolete request was unstaged. The overall state of stagings cannot be
> > accurately determined when such a workflow has occurred. It will likely
> > also be difficult/impossible to determine when a build state complete,
> > re-entered building due to manual rebuilds or re-freeze, or failed/passed
> > testing. The dashboard [1] already provides this information, but
> > re-creating a history in an aggregate form is likely near impossible if
> > not very difficult. As such it likely makes sense to create a new polling
> > job that stores facets of the information collected by the dashboard in a
> > timeseries database.
> >
> > For the rest of the information tied specifically to individual requests it
> > seems possible to figure everything out from a handful of tables in the
> > OBS database.
>
> We did this for the maintenance statistics:
>
>  osc api /statistics/maintenance_statistics/openSUSE:Maintenance:6433
>
> it also handles assignments from a group to a user to calculate how
> long the group took to review.

This looks somewhat similar, although the example provided does not have any
reviews, but based on your comment I can assume what that might look like.
What tool consumes this API? Presumably the tool then has to scrape all the
information from this API in a similar manner to what I was trying to avoid.

>
> > Alternatively, I can write something that polls and scrapes data
> > using the APIs into local storage, but that seems like unnecessary extra
> > work. Is it feasible for me to be granted read access to the few tables
> > of interest or the Influx/similar tool setup with access so that I can
> > begin setting up some metrics of interest?
>
> Your code is very much isolated, so I can't really judge about it.
> Just one hint, your code runs in an environemnt and host which is critical
> for our security. And also for all people using repositories from it.
>
> So another service means another potential weakness, it would be good if
> that does not need to run on our main server at least.

Any code I mention at this point is running on my local machine. If read
access to the source tables is not possible, the code can be hosted entirely
separately from OBS.

>
> > A bit more on the topic of generating timeseries data. As an example,
> > consider presenting a graph of the request backlog against Factory over
> > time, the time until first staging, or the number of empty stagings over
> > time. The event information is collected in the form of reviews. Anytime
> > the request is staged or re-staged reviews for the particular staging
> > project are added or accepted. Accepting or declining the request
> > unfortunately stops future changes from being recorded which means the
> > staging tools cannot indicate when the request is removed from a staging,
> > but for simplicity it can be assumed complete after one of those states.
>
> I think most of this is generic and not specific to staging projects.
> So it would be good to extend the generic request system with statistics
> IMHO.

That was my understanding as well, which is why I am not proceeding any
further until I get an understanding of any existing plans surrounding OBS
metrics.

> > Assuming one has a known state from which to start it should be possible
> > to
> > walk the event history and annotate states of interest. Given that the
> > current state can be queried which indicates what requests are currently
> > in a staging that provides a starting point. Assuming the script can be
> > stopped anytime it encounters an annotation again (or other mechanism)
> > the job can be run on some timely basis to annotate the desired
> > information.
> >
> > The polling technique could be used to avoid all this walking the event
> > history which is simpler, but cannot backfill the data. As such it is
> > preferable to walk the event tree where possible.
> >
> > I look forward to your thoughts.
> >
> >
> > [1] https://build.opensuse.org/project/staging_projects/openSUSE:Factory


--
Jimmy


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Henne Vogelsang
Hey,

On 01.03.2017 15:43, Jimmy Berry wrote:

> That was my understanding as well, which is why I am not proceeding any
> further until I get an understanding of any existing plans surrounding OBS
> metrics.

Apart from the tool we want to use to store time series data (influxdb),
the tool we want to use to send data there (influxer) and the tool we
want to use to show metrics (grafana), we don't have much of a plan. I
guess it's up to you to figure out how you can make sense of this
for your use case :-)

If you need to record some extra time series data for your staging
workflow engine you can do that, as your engine always runs in the
context of the OBS instance it's mounted on top of. So it will also have
access to the influxdb instance etc.

The same is BTW true for access to the SQL database; your engine has the
same access as the Rails app it's mounted from.

I hope that helps,

Henne

--
Henne Vogelsang
http://www.opensuse.org
Everybody has a plan, until they get hit.
        - Mike Tyson


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Jimmy Berry
On Wednesday, March 1, 2017 5:44:58 PM CST Henne Vogelsang wrote:

> Hey,
>
> On 01.03.2017 15:43, Jimmy Berry wrote:
> > That was my understanding as well, which is why I am not proceeding any
> > further until I get an understanding of any existing plans surrounding OBS
> > metrics.
>
> Apart from the tool we want to use to store time series data (influxdb),
> the tool we want to use to send data there (influxer) and the tool we
> want to use to show metrics (grafana) we don't have much of a plan. I
> guess it's up to you to figure out how you can make sense out of this
> for your use case :-)
>
> If you need to record some extra time series data for your staging
> workflow engine you can do that, as your engine always runs in the
> context of the OBS instance it's mounted on top of. So it will also have
> access to the influxdb instance etc.
>
> Same is BTW true for access to the SQL database, your engine has the
> same access as the Rails app it's mounted from.

As I would expect. I was looking for access to develop against, since it is
difficult to recreate an accurate facsimile of the OBS instance and nearly
impossible to simulate the variety of workflows through which requests have
gone. It would also be good to see if pulling certain metrics directly from
the source tables is performant enough.

When I worked on the tooling used by the development sites of other open
source projects, it was possible to get a sanitized database dump or a staging
environment that had both a clone of production and read access to
production. These resources were invaluable for validating data migrations and
tools before deployment. Without such access it was impossible to predict all
the ways in which data can be inconsistent, corrupted, or contain odd
edge-cases.

Given that storing additional information will not cover all the desired
metrics, it is likely more effective to just record timeseries data. I'll have
to look at the tool in question, but I would expect a background job to run
that periodically writes a record to the timeseries database. Such a
background job would end up storing data outside the scope of
obs_factory. On that note, are the various Influx software pieces set up and
hosted, or has nothing been done beyond selecting the desired tools?

Short of database read access, where I can potentially run some of these
tools myself and figure out how to set things up, I am not really sure how I
can proceed. Either I spend my time scraping the data via the APIs or writing
scripts to generate data to develop against, both of which seem like
unnecessary extra effort given the real deal already exists.

I am happy to put in effort to make this happen, but I'd rather not beat
around the bush recreating data that may or may not properly represent the
real data. Even if I can somehow put everything in the obs_factory engine,
that does not help me develop it.

>
> I hope that helps,
>
> Henne

Thanks,

--
Jimmy


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Henne Vogelsang
Hey,

On 01.03.2017 22:23, Jimmy Berry wrote:

> On Wednesday, March 1, 2017 5:44:58 PM CST Henne Vogelsang wrote:
>>
>> If you need to record some extra time series data for your staging
>> workflow engine you can do that, as your engine always runs in the
>> context of the OBS instance it's mounted on top of. So it will also have
>> access to the influxdb instance etc.
>>
>> Same is BTW true for access to the SQL database, your engine has the
>> same access as the Rails app it's mounted from.
>
> As I would expect. I was looking for access to develop against since it is
> difficult to recreate an accurate facsimile of the OBS instance and near
> impossible to simulate the variety of workflows through which requests have
> gone.

I very much doubt that. We have an extensive test suite that is already
'simulating' all major workflows, including requests of the various
kinds. For creating data you can use the tooling that exists, like our
data factories[1]. If you need help with this do not hesitate to contact
me :-)

> It would also be good to see if pulling certain metrics directly from
> the source tables is performant enough.

Aren't you getting ahead of yourself? Why don't you first figure out
what you want to do and how and then worry about performance of the
production DB :-)

> When I worked on the tooling used by the development site for other open
> source projects it was possible to get a sanitized database dump or staging
> environment that had access to both a clone of production and read access to
> production. These resources were invaluable for validating data migrations and
> tools before deployment.

This is a good practice that we also follow. But what has this to do
with your tool? You are neither migrating nor deploying...

> Without such access it was impossible to predict all
> the ways in which data can be either inconsistent, corrupted, or odd edge-
> cases.

Again you are getting ahead of yourself, I think. We have a very well
documented data structure. If something is inconsistent, corrupted, or an
odd edge case, it is by our definition broken. If you come across such a
case you should tell us, or better yet fix that case :-)

> Given that storing additional information will not cover all the desired
> metrics it is likely more effective to just record timeseries data. I'll have
> to look at the tool in question, but I would expect a background job to run
> that periodically writes a record to the timeseries database.

No, the contrary. Every time something happens, a data point gets
recorded into a data set in the time series DB. So let's say a request
is closed. You would record the fact, the time, and add some tags describing
the resolution (accepted, declined), the user who did this, etc. Once
you have this data in the time series DB you can query and display it :-)
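
For example, something like this; nothing is set up yet, so the measurement
and tag names are made up:

from influxdb import InfluxDBClient

client = InfluxDBClient(host='localhost', database='obs')

# one point, written at the moment the request is closed
client.write_points([{
    'measurement': 'request',
    'tags': {'event': 'closed', 'resolution': 'accepted', 'who': 'some_user'},
    'time': '2017-03-07T14:00:00Z',
    'fields': {'value': 1},
}])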

> On that note, are the various influx software pieces setup and
> hosted or has nothing been done short of selecting the desired tool?

No, nothing is done yet. Just planned, sorry.

Henne

[1]
https://github.com/openSUSE/open-build-service/tree/master/src/api/spec/factories

--
Henne Vogelsang
http://www.opensuse.org
Everybody has a plan, until they get hit.
        - Mike Tyson


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Jimmy Berry
On Tuesday, March 7, 2017 3:27:11 PM CDT Henne Vogelsang wrote:

> Hey,
>
> On 01.03.2017 22:23, Jimmy Berry wrote:
> > On Wednesday, March 1, 2017 5:44:58 PM CST Henne Vogelsang wrote:
> >> If you need to record some extra time series data for your staging
> >> workflow engine you can do that, as your engine always runs in the
> >> context of the OBS instance it's mounted on top of. So it will also have
> >> access to the influxdb instance etc.
> >>
> >> Same is BTW true for access to the SQL database, your engine has the
> >> same access as the Rails app it's mounted from.
> >
> > As I would expect. I was looking for access to develop against since it is
> > difficult to recreate an accurate facsimile of the OBS instance and near
> > impossible to simulate the variety of workflows through which requests
> > have
> > gone.
>
> I very much doubt that. We have an extensive test suite that is already
> 'simulating' all major workflows, including requests of the various
> kinds. For creating data you can use the tooling that exists, like our
> data factories[1]. If you need help with this do not hesitate to contact
> me :-)

I skimmed through the files and I did not see anything similar to the Factory
staging workflow managed by openSUSE/osc-plugin-factory. The components of
that workflow would be covered by such tests and data creation, but that is
not terribly helpful for trying to build something to extract specific
statistics. The staging workflow creates reviews when requests are staged in a
particular staging and records in which staging the request was placed.

The statistics of interest need to look for specific types of reviews related
to the staging process and the spacing between them and other events. The data
needs to be structured very specifically, like that in the real instance.
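
For instance, the kind of pairing needed could look something like this; the
field names are placeholders for whatever the real rows carry:

from datetime import datetime

def time_in_stagings(reviews):
    # reviews: the staging-related review rows for a single request
    durations = {}
    for review in reviews:
        if not review['by_project'].startswith('openSUSE:Factory:Staging:'):
            continue
        opened = review['created_at']
        closed = review['accepted_at'] or datetime.utcnow()  # still staged
        durations.setdefault(review['by_project'], []).append(closed - opened)
    return durations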

To be clear, I already wrote a few queries locally, against records I created
by hand, that extract the desired information. An example of tricky data: a
request can be staged, declined, unstaged, and then reinstated. During the
time it was declined no review changes will be recorded (i.e. the fact that it
was unstaged). This is one of the cases the tooling has to handle, and since I
can recreate it locally I have no doubt it occurs. Making sure the statistics
properly handle all the intricacies of the real data cannot easily be
simulated. Having done this sort of work on other live systems, it is nearly
impossible to predict the interesting edge-cases in real data, and it is not
particularly productive to try compared to running against the real thing.
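
Concretely, the review history for such a request might look like the
following (timestamps and staging names invented); the unstaging leaves no
trace because the request was already declined:

reviews_for_request = [
    {'by_project': 'openSUSE:Factory:Staging:A', 'state': 'new',
     'created_at': '2017-02-01T10:00:00Z'},   # staged into A
    # 10:15 request declined  -> review history frozen
    # 10:30 request unstaged  -> no review change recorded
    # 11:00 request reopened and staged again:
    {'by_project': 'openSUSE:Factory:Staging:B', 'state': 'new',
     'created_at': '2017-02-01T11:00:00Z'},
]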

>
> > It would also be good to see if pulling certain metrics directly from
> > the source tables is performant enough.
>
> Aren't you getting ahead of yourself? Why don't you first figure out
> what you want to do and how and then worry about performance of the
> production DB :-)

As noted in the original post, I have quite a bit of detail on what I want to
do and a few possible approaches, the choice among which depends on their
performance. If the performance of the simplest approach is sufficient, why
spend extra time on a more complex one?

If others have time to get more directly involved I can document the
specifics of what I have already done more publicly, but otherwise I'll save
that for when I have a final solution.

>
> > When I worked on the tooling used by the development site for other open
> > source projects it was possible to get a sanitized database dump or
> > staging
> > environment that had access to both a clone of production and read access
> > to production. These resources were invaluable for validating data
> > migrations and tools before deployment.
>
> This is a good practice that we also follow. But what has this to do
> with your tool? You are neither migrating nor deploying...

Looking for the edge-cases in the data, especially where requests were
operated on while in a declined state (as noted above).

>
>  > Without such access it was impossible to predict all
> >
> > the ways in which data can be either inconsistent, corrupted, or odd edge-
> > cases.
>
> Again you are getting ahead of yourself I think. We have a very well
> documented data structure. If something is inconsistent, corrupted or an
> odd edge case it is by our definition broken. If you come across such a
> case you should tell us or better yet fix that case :-)

I agree the data structure is documented. As noted I already wrote queries for
some of the desired information. Without running queries and scripts against
the real data I cannot find edge-cases.

>
> > Given that storing additional information will not cover all the desired
> > metrics it is likely more effective to just record timeseries data. I'll
> > have to look at the tool in question, but I would expect a background job
> > to run that periodically writes a record to the timeseries database.
>
> No, the contrary. Every time something happens a data point get's
> recorded into a data set in the time series DB. So let's say a request
> is closed. You would record the fact, the time, add some tags describing
> the resolution (accepted, decline) or the user who did this etc. Once
> you have this data in the time series DB you can query and display it :-)

I contrasted storing additional data (in the OBS structure) with storing
everything of interest in a timeseries database. Indeed, having the data in a
timeseries database would work, but it represents a lot of data duplication
and an entire process that, as I understand it, does not currently exist. As
such I was hoping to avoid it and pull at least a subset directly from the
existing data structure.

>
> > On that note, are the various influx software pieces setup and
> > hosted or has nothing been done short of selecting the desired tool?
>
> No nothing is done yet. Just planed, sorry.
>
> Henne
>
> [1]
> https://github.com/openSUSE/open-build-service/tree/master/src/api/spec/fact
> ories

At this point, I am not sure what is needed to move this forward. I have a
goal of specific metrics that I would like to extract and present, documented
in the original post. I have done work on a local instance to see what metrics
can be extracted from the existing data and wrote queries to do so. I have
determined what information is lacking, and it is likely best to just have a
new process for writing such timeseries data, which sounds similar to what was
planned.

There are certain trends in the metrics that I expect to be present in the
real data and would like to confirm. In fact the metrics that can be
extracted from the existing data may suffice if they demonstrate the things in
which I am interested, but I cannot tell that from running queries against
fake data.

I had hoped to avoid creating a data-scraping tool, but if it is not possible
to gain some sort of access to the data I may just do that to avoid being
blocked. Likely I'll write the data into the same structure used by OBS so
that the tool will be compatible if it is ever deployed properly.

Some of the data is generic to all requests and some is specific to openSUSE/
obs_factory and openSUSE/osc-plugin-factory. I have considered building some
additional API calls, perhaps some in obs_factory, that could expose certain
aggregate query results. That may be useful, but at the moment this project is
somewhat exploratory, in that it will only become clear what is interesting in
the data once it is explored. As such a more fluid setup that allows for
developing queries and metrics until a full picture is clear seems to make
more sense than trying to build code and have it deployed before even an
initial result can be seen.

--
Jimmy


Re: Track statistics on the openSUSE staging process to gain feedback on changes

Adrian Schröter

 

Hi Jimmy,

I have to admit that I don't understand which exact statistics you want
to get. It would be important for me to have a concrete description of
each measurement you want to make. We can then decide individually
whether we can provide these numbers.

Please create a separate document for each of them. In case it is critical
for the project please use Fate. Otherwise some wiki page or GitHub issue
might be sufficient.

Please describe what these numbers should tell us and what the basis for
these numbers should be from your POV. We can then discuss the
implementation details in a later step.

thanks
adrian

--
Adrian Schroeter
email: [hidden email]

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Maxfeldstraße 5
90409 Nürnberg
Germany